restructured and updated diffphys chapter

This commit is contained in:
NT 2022-03-06 14:44:18 +08:00
parent 754d79b061
commit 59d3d8c040
14 changed files with 196 additions and 198 deletions

View File

@ -4,7 +4,7 @@
title: Physics-based Deep Learning
author: N. Thuerey, P. Holl, M. Mueller, P. Schnell, F. Trost, K. Um
logo: resources/logo.jpg
copyright: "2021"
copyright: "2021,2022"
only_build_toc_files: true
launch_buttons:

View File

@ -22,15 +22,14 @@ parts:
chapters:
- file: diffphys.md
- file: diffphys-code-burgers.ipynb
- file: diffphys-discuss.md
- file: diffphys-code-ns.ipynb
- file: diffphys-dpvspinn.md
- caption: Complex Examples with DP
- file: diffphys-code-ns.ipynb
- caption: Differentiable Physics with NNs
chapters:
- file: diffphys-examples.md
- file: diffphys-code-sol.ipynb
- file: diffphys-control.ipynb
- file: diffphys-outlook.md
- file: diffphys-code-control.ipynb
- file: diffphys-discuss.md
- caption: Reinforcement Learning
chapters:
- file: reinflearn-intro.md

View File

@ -6,11 +6,11 @@
"source": [
"# Burgers Optimization with a Differentiable Physics Gradient\n",
"\n",
"To illustrate the process of computing gradients in a _differentiable physics_ (DP) setting, we target the same inverse problem (the reconstruction task) used for the PINN example in {doc}`physicalloss-code`. The choice of DP as a method has some immediate implications: we start with a discretized PDE, and the evolution of the system is now fully determined by the resulting numerical solver. Hence, the only real unknown is the initial state. We will still need to re-compute all the states between the initial and target state many times, just now we won't need an NN for this step. Instead, we can rely on our discretized model. \n",
"To illustrate the process of computing gradients in a _differentiable physics_ (DP) setting, we target the same inverse problem (the reconstruction task) used for the PINN example in {doc}`physicalloss-code`. The choice of DP as a method has some immediate implications: we start with a discretized PDE, and the evolution of the system is now fully determined by the resulting numerical solver. Hence, the only real unknown is the initial state. We will still need to re-compute all the states between the initial and target state many times, just now we won't need an NN for this step. Instead, we rely on the discretization of the model equations. \n",
"\n",
"Also, as we choose an initial discretization for the DP approach, the unknown initial state consists of the sampling points of the involved physical fields, and we can simply represent these unknowns as floating point variables. Hence, even for the initial state we do not need to set up an NN. Thus, our Burgers reconstruction problem reduces to a gradient-based optimization without any NN when solving it with DP. Nonetheless, it's a very good starting point to illustrate the process.\n",
"\n",
"First, we'll set up our discretized simulation. Here we can employ phiflow, as shown in the overview section on _Burgers forward simulations_. \n",
"First, we'll set up our discretized simulation. Here we employ phiflow, as shown in the overview section on _Burgers forward simulations_. \n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/diffphys-code-burgers.ipynb)\n",
"\n",
"\n",
@ -49,7 +49,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify that the fields of our simulation are now backed by TensorFlow."
"Below we verify that the fields of our simulation are now backed by TensorFlow."
]
},
{
@ -119,7 +119,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we're only constraining time step 16, we could actually omit steps 17 to 31 in this setup. They don't have any degrees of freedom and are not constrained in any way. However, for fairness regarding a comparison with the previous PINN case, we include them.\n",
"Because we're only constraining time step 16, we could actually omit steps 17 to 31 in this setup. They don't have any degrees of freedom and are not constrained in any way. However, for fairness regarding a comparison with the previous case, we include them.\n",
"\n",
"Note that we've done a lot of calculations here: first the 32 steps of our simulation, and then another 16 steps backwards from the loss. They were recorded by the gradient tape, and used to backpropagate the loss to the initial state of the simulation.\n",
"\n",
@ -178,7 +178,7 @@
"source": [
"## Optimization \n",
"\n",
"Equipped with the gradient we can run a gradient descent optimization. Below, we're using a learning rate of `LR=5`, and we're re-evaluating the loss for the updated state to track convergence. \n",
"Equipped with the gradient we now run a gradient descent optimization. Below, we're using a learning rate of `LR=5`, and we're re-evaluating the loss for the updated state to track convergence. \n",
"\n",
"In the following code block, we're additionally saving the gradients in a list called `grads`, such that we can visualize them later on. For a regular optimization we could of course discard the gradient after performing an update of the velocity.\n"
]
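A bare-bones version of this loop, continuing the sketch above (the real notebook cell additionally re-evaluates the loss after each update):

```python
LR = 5.0
grads = []                                    # kept only for later visualization
for it in range(5):
    with tf.GradientTape() as tape:
        u = velocity
        for step in range(32):
            u = burgers_step(u)
            if step == 15:                    # the 16th state is constrained
                loss = tf.reduce_sum((u - target) ** 2)
    grads.append(tape.gradient(loss, velocity))
    velocity.assign_sub(LR * grads[-1])       # plain gradient descent update
    print(f"iteration {it}, loss: {loss.numpy():.5f}")
```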
@ -226,7 +226,7 @@
"metadata": {},
"source": [
"\n",
"Now we can check well the 16th state of the simulation actually matches the target after the 5 update steps. This is what the loss measures, after all. The next graph shows the constraints (i.e. the solution we'd like to obtain) in green, and the reconstructed state after the initial state `velocity` (which we updated five times via the gradient by now) was updated 16 times by the solver. \n",
"Now we'll check well the 16th state of the simulation actually matches the target after the 5 update steps. This is what the loss measures, after all. The next graph shows the constraints (i.e. the solution we'd like to obtain) in green, and the reconstructed state after the initial state `velocity` (which we updated five times via the gradient by now) was updated 16 times by the solver. \n",
"\n"
]
},
@ -330,7 +330,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Thinking back to the PINN version from {doc}`diffphys-code-burgers`, we have a much lower error here after only 50 steps (by ca. an order of magnitude), and the runtime is also lower (roughly by a factor of 1.5 to 2). This behavior stems fro\n",
"Thinking back to the PINN version from {doc}`diffphys-code-burgers`, we have a much lower error here after only 50 steps (by ca. an order of magnitude), and the runtime is also lower (roughly by a factor of 1.5 to 2). This behavior stems from DP providing gradients for the whole solutions with all its discretization points and time steps, rather than localized updates.\n",
"\n",
"Let's plot again how well our solution at $t=0.5$ (blue) matches the constraints (green) now:"
]
@ -455,7 +455,7 @@
"source": [
"## Physics-informed vs. differentiable physics reconstruction\n",
"\n",
"Now we have both versions, the one with the PINN, and the DP version, so let's compare both reconstructions in more detail. (Note: The following cells expect that the Burgers-forward and PINN notebooks were executed in the same environment beforehand.)\n",
"Now we have both versions, the one with the PINN, and the DP version, so let's compare both reconstructions in more detail. (Note: The following cells expect that the Burgers-forward and PINN notebooks were executed in the same environment beforehand such that the `.npz` files in the `./temp' directory are available..)\n",
"\n",
"Let's first look at the solutions side by side. The code below generates an image with 3 versions, from top to bottom: the \"ground truth\" (GT) solution as given by the regular forward simulation, in the middle the PINN reconstruction, and at the bottom the differentiable physics version.\n"
]
@ -562,10 +562,10 @@
"\n",
"## Next steps\n",
"\n",
"As with the PINN version, there's variety of things that can be improved and experimented with using the code above:\n",
"As before, there's variety of things that can be improved and experimented with using the code above:\n",
"\n",
"* You can try to adjust the training parameters to further improve the reconstruction.\n",
"* As for the PINN case, you can activate a different optimizer, and observe the changing (not necessarily improved) convergence behavior.\n",
"* Activate a different optimizer, and observe the changing (not necessarily improved) convergence behavior.\n",
"* Vary the number of steps, or the resolution of the simulation and reconstruction.\n"
]
}

View File

@ -15,7 +15,7 @@
"Below we will run a very challenging test case as a representative of these inverse problems: we will aim for computing a high dimensional control function that exerts forces over the full course of an incompressible fluid simulation in order to reach a desired goal state for a passively advected marker in the fluid. This means we only have very indirect constraints to be fulfilled (a single state at the end of a sequence), and a large number of degrees of freedom (the control force function is a space-time function with the same degrees of freedom as the flow field itself).\n",
"\n",
"The _long-term_ nature of the control is one of the aspects which makes this a tough inverse problem: any changes to the state of the physical system can lead to large change later on in time, and hence a controller needs to anticipate how the system will behave when it is influenced. This means an NN also needs to learn how the underlying physics evolve and change, and this is exactly where the gradients from the DP training come in to guide the learning task towards solution that can reach the goal.\n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/diffphys-control.ipynb)\n"
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/diffphys-code-control.ipynb)\n"
]
},
{
@ -745,7 +745,7 @@
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "diffphys-control.ipynb",
"name": "diffphys-code-control.ipynb",
"provenance": []
},
"kernelspec": {

View File

@ -5,7 +5,7 @@
"source": [
"# Differentiable Fluid Simulations\n",
"\n",
"Next, we'll target a more complex example with the Navier-Stokes equations as physical model. In line with {doc}`overview-ns-forw`, we'll target a 2D case.\n",
"We now target a more complex example with the Navier-Stokes equations as physical model. In line with {doc}`overview-ns-forw`, we'll target a 2D case.\n",
"\n",
"As optimization objective we'll consider a more difficult variant of the previous Burgers example: the state of the observed density $s$ should match a given target after $n=20$ steps of simulation. In contrast to before, the observed quantity in the form of the marker field $s$ cannot be changed in any way. Only the initial state of the velocity $\\mathbf{u}_0$ at $t=0$ can be modified. This gives us a split between observable quantities for the loss formulation and quantities that we can interact with during the optimization (or later on via NNs).\n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/diffphys-code-ns.ipynb)\n",
@ -615,7 +615,7 @@
"This example illustrated how the differentiable physics approach can easily be extended towards significantly more \n",
"complex PDEs. Above, we've optimized for a mini-batch of 20 steps of a full Navier-Stokes solver.\n",
"\n",
"This is a powerful basis to bring NNs into the picture. As you might have noticed, our degrees of freedom were still a regular grid, and we've jointly solved a single inverse problem. There were three cases to solve as a mini-batch, of course, but nonetheless the setup still represents a direct optimization. Thus, in line with the PINN example of {doc}`physicalloss-code` we've not really dealt with a _machine learning_ task here. However, that will be the topic of the next chapters.\n"
"This is a powerful basis to bring NNs into the picture. As you might have noticed, our degrees of freedom were still a regular grid, and we've jointly solved a single inverse problem. There were three cases to solve as a mini-batch, of course, but nonetheless the setup still represents a direct optimization. Thus, in line with the PINN example of {doc}`physicalloss-code` we've not really dealt with a _machine learning_ task here. However, DP training allows for a range of flexible compinations with NNs that will be the topic of the next chapters.\n"
],
"metadata": {
"id": "jHdQureE_trx"

View File

@ -8,7 +8,7 @@
"source": [
"# Reducing Numerical Errors with Deep Learning\n",
"\n",
"First, we'll target numerical errors that arise in the discretization of a continuous PDE $\\mathcal P^*$, i.e. when we formulate $\\mathcal P$. This approach will demonstrate that, despite the lack of closed-form descriptions, discretization errors often are functions with regular and repeating structures and, thus, can be learned by a neural network. Once the network is trained, it can be evaluated locally to improve the solution of a PDE-solver, i.e., to reduce its numerical error. The resulting method is a hybrid one: it will always run (a coarse) PDE solver, and then improve it at runtime with corrections inferred by an NN.\n",
"In this example we will target numerical errors that arise in the discretization of a continuous PDE $\\mathcal P^*$, i.e. when we formulate $\\mathcal P$. This approach will demonstrate that, despite the lack of closed-form descriptions, discretization errors often are functions with regular and repeating structures and, thus, can be learned by a neural network. Once the network is trained, it can be evaluated locally to improve the solution of a PDE-solver, i.e., to reduce its numerical error. The resulting method is a hybrid one: it will always run (a coarse) PDE solver, and then improve it at runtime with corrections inferred by an NN.\n",
"\n",
" \n",
"Pretty much all numerical methods contain some form of iterative process: repeated updates over time for explicit solvers, or within a single update step for implicit solvers. \n",
@ -19,7 +19,7 @@
"\n",
"## Problem formulation\n",
"\n",
"In the context of reducing errors, it's crucial to have a _differentiable physics solver_, so that the learning process can take the reaction of the solver into account. This interaction is not possible with supervised learning or PINN training. Even small inference errors of a supervised NN can accumulate over time, and lead to a data distribution that differs from the distribution of the pre-computed data. This distribution shift can lead to sub-optimal results, or even cause blow-ups of the solver.\n",
"In the context of reducing errors, it's crucial to have a _differentiable physics solver_, so that the learning process can take the reaction of the solver into account. This interaction is not possible with supervised learning or PINN training. Even small inference errors of a supervised NN accumulate over time, and lead to a data distribution that differs from the distribution of the pre-computed data. This distribution shift leads to sub-optimal results, or even cause blow-ups of the solver.\n",
"\n",
"In order to learn the error function, we'll consider two different discretizations of the same PDE $\\mathcal P^*$: \n",
"a _reference_ version, which we assume to be accurate, with a discretized version \n",
@ -100,7 +100,7 @@
"These states actually evolve over time when training. They don't exist beforehand.\n",
"\n",
"**TL;DR**:\n",
"We'll train a network $\\mathcal{C}$ to reduce the numerical errors of a simulator with a more accurate reference. It's crucial to have the _source_ solver realized as a differential physics operator, such that it can give gradients for an improved training of $\\mathcal{C}$.\n",
"We'll train a network $\\mathcal{C}$ to reduce the numerical errors of a simulator with a more accurate reference. It's crucial to have the _source_ solver realized as a differential physics operator, such that it provides gradients for an improved training of $\\mathcal{C}$.\n",
"\n",
"<br>\n",
"\n",
@ -158,7 +158,7 @@
"id": "RY1F4kdWPLNG"
},
"source": [
"Also let's get installing / importing all the necessary libraries out of the way. And while we're at it, we can set the random seed - obviously, 42 is the ultimate choice here \ud83d\ude42"
"Also let's get installing / importing all the necessary libraries out of the way. And while we're at it, we set the random seed - obviously, 42 is the ultimate choice here \ud83d\ude42"
]
},
{
@ -199,7 +199,7 @@
"source": [
"## Simulation setup\n",
"\n",
"Now we can set up the _source_ simulation $\\mathcal{P}_{s}$. \n",
"Now we set up the _source_ simulation $\\mathcal{P}_{s}$. \n",
"Note that we won't deal with \n",
"$\\mathcal{P}_{r}$\n",
"below: the downsampled reference data is contained in the training data set. It was generated with a four times finer discretization. Below we're focusing on the interaction of the source solver and the NN. \n",
@ -366,7 +366,7 @@
"source": [
"Next, we're coming to two functions which are pretty important: they transform the simulation state into an input tensor for the network, and vice versa. Hence, they're the interface between _keras/tensorflow_ and _phiflow_.\n",
"\n",
"The `to_keras` function uses the two vector components via `vector['x']` and `vector['y']` to discard the outermost layer of the velocity field grids. This gives two tensors of equal size that can be combined. \n",
"The `to_keras` function uses the two vector components via `vector['x']` and `vector['y']` to discard the outermost layer of the velocity field grids. This gives two tensors of equal size that are concatenated. \n",
"It then adds a constant channel via `math.ones` that is multiplied by the desired Reynolds number in `ext_const_channel`. The resulting stack of grids is stacked along the `channels` dimensions, and represents an input to the neural network. \n",
"\n",
"After network evaluation, we transform the output tensor back into a phiflow grid via the `to_phiflow` function. \n",
@ -573,7 +573,7 @@
"source": [
"Note that the `density` here denotes a passively advected marker field, and not the density of the fluid. Below we'll be focusing on the velocity only, the marker density is tracked purely for visualization purposes.\n",
"\n",
"After all the definitions we can finally run some code. We can define the dataset object with the downloaded data from the first cell."
"After all the definitions we can finally run some code. We define the dataset object with the downloaded data from the first cell."
]
},
{
@ -693,9 +693,9 @@
"\n",
"Now comes the **most crucial** step in the whole setup: we define a function that encapsulates the chain of simulation steps and network evaluations in each training step. After all the work defining helper functions, it's actually pretty simple: we create a gradient tape via `tf.GradientTape()` such that we can backpropagate later on. We then loop over `msteps`, call the simulator via `simulator.step` for an input state, and afterwards evaluate the correction via `network(to_keras(...))`. The NN correction is then added to the last simulation state in the `prediction` list (we're actually simply overwriting the last simulated velocity `prediction[-1][1]` with `prediction[-1][1] + correction[-1]`.\n",
"\n",
"One other important thing that's happening here is normalization: the inputs to the network are divided by the standard deviations in `dataset.dataStats`. After evaluating the `network`, we only have a velocity left, so we can simply multiply by the standard deviation of the velocity again (via `* dataset.dataStats['std'][1]` and `[2]`).\n",
"One other important thing that's happening here is normalization: the inputs to the network are divided by the standard deviations in `dataset.dataStats`. After evaluating the `network`, we only have a velocity left, so we simply multiply it by the standard deviation of the velocity again (via `* dataset.dataStats['std'][1]` and `[2]`).\n",
"\n",
"The `training_step` function also directly evaluates and returns the loss. Here, we can simply use an $L^2$ loss over the whole sequence, i.e. the iteration over `msteps`. This is requiring a few lines of code because we separately loop over 'x' and 'y' components, in order to normalize and compare to the ground truth values from the training data set.\n",
"The `training_step` function also directly evaluates and returns the loss. Here, we simply use an $L^2$ loss over the whole sequence, i.e. the iteration over `msteps`. This is requiring a few lines of code because we separately loop over 'x' and 'y' components, in order to normalize and compare to the ground truth values from the training data set.\n",
"\n",
"The \"learning\" happens in the last two lines via `tape.gradient()` and `opt.apply_gradients()`, which then contain the aggregated information about how to change the NN weights to nudge the simulation closer to the reference for the full chain of simulation steps."
]
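Since the corresponding cell is not part of this diff, the following self-contained toy mirrors the structure just described: coarse solver step, additive NN correction, a loss accumulated over the unrolled `msteps`, and finally `tape.gradient()` plus `opt.apply_gradients()`. The 1D solver, sizes, and the single normalization factor are stand-ins for the notebook's 2D setup:

```python
import tensorflow as tf

N, MSTEPS, STD = 32, 4, 1.0                  # STD stands in for dataset.dataStats

def source_step(u):                          # toy, overly diffusive "source" solver
    return u + 0.2 * (tf.roll(u, -1, 1) - 2 * u + tf.roll(u, 1, 1))

network = tf.keras.Sequential([tf.keras.layers.Dense(N, activation="tanh"),
                               tf.keras.layers.Dense(N)])
opt = tf.keras.optimizers.Adam(1e-3)

def training_step(u0, refs):                 # refs: list of MSTEPS reference states
    with tf.GradientTape() as tape:
        prediction, loss = [u0], 0.0
        for i in range(MSTEPS):
            u = source_step(prediction[-1])
            u = u + network(u / STD) * STD   # normalized NN input, un-normalized output
            prediction.append(u)
            loss += tf.reduce_mean((u - refs[i]) ** 2)
    grads = tape.gradient(loss, network.trainable_variables)
    opt.apply_gradients(zip(grads, network.trainable_variables))
    return loss

# e.g.: training_step(tf.random.normal([1, N]), [tf.zeros([1, N])] * MSTEPS)
```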
@ -768,7 +768,7 @@
"id": "c4yLlDM3QfUR"
},
"source": [
"Once defined, we can prepare this function for executing the training step by calling phiflow's `math.jit_compile()` function. It automatically maps to the correct pre-compilation step of the chosen backend. E.g., for TF this internally creates a computational graph, and optimizes the chain of operations. For JAX, it can even compile optimized GPU code (if JAX is set up correctly). Using the jit compilation can make a huge difference in terms of runtime."
"Once defined, we prepare this function for executing the training step by calling phiflow's `math.jit_compile()` function. It automatically maps to the correct pre-compilation step of the chosen backend. E.g., for TF this internally creates a computational graph, and optimizes the chain of operations. For JAX, it can even compile optimized GPU code (if JAX is set up correctly). Thus, using the jit compilation can make a huge difference in terms of runtime."
]
},
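In code this is a one-liner; `training_step` is the function defined above, and `math` is phiflow's math module:

```python
from phi import math
training_step_jit = math.jit_compile(training_step)
```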
{
@ -950,9 +950,9 @@
"\n",
"In order to evaluate the performance of our DL-powered solver, we essentially only need to repeat the inner loop of each training iteration for more steps. While we were limited to `msteps` evaluations at training time, we can now run our solver for arbitrary lengths. This is a good test for how well our solver has learned to keep the data within the desired distribution, and represents a generalization test for longer rollouts.\n",
"\n",
"We can reuse the solver code from above, but in the following, we will consider two simulated versions: for comparison, we'll run one reference simulation in the _source_ space (i.e., without any modifications). This version receives the regular outputs of each evaluation of the simulator, and ignores the learned correction (stored in `steps_source` below). The second version, repeatedly computes the source solver plus the learned correction, and advances this state in the solver (`steps_hybrid`).\n",
"We reuse the solver code from above, but in the following, we will consider two simulated versions: for comparison, we'll run one reference simulation in the _source_ space (i.e., without any modifications). This version receives the regular outputs of each evaluation of the simulator, and ignores the learned correction (stored in `steps_source` below). The second version, repeatedly computes the source solver plus the learned correction, and advances this state in the solver (`steps_hybrid`).\n",
"\n",
"We also need a set of new data. Below, we'll download a new set of Reynolds numbers (in between the ones used for training), so that we can later on run the unmodified simulator and the DL-powered one on these cases.\n"
"We also need a set of new data. Below, we'll download a new set of Reynolds numbers (in between the ones used for training), on which we will later on run the unmodified simulator and the DL-powered one.\n"
]
},
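Schematically, the two rollouts differ only in whether the correction is applied; reusing the toy `source_step` and `network` from the training sketch above:

```python
u_test = tf.random.normal([1, N])            # placeholder initial test state
steps_source, steps_hybrid = [u_test], [u_test]
for _ in range(120):
    steps_source.append(source_step(steps_source[-1]))   # unmodified solver
    u = source_step(steps_hybrid[-1])
    steps_hybrid.append(u + network(u))                  # solver + learned correction
```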
{
@ -1032,7 +1032,7 @@
"id": "sMqRPg2pdIxh"
},
"source": [
"Next we construct a `math.tensor` as initial state for the centered marker fields, and a staggered grid from the next two indices of the test set batch. Similar to `to_phiflow` above, we can use `phi.math.stack()` to combine two fields of appropriate size as a staggered grid."
"Next we construct a `math.tensor` as initial state for the centered marker fields, and a staggered grid from the next two indices of the test set batch. Similar to `to_phiflow` above, we use `phi.math.stack()` to combine two fields of appropriate size as a staggered grid."
]
},
{
@ -1056,7 +1056,7 @@
"id": "KhGVceo6dIxl"
},
"source": [
"Now we can first run the _source_ simulation for 120 steps as baseline:"
"Now we first run the _source_ simulation for 120 steps as baseline:"
]
},
{
@ -1219,7 +1219,7 @@
"id": "aOQP6iCBdIxs"
},
"source": [
"Due to the complexity of the training, the performance can vary, but typically the overall MAE is ca. 160% larger for the regular simulation compared to the hybrid simulator. \n",
"Due to the complexity of the training, the performance varies but typically the overall MAE is ca. 160% larger for the regular simulation compared to the hybrid simulator. \n",
"The gap is typically even bigger for other Reynolds numbers within the training data range. \n",
"The graph above also shows this behavior over time.\n",
"\n",
@ -1308,7 +1308,7 @@
"\n",
"The version produced by the hybrid solver does much better. It preserves the vortex shedding even after more than one hundred updates. Note that both outputs were produced by the same underlying solver. The second version just profits from the learned corrector which manages to revert the numerical errors of the source solver, including its overly strong dissipation. \n",
"\n",
"We can also visually compare how the NN does w.r.t. reference data. The next cell plots one time step of the three versions: the reference data after 50 steps, and the re-simulated version of the source and our hybrid solver, together with a per-cell error of the two:"
"We also visually compare how the NN does w.r.t. reference data. The next cell plots one time step of the three versions: the reference data after 50 steps, and the re-simulated version of the source and our hybrid solver, together with a per-cell error of the two:"
]
},
{

View File

@ -1,102 +1,42 @@
Discussion
=======================
To summarize, the training via differentiable physics (DP) as described so far allows us
The previous sections have explained the _differentiable physics_ approach for deep learning, and have given a range of examples: from a very basic gradient calculation, all the way to complex learning setups powered by advanced simulations. This is a good time to take a step back and evaluate: in the end, the differentiable physics components of these approaches are not too complicated. They are largely based on existing numerical methods, with a focus on efficiently using those methods not only to do a forward simulation, but also to compute gradient information.
The training via differentiable physics (DP) allows us
to integrate full numerical simulations into the training of deep neural networks.
As a consequence, this lets the networks learn to _interact_ with these simulations.
While we've only hinted at what could be
achieved via DP approaches it is nonetheless a good time to discuss some
additional properties, and summarize the pros and cons.
What is primarily exciting in this context are the implications that arise from the combination of these numerical methods with deep learning.
![Divider](resources/divider6.jpg)
![Divider](resources/divider4.jpg)
## Integration
Most importantly, training via differentiable physics allows us to seamlessly bring the two fields together:
we can obtain _hybrid_ methods that use the best numerical methods that we have at our disposal for the simulation itself, as well as for the training process. We can then use the trained model to improve forward or backward solves. Thus, in the end, we have a solver that combines a _traditional_ solver and a _learned_ component that in combination can improve the capabilities of numerical methods.
## Time steps and iterations
## Interaction
When using DP approaches for learning applications,
there is a lot of flexibility w.r.t. the combination of DP and NN building blocks.
As some of the differences are subtle, the following section will go into more detail.
We'll especially focus on solvers that repeat the PDE and NN evaluations multiple times,
e.g., to compute multiple states of the physical system over time.
One key aspect that is important for these hybrids to work well is to let the NN _interact_ with the PDE solver at training time. Differentiable simulations allow a trained model to "explore and experience" the physical environment, and receive directed feedback regarding its interactions throughout the solver iterations. This combination nicely fits into the broader context of machine learning as _differentiable programming_.
To re-cap, here's the previous figure about combining NNs and DP operators.
In the figure these operators look like a loss term: they typically don't have weights,
and only provide a gradient that influences the optimization of the NN weights:
## Generalization
```{figure} resources/diffphys-shortened.jpg
---
height: 220px
name: diffphys-short
---
The DP approach as described in the previous chapters. A network produces an input to a PDE solver $\mathcal P$, which provides a gradient for training during the backpropagation step.
```
This setup can be seen as the network receiving information about how it's output influences the outcome of the PDE solver. I.e., the gradient will provide information how to produce an NN output that minimizes the loss.
Similar to the previously described _physical losses_ (from {doc}`physicalloss`), this can mean upholding a conservation law.
**Switching the Order**
However, with DP, there's no real reason to be limited to this setup. E.g., we could imagine a swap of the NN and DP components, giving the following structure:
```{figure} resources/diffphys-switched.jpg
---
height: 220px
name: diffphys-switch
---
A PDE solver produces an output which is processed by an NN.
```
In this case the PDE solver essentially represents an _on-the-fly_ data generator. This is not necessarily always useful: this setup could be replaced by a pre-computation of the same inputs, as the PDE solver is not influenced by the NN. Hence, we could replace the $\mathcal P$ invocations by a "loading" function. On the other hand, evaluating the PDE solver at training time with a randomized sampling of the parameter domain of interest can lead to an excellent sampling of the data distribution of the input, and hence yield accurate and stable NNs. If done correctly, the solver can alleviate the need to store and load large amounts of data, and instead produce them more quickly at training time, e.g., directly on a GPU.
**Time Stepping**
In general, there's no combination of NN layers and DP operators that is _forbidden_ (as long as their dimensions are compatible). One that makes particular sense is to "unroll" the iterations of a time stepping process of a simulator, and let the state of a system be influenced by an NN.
In this case we compute a (potentially very long) sequence of PDE solver steps in the forward pass. In-between these solver steps, an NN modifies the state of our system, which is then used to compute the next PDE solver step. During the backpropagation pass, we move backwards through all of these steps to evaluate contributions to the loss function (it can be evaluated in one or more places anywhere in the execution chain), and to backpropagte the gradient information through the DP and NN operators. This unrolling of solver iterations essentially gives feedback to the NN about how it's "actions" influence the state of the physical system and resulting loss. Due to the iterative nature of this process, many errors increase exponentially over the course of iterations, and are extremely difficult to detect in a single evaluation. In these cases it is crucial to provide feedback to the NN at training time who the errors evolve over course of the iterations. Note that in this case, a pre-computation of the states is not possible, as the iterations depend on the state of the NN, which is unknown before training. Hence, a DP-based training is crucial to evaluate the correct gradient information at training time.
```{figure} resources/diffphys-multistep.jpg
---
height: 180px
name: diffphys-mulitstep
---
Time stepping with interleaved DP and NN operations for $k$ solver iterations. The dashed gray arrows indicate optional intermediate evaluations of loss terms (similar to the solid gray arrow for the last step $k$).
```
Note that in this picture (and the ones before) we have assumed an _additive_ influence of the NN. Of course, any differentiable operator could be used here to integrate the NN result into the state of the PDE. E.g., multiplicative modifications can be more suitable in certain settings, or in others the NN could modify the parameters of the PDE in addition to or instead of the state space. Likewise, the loss function is problem dependent and can be computed in different ways.
DP setups with many time steps can be difficult to train: the gradients need to backpropagate through the full chain of PDE solver evaluations and NN evaluations. Typically, each of them represents a non-linear and complex function. Hence for larger numbers of steps, the vanishing and exploding gradient problem can make training difficult (see {doc}`diffphys-code-sol` for some practical tips how to alleviate this).
## Alternatives: noise
It is worth mentioning here that other works have proposed perturbing the inputs and
the iterations at training time with noise {cite}`sanchez2020learning` (somewhat similar to
regularizers like dropout).
This can help to prevent overfitting to the training states, and hence shares similarities
with the goals of training with DP.
However, the noise is typically undirected, and hence not as accurate as training with
the actual evolutions of simulations. Hence, this noise can be a good starting point
for training that tends to overfit, but if possible, it is preferable to incorporate the
actual solver in the training loop via a DP approach.
The hybrid approach also bears particular promise for simulators: it improves the generalization capabilities of the trained models by letting the PDE-solver handle large-scale _changes to the data distribution_ such that the learned model can focus on localized structures not captured by the discretization. While physical models generalize very well, learned models often specialize in data distributions seen at training time. This was, e.g., shown for the models reducing numerical errors of the previous chapter: the trained models can deal with solution manifolds with significant amounts of varying physical behavior, while simpler training variants quickly deteriorate over the course of recurrent time steps.
![Divider](resources/divider5.jpg)
## Summary
To summarize the pros and cons of training NNs via DP:
To summarize, the pros and cons of training NNs via DP:
✅ Pro:
- Uses physical model and numerical methods for discretization.
- Efficiency and accuracy of selected methods carries over to training.
- Very tight coupling of physical models and NNs possible.
- Improved generalization via solver interactions.
❌ Con:
- Not compatible with all simulators (need to provide gradients).
- Require more heavy machinery (in terms of framework support) than previously discussed methods.
Here, the last negative point (regarding heavy machinery) is bound to strongly improve in a fairly short amount of time. However, for now it's important to keep in mind that not every simulator is suitable for DP training out of the box. Hence, in this book we'll focus on examples using phiflow, which was designed for interfacing with deep learning frameworks.
_Outlook_: the last negative point (regarding heavy machinery) is bound to strongly improve given the current pace of software and API developments in the DL area. However, for now it's important to keep in mind that not every simulator is suitable for DP training out of the box. Hence, in this book we'll focus on examples using phiflow, which was designed for interfacing with deep learning frameworks.
Next we can target more some complex scenarios to showcase what can be achieved with differentiable physics.
This will also illustrate how the right selection of numerical methods for a DP operator yields improvements in terms of training accuracy.
Training NNs via differentiable physics solvers is a very generic approach that is applicable to a wide range of combinations of PDE-based models and deep learning. In the next chapters, we will target the underlying learning process to obtain even better NN states.

View File

@ -1,7 +1,7 @@
Differentiable Physics versus Physics-informed Training
So Far So Good - a First Discussion
=======================
In the previous sections we've seen example reconstructions that used physical residuals as soft constraints, in the form of the PINNs, and reconstructions that used a differentiable physics (DP) solver. While both methods can find minimizers for similar inverse problems, the obtained solutions differ substantially, as does the behavior of the non-linear optimization problem that we get from each formulation. In the following we discuss these differences in more detail, and we will combine conclusions drawn from the behavior of the Burgers case of {doc}`physicalloss-code` and {doc}`diffphys-code-burgers` with observations from research papers.
In the previous section we've seen example reconstructions that used physical residuals as soft constraints, in the form of variant 2 (PINNs), and reconstructions that used a differentiable physics (DP) solver. While both methods can find minimizers for similar inverse problems, the obtained solutions differ substantially, as does the behavior of the non-linear optimization problem that we get from each formulation. In the following we discuss these differences in more detail, and we will combine conclusions drawn from the behavior of the Burgers case of {doc}`physicalloss-code` and {doc}`diffphys-code-burgers` with observations from external research papers {cite}`holl2019pdecontrol`.
![Divider](resources/divider3.jpg)
@ -16,9 +16,9 @@ The DP version on the other hand inherently relies on a numerical solver that is
The reliance on a suitable discretization requires some understanding and knowledge of the problem under consideration. A sub-optimal discretization can impede the learning process or, worst case, lead to diverging training runs. However, given the large body of theory and practical realizations of stable solvers for a wide variety of physical problems, this is typically not an insurmountable obstacle.
The PINN approaches on the other hand do not require an a-priori choice of a discretization, and as such seems to be "discretization-less". This, however, is only an advantage on first sight. As they yield solutions in a computer, they naturally _have_ to discretize the problem, but they construct this discretization over the course of the training process, in a way that is not easily controllable from the outside. Thus, the resulting accuracy is determined by how well the training manages to estimate the complexity of the problem for realistic use cases, and how well the training data approximates the unknown regions of the solution.
The PINN approaches on the other hand do not require an a-priori choice of a discretization, and as such seem to be "discretization-less". This, however, is only an advantage at first sight. As they yield solutions in a computer, they naturally _have_ to discretize the problem. They construct this discretization over the course of the training process, in a way that lies at the mercy of the underlying nonlinear optimization, and is not easily controllable from the outside. Thus, the resulting accuracy is determined by how well the training manages to estimate the complexity of the problem for realistic use cases, and how well the training data approximates the unknown regions of the solution.
As demonstrated with the Burgers example, the PINN solutions typically have significant difficulties propagating information _backward_ in time. This is closely coupled to the efficiency of the method.
E.g., as demonstrated with the Burgers example, the PINN solutions typically have significant difficulties propagating information _backward_ in time. This is closely coupled to the efficiency of the method.
## Efficiency
@ -32,7 +32,7 @@ For the PINN representation with fully-connected networks on the other hand, we
That being said, because the DP approaches can cover much larger solution manifolds, the structure of these manifolds is typically also difficult to learn. E.g., when training a network with a larger number of iterations (i.e. a long look-ahead into the future), this typically represents a signal that is more difficult to learn than a short look ahead.
As a consequence, these training runs not only take more computational resources per NN iteration, they also need longer to converge. Regarding resources, each computation of the look-ahead potentially requires a large number of simulation steps, and typically a similar amount of resources for the backpropagation step. Regarding convergence, the more complex signal that should be learned can take more training iterations and require larger NN structures.
As a consequence, these training runs not only take more computational resources per NN iteration, they also often need longer to converge. Regarding resources, each computation of the look-ahead potentially requires a large number of simulation steps, and typically a similar amount of resources for the backpropagation step. Thus, while they may seem costly and slow to converge at times, this is usually caused by the more complex signal that needs to be learned.
![Divider](resources/divider2.jpg)
@ -44,7 +44,7 @@ The following table summarizes these pros and cons of physics-informed (PI) and
| Method | ✅ Pro | ❌ Con |
|----------|-------------|------------|
| **PI** | - Analytic derivatives via backpropagation. | - Expensive evaluation of NN, as well as derivative calculations. |
| **PI** | - Analytic derivatives via backpropagation. | - Expensive evaluation of NN, and very costly derivative calculations. |
| | - Easy to implement. | - Incompatible with existing numerical methods. |
| | | - No control of discretization. |
| | | |
@ -54,6 +54,4 @@ The following table summarizes these pros and cons of physics-informed (PI) and
As a summary, both methods are definitely interesting, and have a lot of potential. There are numerous more complicated extensions and algorithmic modifications that change and improve on the various negative aspects we have discussed for both sides.
However, as of this writing, the physics-informed (PI) approach has clear limitations when it comes to performance and compatibility with existing numerical methods. Thus, when knowledge of the problem at hand is available, which typically is the case when we choose a suitable PDE model to constrain the learning process, employing a differentiable physics solver can significantly improve the training process as well as the quality of the obtained solution. Next, we will target more complex settings, i.e., fluids with Navier-Stokes, to illustrate this in more detail.
However, as of this writing, the physics-informed (PI) approach has clear limitations when it comes to performance and compatibility with existing numerical methods. Thus, when knowledge of the problem at hand is available, which typically is the case when we choose a suitable PDE model to constrain the learning process, employing a differentiable physics solver can significantly improve the training process as well as the quality of the obtained solution. So, in the following we'll focus on DP variants, and illustrate their capabilities with more complex scenarios in the next chapters. First, we'll consider a case that very efficiently computes space-time gradients for transient fluid simulations.

View File

@ -1,6 +1,83 @@
Complex Examples Overview
Integrating DP into NN Training
=======================
% ## Time steps and iterations
We'll now target the integration of differentiable physics (DP) setups into NN training.
When using DP approaches for learning applications,
there is a lot of flexibility w.r.t. the combination of DP and NN building blocks.
As some of the differences are subtle, the following section will go into more detail.
We'll especially focus on solvers that repeat the PDE and NN evaluations multiple times,
e.g., to compute multiple states of the physical system over time.
To re-cap, here's the previous figure about combining NNs and DP operators.
In the figure these operators look like a loss term: they typically don't have weights,
and only provide a gradient that influences the optimization of the NN weights:
```{figure} resources/diffphys-shortened.jpg
---
height: 220px
name: diffphys-short
---
The DP approach as described in the previous chapters. A network produces an input to a PDE solver $\mathcal P$, which provides a gradient for training during the backpropagation step.
```
This setup can be seen as the network receiving information about how its output influences the outcome of the PDE solver. I.e., the gradient will provide information on how to produce an NN output that minimizes the loss.
Similar to the previously described _physical losses_ (from {doc}`physicalloss`), this can mean upholding a conservation law.
**Switching the Order**
However, with DP, there's no real reason to be limited to this setup. E.g., we could imagine a swap of the NN and DP components, giving the following structure:
```{figure} resources/diffphys-switched.jpg
---
height: 220px
name: diffphys-switch
---
A PDE solver produces an output which is processed by an NN.
```
In this case the PDE solver essentially represents an _on-the-fly_ data generator. That's not always useful: this setup could be replaced by a pre-computation of the same inputs, as the PDE solver is not influenced by the NN. Hence, we could replace the $\mathcal P$ invocations by a "loading" function. On the other hand, evaluating the PDE solver at training time with a randomized sampling of the parameter domain of interest can lead to an excellent sampling of the data distribution of the input, and improve the NN training. If done correctly, the solver can also alleviate the need to store and load large amounts of data, and instead produce them more quickly at training time, e.g., directly on a GPU.
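A minimal, self-contained sketch of this "solver as data generator" pattern, with a one-step diffusion solve standing in for $\mathcal P$ (all names, sizes, and parameter ranges are illustrative):

```python
import tensorflow as tf

N = 32
network = tf.keras.Sequential([tf.keras.layers.Dense(N)])
opt = tf.keras.optimizers.Adam(1e-3)

def solver(u, nu):   # stand-in for P: one explicit diffusion step
    return u + nu * (tf.roll(u, -1, 1) - 2 * u + tf.roll(u, 1, 1))

for it in range(1000):
    nu = tf.random.uniform([1, 1], 0.01, 0.1)     # randomized physics parameter
    u0 = tf.random.normal([1, N])                 # randomized input state
    y = solver(u0, nu)                            # fresh training pair, no dataset
    with tf.GradientTape() as tape:
        pred = network(tf.concat([u0, nu], axis=1))
        loss = tf.reduce_mean((pred - y) ** 2)
    opt.apply_gradients(zip(tape.gradient(loss, network.trainable_variables),
                            network.trainable_variables))
```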
**Time Stepping**
In general, there's no combination of NN layers and DP operators that is _forbidden_ (as long as their dimensions are compatible). One that makes particular sense is to "unroll" the iterations of a time stepping process of a simulator, and let the state of a system be influenced by an NN.
In this case we compute a (potentially very long) sequence of PDE solver steps in the forward pass. In-between these solver steps, an NN modifies the state of our system, which is then used to compute the next PDE solver step. During the backpropagation pass, we move backwards through all of these steps to evaluate contributions to the loss function (it can be evaluated in one or more places anywhere in the execution chain), and to backpropagate the gradient information through the DP and NN operators. This unrolling of solver iterations essentially gives feedback to the NN about how its "actions" influence the state of the physical system and the resulting loss. Due to the iterative nature of this process, many errors start out very small, and then slowly increase exponentially over the course of iterations. Hence they are extremely difficult to detect in a single evaluation, e.g., from a simple supervised training setup. In these cases it is crucial to provide feedback to the NN at training time about how the errors evolve over the course of the iterations. Additionally, a pre-computation of the states is not possible for such iterative cases, as the iterations depend on the state of the NN. Naturally, the NN state is unknown before training time and changes while being trained. Hence, a DP-based training is crucial to provide the NN with gradients about how it influences the solver iterations.
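In a minimal, self-contained sketch (toy 1D solver, placeholder targets), this interleaved scheme reads as follows; the per-step loss terms correspond to the dashed arrows in the figure below:

```python
import tensorflow as tf

k, N = 8, 32
solver_step = lambda u: u + 0.1 * (tf.roll(u, -1, 1) - 2 * u + tf.roll(u, 1, 1))
network = tf.keras.Sequential([tf.keras.layers.Dense(N)])
u0 = tf.random.normal([1, N])
references = [tf.zeros([1, N])] * k          # placeholder loss targets

with tf.GradientTape() as tape:
    u, loss = u0, 0.0
    for step in range(k):
        u = solver_step(u)                   # differentiable PDE step
        u = u + network(u)                   # additive NN influence on the state
        loss += tf.reduce_mean((u - references[step]) ** 2)  # optional per-step loss
grads = tape.gradient(loss, network.trainable_variables)
```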
```{figure} resources/diffphys-multistep.jpg
---
height: 180px
name: diffphys-mulitstep
---
Time stepping with interleaved DP and NN operations for $k$ solver iterations. The dashed gray arrows indicate optional intermediate evaluations of loss terms (similar to the solid gray arrow for the last step $k$).
```
Note that in this picture (and the ones before) we have assumed an _additive_ influence of the NN. Of course, any differentiable operator could be used here to integrate the NN result into the state of the PDE. E.g., multiplicative modifications can be more suitable in certain settings, or in others the NN could modify the parameters of the PDE in addition to or instead of the state space. Likewise, the loss function is problem dependent and can be computed in different ways.
DP setups with many time steps can be difficult to train: the gradients need to backpropagate through the full chain of PDE solver evaluations and NN evaluations. Typically, each of them represents a non-linear and complex function. Hence for larger numbers of steps, the vanishing and exploding gradient problem can make training difficult (see {doc}`diffphys-code-sol` for some practical tips on how to alleviate this).
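One simple and widely used mitigation (a generic suggestion here, not a technique prescribed by this chapter) is to clip the global gradient norm before applying the update, continuing the sketch above:

```python
opt = tf.keras.optimizers.Adam(1e-4)             # any optimizer works here
grads = tape.gradient(loss, network.trainable_variables)
grads, _ = tf.clip_by_global_norm(grads, 1.0)    # cap the global gradient norm
opt.apply_gradients(zip(grads, network.trainable_variables))
```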
![Divider](resources/divider4.jpg)
## Alternatives: noise
It is worth mentioning here that other works have proposed perturbing the inputs and
the iterations at training time with noise {cite}`sanchez2020learning` (somewhat similar to
regularizers like dropout).
This can help to prevent overfitting to the training states, and hence shares similarities
with the goals of training with DP.
However, the noise is typically undirected, and hence not as accurate as training with
the actual evolutions of simulations. Hence, this noise can be a good starting point
for training setups that tend to overfit. However, if possible, it is preferable to incorporate the
actual solver in the training loop via a DP approach.
---
## Complex Examples
The following sections will give code examples of more complex cases to
show what can be achieved via differentiable physics training.

View File

@ -1,24 +0,0 @@
Summary and Discussion
=======================
The previous sections have explained the _differentiable physics_ approach for deep learning, and have given a range of examples: from a very basic gradient calculation, all the way to complex learning setups powered by advanced simulations. This is a good time to take a step back and evaluate: in the end, the differentiable physics components of these approaches are not too complicated. They are largely based on existing numerical methods, with a focus on efficiently using those methods not only to do a forward simulation, but also to compute gradient information. What is primarily exciting in this context are the implications that arise from the combination of these numerical methods with deep learning.
![Divider](resources/divider6.jpg)
## Integration
Most importantly, training via differentiable physics allows us to seamlessly bring the two fields together:
we can obtain _hybrid_ methods, that use the best numerical methods that we have at our disposal for the simulation itself, as well as for the training process. We can then use the trained model to improve forward or backward solves. Thus, in the end, we have a solver that combines a _traditional_ solver and a _learned_ component that in combination can improve the capabilities of numerical methods.
## Interaction
One key aspect that is important for these hybrids to work well is to let the NN _interact_ with the PDE solver at training time. Differentiable simulations allow a trained model to "explore and experience" the physical environment, and receive directed feedback regarding its interactions throughout the solver iterations. This combination nicely fits into the broader context of machine learning as _differentiable programming_.
## Generalization
The hybrid approach also bears particular promise for simulators: it improves generalizing capabilities of the trained models by letting the PDE-solver handle large-scale _changes to the data distribution_ such that the learned model can focus on localized structures not captured by the discretization. While physical models generalize very well, learned models often specialize in data distributions seen at training time. This was, e.g., shown for the models reducing numerical errors of the previous chapter: the trained models can deal with solution manifolds with significant amounts of varying physical behavior, while simpler training variants quickly deteriorate over the course of recurrent time steps.
![Divider](resources/divider7.jpg)
Training NNs via differentiable physics solvers is a very generic approach that is applicable to a wide range of combinations of PDE-based models and deep learning. Nonetheless, the next chapters will discuss several variants that are orthogonal to the general DP version, or can yield benefits in more specialized settings.

View File

@ -3,8 +3,8 @@ Introduction to Differentiable Physics
As a next step towards a tighter and more generic combination of deep learning
methods and physical simulations we will target incorporating _differentiable
simulations_ into the learning process. In the following, we'll shorten
that to "differentiable physics" (DP).
numerical simulations_ into the learning process. In the following, we'll shorten
these "differentiable numerical simulations of physical systems" to just "differentiable physics" (DP).
The central goal of these methods is to use existing numerical solvers, and equip
them with functionality to compute gradients with respect to their inputs.
@ -42,15 +42,15 @@ for discretizing continuous models.
Let's assume we have a continuous formulation $\mathcal P^*(\mathbf{x}, \nu)$ of the physical quantity of
interest $\mathbf{u}(\mathbf{x}, t): \mathbb R^d \times \mathbb R^+ \rightarrow \mathbb R^d$,
with model parameters $\nu$ (e.g., diffusion, viscosity, or conductivity constants).
The component of $\mathbf{u}$ will be denoted by a numbered subscript, i.e.,
The components of $\mathbf{u}$ will be denoted by a numbered subscript, i.e.,
$\mathbf{u} = (u_1,u_2,\dots,u_d)^T$.
%and a corresponding discrete version that describes the evolution of this quantity over time: $\mathbf{u}_t = \mathcal P(\mathbf{x}, \mathbf{u}, t)$.
Typically, we are interested in the temporal evolution of such a system.
Discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
that we can re-arrange to compute a future state after a time step $\Delta t$.
that we re-arrange to compute a future state after a time step $\Delta t$.
The state at $t+\Delta t$ is computed via a sequence of
operations $\mathcal P_1, \mathcal P_2 \dots \mathcal P_m$ such that
$\mathbf{u}(t+\Delta t) = \mathcal P_1 \circ \mathcal P_2 \circ \dots \mathcal P_m ( \mathbf{u}(t),\nu )$,
$\mathbf{u}(t+\Delta t) = \mathcal P_m \circ \dots \mathcal P_2 \circ \mathcal P_1 ( \mathbf{u}(t),\nu )$,
where $\circ$ denotes function composition, i.e. $f(g(x)) = f \circ g(x)$.
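In code, this composition order simply means the operator written right-most is applied first; a tiny illustration for $m=2$ with placeholder operators:

```python
def P1(u, nu): return u + nu          # placeholder solver operations
def P2(u, nu): return 2.0 * u

u, nu = 1.0, 0.1
u_next = P2(P1(u, nu), nu)            # (P2 ∘ P1)(u, nu): P1 acts first
```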
```{note}
@ -93,7 +93,7 @@ $\mathbf{u}$ to another, the Jacobian is square and symmetric here. Of course th
for general model equations, but non-square Jacobian matrices would not cause any problems for differentiable
simulations.
In practice, we can rely on the _reverse mode_ differentiation that all modern DL
In practice, we rely on the _reverse mode_ differentiation that all modern DL
frameworks provide, and focus on computing a matrix vector product of the Jacobian transpose
with a vector $\mathbf{a}$, i.e. the expression:
$
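This transposed-Jacobian vector product is exactly what reverse-mode AD provides, without ever forming the full Jacobian. A minimal TensorFlow illustration, where the vector $\mathbf{a}$ enters via `output_gradients` and a squaring op stands in for a solver operation:

```python
import tensorflow as tf

u = tf.Variable([1.0, 2.0, 3.0])
a = tf.constant([0.5, -1.0, 2.0])
with tf.GradientTape() as tape:
    p = u * u                                    # placeholder for P(u)
vjp = tape.gradient(p, u, output_gradients=a)    # = (dP/du)^T a, here 2*u*a
```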
@ -158,7 +158,7 @@ Thus, if we want to employ DL for model equations that we don't have a proper gr
idea to directly go for learning via a DP approach. However, if we don't really understand our model, we probably
should go back to studying it a bit more anyway...
Also, in practice we can be _greedy_ with the derivative operators, and only
Also, in practice we should be _greedy_ with the derivative operators, and only
provide those which are relevant for the learning task. E.g., if our network
never produces the parameter $\nu$ in the example above, and it doesn't appear in our
loss formulation, we will never encounter a $\partial/\partial \nu$ derivative
@ -186,7 +186,7 @@ $$
\frac{\partial d}{\partial{t}} + \mathbf{u} \cdot \nabla d = 0
$$
Instead of using this formulation as a residual equation right away (as in {doc}`physicalloss`),
Instead of using this formulation as a residual equation right away (as in v2 of {doc}`physicalloss`),
we can discretize it with our favorite mesh and discretization scheme,
to obtain a formulation that updates the state of our system over time. This is a standard
procedure for a _forward_ solve.
@ -201,20 +201,20 @@ $$
$$
As a simple example of an inverse problem and learning task, let's consider the problem of
finding an unknown motion $\mathbf{u}$:
this motion should transform a given initial scalar density state $d^{~0}$ at time $t^0$
finding a velocity field $\mathbf{u}$.
This velocity should transform a given initial scalar density state $d^{~0}$ at time $t^0$
into a state that's evolved by $\mathcal P$ to a later "end" time $t^e$
with a certain shape or configuration $d^{\text{target}}$.
Informally, we'd like to find a motion that deforms $d^{~0}$ through the PDE model into a target state.
Informally, we'd like to find a flow that deforms $d^{~0}$ through the PDE model into a target state.
The simplest way to express this goal is via an $L^2$ loss between the two states. So we want
to minimize the loss function $L=|d(t^e) - d^{\text{target}}|^2$.
Note that as described here this inverse problem is a pure optimization task: there's no NN involved,
and our goal is to obtain $\mathbf{u}$. We do not want to apply this motion to other, unseen _test data_,
and our goal is to obtain $\mathbf{u}$. We do not want to apply this velocity to other, unseen _test data_,
as would be customary in a real learning task.
The final state of our marker density $d(t^e)$ is fully determined by the evolution
of $\mathcal P$ via $\mathbf{u}$, which gives the following minimization problem:
from $\mathcal P$ via $\mathbf{u}$, which gives the following minimization problem:
$$
\text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2
@ -225,11 +225,15 @@ _gradient descent_ (GD), where the
gradient is determined by the differentiable physics approach described earlier in this chapter.
Once things are working with GD, we can relatively easily switch to better optimizers or bring
an NN into the picture, hence it's always a good starting point.
To make things easier to read below, we'll omit the transpose of the Jacobians in the following.
Unfortunately, the standard definition of the Jacobian is the un-transposed form, but we actually never need that one here.
Keep in mind that in practice we're dealing with transposed Jacobians $\big( \frac{ \partial a }{ \partial b} \big)^T$
that are "abbreviated" by $\frac{ \partial a }{ \partial b}$.
As the discretized velocity field $\mathbf{u}$ contains all our degrees of freedom,
all we need to do is to update the velocity by an amount
$\Delta \mathbf{u} = \partial L / \partial \mathbf{u}$,
which can be decomposed into
which is decomposed into
$\Delta \mathbf{u} =
\frac{ \partial d }{ \partial \mathbf{u}}
\frac{ \partial L }{ \partial d} $.
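A minimal sketch of this gradient descent loop could look as follows, assuming a differentiable advection step `advect` together with densities `d0`, `d_target` and a step count `num_steps` (all four are hypothetical placeholders here):

```python
import tensorflow as tf

u = tf.Variable(tf.zeros([32, 32, 2]))     # unknown velocity field, all DOFs
eta = 0.1                                  # step size for gradient descent

for _ in range(100):
    with tf.GradientTape() as tape:
        d = d0
        for _ in range(num_steps):         # forward solve from t^0 to t^e
            d = advect(d, u)
        loss = tf.reduce_sum((d - d_target) ** 2)
    grad = tape.gradient(loss, u)          # (dd/du)^T (dL/dd) via reverse mode
    u.assign_sub(eta * grad)               # u <- u - eta * Delta u
```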
@ -244,7 +248,7 @@ $$
If $d$ is represented as a vector, e.g., with one entry per cell of a mesh,
$\frac{ \partial L }{ \partial d}$ will likewise be a column vector of equivalent size.
This is thanks to the fact that $L$ is always a scalar loss function, and hence the Jacobian
This stems from the fact that $L$ is always a scalar loss function, and so the Jacobian
matrix will have a dimension of 1 along the $L$ dimension.
Intuitively, this vector will simply contain the differences between $d$ at the end time
in comparison to the target densities $d^{\text{target}}$.
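For the $L^2$ loss from above this Jacobian can be written out directly, up to a constant factor that is typically absorbed into the learning rate:

$$ \frac{\partial L}{\partial d} = 2 \big( d(t^e) - d^{\text{target}} \big) $$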
@ -293,7 +297,7 @@ $$ \begin{aligned}
height: 150px
name: advection-upwind
---
1st-order upwinding uses a simple one-sided finite-difference stencil that takes into account the direction of the motion
1st-order upwinding uses a simple one-sided finite-difference stencil that takes into account the direction of the flow
```
Thus, for a negative $u_i$, we're using $u_i^+$ to look in the opposite direction of the velocity, i.e., _backward_ in terms of the motion. $u_i^-$ will be zero in this case. For positive $u_i$ it's vice versa, and we'll get a zeroed $u_i^+$, and a backward difference stencil via $u_i^-$.
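As a concrete illustration, here is a minimal NumPy sketch of such a 1st-order upwind step in 1D with periodic boundaries (the naming of the $u^{\pm}$ terms in the equations above may differ in sign convention; the scheme itself is the same):

```python
import numpy as np

def upwind_advect(d, u, dt, dx):
    """Sketch: one 1st-order upwind step for d_t + u d_x = 0 (periodic)."""
    u_pos = np.maximum(u, 0.0)            # cells with flow towards +x
    u_neg = np.minimum(u, 0.0)            # cells with flow towards -x
    d_x_bwd = (d - np.roll(d, 1)) / dx    # one-sided backward difference
    d_x_fwd = (np.roll(d, -1) - d) / dx   # one-sided forward difference
    # look against the flow direction in both cases (upwinding)
    return d - dt * (u_pos * d_x_bwd + u_neg * d_x_fwd)
```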
@ -309,16 +313,13 @@ the change of the velocity $u_i$ depends on the spatial derivatives of the densi
Due to the first order upwinding, we only include two neighbors (higher order methods would depend on
additional entries of $d$)
% velocity derivative , Just for completeness: another derivative we could compute here is
In practice this step is equivalent to evaluating a transposed matrix multiplication.
If we rewrite the calculation above as
$ \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = A \mathbf{u}$,
then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
then $\big( \partial \mathcal P / \partial \mathbf{u} \big)^T = A^T$.
However, in many practical cases, a matrix free implementation of this multiplication might
be preferable to actually constructing $A$.
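For instance, SciPy's `LinearOperator` captures exactly this pattern: only the products with $A$ and $A^T$ need to be provided (a sketch; `apply_A` and `apply_At` are hypothetical matrix-free callbacks):

```python
from scipy.sparse.linalg import LinearOperator

# matvec evaluates A u for the forward pass, rmatvec evaluates A^T a
# for the backpropagation step, without ever assembling A
A_op = LinearOperator((n, n), matvec=apply_A, rmatvec=apply_At)
```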
% density derivative
Another derivative that we can consider for the advection scheme is that w.r.t. the previous
density state, i.e. $d_i(t)$, which is $d_i$ in the shortened notation.
$\partial \mathcal P / \partial d_i$ for cell $i$ from above gives $1 + \frac{u_i \Delta t }{ \Delta x}$. However, for the full gradient we'd need to add the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities. This derivative will come into play in the next section.
@ -337,23 +338,23 @@ state. This can involve a large number of evaluations of our advection scheme vi
This sounds challenging at first:
e.g., one could try to insert equation {eq}`eq:advection` at time $t-\Delta t$
into equation {eq}`eq:advection` at $t$ and repeat this process recursively until
into equation {eq}`eq:advection` at time $t$ and repeat this process recursively until
we have a single expression relating $d^{~0}$ to the targets. However, thanks
to the linear nature of the Jacobians, we can treat each advection step, i.e.,
to the linear nature of the Jacobians, we treat each advection step, i.e.,
each invocation of our PDE $\mathcal P$ as a separate, modular
operation. And each of these invocations follows the procedure described
in the previous section.
Hence, given the machinery above, the backtrace is fairly simple to realize:
Given the machinery above, the backtrace is fairly simple to realize:
for each of the advection steps
in $\mathcal P$ we can compute a Jacobian product with the _incoming_ vector of derivatives
in $\mathcal P$ we compute a Jacobian product with the _incoming_ vector of derivatives
from the loss $L$ or a previous advection step. We repeat this until we have traced the chain from the
loss with $d^{\text{target}}$ all the way back to $d^{~0}$.
Theoretically, the velocity $\mathbf{u}$ could be a function of time, in which
case we'd get a gradient $\Delta \mathbf{u}(t)$ for every time step $t$. To simplify things
Theoretically, the velocity $\mathbf{u}$ could be a function of time like $d$, in which
case we'd get a gradient $\Delta \mathbf{u}(t)$ for every time step $t$. However, to simplify things
below, let's assume we have a field that is constant in time, i.e., we're
reusing the same velocities $\mathbf{u}$ for every advection via $\mathcal P$. Now, each time step
will give us a contribution to $\Delta \mathbf{u}$ which we can accumulate for all steps.
will give us a contribution to $\Delta \mathbf{u}$ which we accumulate for all steps.
$$ \begin{aligned}
\Delta \mathbf{u} =&
@ -375,12 +376,15 @@ $$ \begin{aligned}
Here the last term above contains the full backtrace of the marker density to time $t^0$.
The terms of this sum look unwieldy
at first, but they contain a lot of similar Jacobians, and in practice can be computed efficiently
by backtracing through the sequence of computational steps in the forward evaluation of our PDE.
at first, but looking closely, each line simply adds an additional Jacobian for one time step on the left hand side.
This follows from the chain rule, as shown in the two-operator case above.
So the terms of the sum contain a lot of similar Jacobians, and in practice can be computed efficiently
by backtracing through the sequence of computational steps that resulted from the forward evaluation of our PDE.
(Note that, as mentioned above, we've omitted the transpose of the Jacobians here.)
This structure also makes clear that the process is very similar to the regular training
process of an NN: the evaluations of these Jacobian vector products is exactly what
a deep learning framework does for training an NN (we just have weights $\theta$ instead
process of an NN: the evaluation of these Jacobian vector products from nested function calls
is exactly what a deep learning framework does for training an NN (we just have weights $\theta$ instead
of a velocity field there). And hence all we need to do in practice is to provide a custom
function that computes the Jacobian vector product for $\mathcal P$.
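In TensorFlow, for example, this amounts to wrapping the solver step with `tf.custom_gradient`; below is a minimal sketch, where `solver_step` and its hand-written adjoints are hypothetical placeholders for an actual implementation:

```python
import tensorflow as tf

@tf.custom_gradient
def P(d, u):
    d_next = solver_step(d, u)        # forward evaluation of one step of P
    def grad(a):                      # a: incoming vector of derivatives
        # return the vector-Jacobian products (dP/dd)^T a and (dP/du)^T a
        return adjoint_d(a, d, u), adjoint_u(a, d, u)
    return d_next, grad
```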
@ -401,25 +405,27 @@ For fluids, we typically have
$\mathbf{u}^{n} = \mathbf{u} - \nabla p$, with
$\nabla^2 p = \nabla \cdot \mathbf{u}$. Here, $\mathbf{u}^{n}$ denotes the _new_, divergence-free
velocity field. This step is typically crucial to enforce the hard-constraint $\nabla \cdot \mathbf{u}=0$,
and is often called _Chorin Projection_, or _Helmholtz decomposition_. It is also closely related to the fundamental theorem of vector calculus.
and also goes under the name of _Chorin Projection_, or _Helmholtz decomposition_.
It is a direct consequence of the fundamental theorem of vector calculus.
If we now introduce an NN that modifies $\mathbf{u}$ in a solver, we inevitably have to
backpropagate through the Poisson solve. I.e., we need a gradient for $\mathbf{u}^{n}$, which in this
notation takes the form $\partial \mathbf{u}^{n} / \partial \mathbf{u}$.
In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left( (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients are simple to compute. The main difficulty lies in obtaining the
matrix inverse $(\nabla^2)^{-1}$ from Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear).
In combination, we aim for computing $\mathbf{u}^{n} = \mathbf{u} - \nabla \left( (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients are simple to compute. The main difficulty lies in obtaining the
matrix inverse $(\nabla^2)^{-1}$ from Poisson's equation (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear).
In practice, the matrix vector product for $(\nabla^2)^{-1} b$ with $b=\nabla \cdot \mathbf{u}$ is not explicitly computed via matrix operations, but approximated with a (potentially matrix-free) iterative solver. E.g., conjugate gradient (CG) methods are a very popular choice here. Thus, we could treat this iterative solver as a function $S$,
with $p = S(\nabla \cdot \mathbf{u})$. Note that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $S$ into a sequence of simpler operations $S(x) = S_n( S_{n-1}(...S_{1}(x)))$, and backpropagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and can be very slow.
As mentioned above, by default DL frameworks store the internal states for every differentiable operator like the $S_i()$ in this example, and hence we'd organize and keep $n$ intermediate states in memory. These states are completely uninteresting for our original PDE, though. They're just intermediate states of the CG solver.
In practice, the matrix vector product for $(\nabla^2)^{-1} b$ with $b=\nabla \cdot \mathbf{u}$ is not explicitly computed via matrix operations, but approximated with a (potentially matrix-free) iterative solver. E.g., conjugate gradient (CG) methods are a very popular choice here. Thus, we theoretically could treat this iterative solver as a function $\mathcal{S}$,
with $p = \mathcal{S}(\nabla \cdot \mathbf{u})$.
It's worth noting that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $\mathcal{S}$ into a sequence of simpler operations over the course of all solver iterations as $\mathcal{S}(x) = \mathcal{S}_n( \mathcal{S}_{n-1}(...\mathcal{S}_{1}(x)))$, and backpropagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and will be very slow.
As mentioned above, by default DL frameworks store the internal states for every differentiable operator like the $\mathcal{S}_i()$ in this example, and hence we'd organize and keep a potentially huge number of intermediate states in memory. These states are completely uninteresting for our original PDE, though. They're just intermediate states of the CG solver.
If we take a step back and look at $p = (\nabla^2)^{-1} b$, its gradient $\partial p / \partial b$
is just $((\nabla^2)^{-1})^T$. And in this case, $(\nabla^2)$ is a symmetric matrix, and so $((\nabla^2)^{-1})^T=(\nabla^2)^{-1}$. This is the identical inverse matrix that we encountered in the original equation above, and hence we can re-use our unmodified iterative solver to compute the gradient. We don't need to take it apart and slow it down by storing intermediate states. However, the iterative solver computes the matrix-vector-products for $(\nabla^2)^{-1} b$. So what is $b$ during backpropagation? In an optimization setting we'll always have our loss function $L$ at the end of the forward chain. The backpropagation step will then give a gradient for the output, let's assume it is $\partial L/\partial p$ here, which needs to be propagated to the earlier operations of the forward pass. Thus, we can simply invoke our iterative solve during the backward pass to compute $\partial p / \partial b = S(\partial L/\partial p)$. And assuming that we've chosen a good solver as $S$ for the forward pass, we get exactly the same performance and accuracy in the backwards pass.
is just $((\nabla^2)^{-1})^T$. And in this case, $(\nabla^2)$ is a symmetric matrix, and so $((\nabla^2)^{-1})^T=(\nabla^2)^{-1}$. This is the identical inverse matrix that we encountered in the original equation above, and hence we re-use our unmodified iterative solver to compute the gradient. We don't need to take it apart and slow it down by storing intermediate states. However, the iterative solver computes the matrix-vector-products for $(\nabla^2)^{-1} b$. So what is $b$ during backpropagation? In an optimization setting we'll always have our loss function $L$ at the end of the forward chain. The backpropagation step will then give a gradient for the output; let's assume it is $\partial L/\partial p$ here, which needs to be propagated to the earlier operations of the forward pass. Thus, we simply invoke our iterative solver during the backward pass to compute $\partial L / \partial b = \mathcal{S}(\partial L/\partial p)$. And assuming that we've chosen a good solver as $\mathcal{S}$ for the forward pass, we get exactly the same performance and accuracy in the backward pass.
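The following sketch illustrates this re-use for a small symmetric system; the 1D Laplacian here is merely a stand-in for the actual pressure matrix:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 64
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()  # SPD stand-in

def solve(b):                    # forward pass: p = A^{-1} b via CG
    p, _ = cg(A, b)
    return p

def solve_vjp(grad_p):           # backward pass: A is symmetric, so the
    grad_b, _ = cg(A, grad_p)    # very same solver yields (A^{-1})^T grad_p
    return grad_b
```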
If you're interested in a code example, the [differentiate-pressure example]( https://github.com/tum-pbs/PhiFlow/blob/master/demos/differentiate_pressure.py) of phiflow uses exactly this process for an optimization through a pressure projection step: a flow field that is constrained on the right side, is optimized for the content on the left, such that it matches the target on the right after a pressure projection step.
The main take-away here is: it is important _not to blindly backpropagate_ through the forward computation, but to think about which steps of the analytic equations for the forward pass to compute gradients for. In cases like the above, we can often find improved analytic expressions for the gradients, which we can then compute numerically.
The main take-away here is: it is important _not to blindly backpropagate_ through the forward computation, but to think about which steps of the analytic equations for the forward pass to compute gradients for. In cases like the above, we can often find improved analytic expressions for the gradients, which we then approximate numerically.
```{admonition} Implicit Function Theorem & Time
:class: tip
@ -429,12 +435,10 @@ The process above essentially yields an _implicit derivative_. Instead of explic
**Time**: we _can_ actually consider the steps of an iterative solver as a virtual "time",
and backpropagate through these steps. In line with other DP approaches, this enables an NN to _interact_ with an iterative solver. An example is to learn initial guesses for CG solvers {cite}`um2020sol`.
[Details can be found here](https://github.com/tum-pbs/CG-Solver-in-the-Loop)
[Details and code can be found here.](https://github.com/tum-pbs/CG-Solver-in-the-Loop)
```
---
## Summary of differentiable physics so far
To summarize, using differentiable physical simulations
@ -444,9 +448,13 @@ this makes it possible to let NNs seamlessly interact with physical solvers.
We'd previously fully discard our physical model and solver
once the NN is trained: in the example from {doc}`physicalloss-code`
the NN gives us the solution directly, bypassing
any solver or model equation. With the DP approach we can train an NN alongside
the NN gives us the solution directly, bypassing any solver or model equation.
The DP approach substantially differs from the physics-informed NNs (v2) from {doc}`physicalloss`;
it has more in common with the controlled discretizations (v1), which are essentially a subset, or partial
application, of DP training.
However, in contrast to both residual approaches, DP makes it possible to train an NN alongside
a numerical solver, and thus we can make use of the physical model (as represented by
the solver) later on at inference time. This allows us to move beyond solving single
inverse problems, and can yield NNs that quite robustly generalize to new inputs.
Let's revisit this sample problem in the context of DPs.
inverse problems, and yields NNs that quite robustly generalize to new inputs.
Let's revisit the example problem from {doc}`physicalloss-code` in the context of DPs.

View File

@ -154,7 +154,7 @@ The third graph on the right side of figure {numref}`hig-toy-example-bad` shows
Note that for all examples so far, we've improved upon the _differentiable physics_ (DP) training from the previous chapters. I.e., we've focused on combinations of neural networks and PDE solving operators. The latter need to be differentiable for training with regular SGD, as well as for HIG-based training.
In contrast, for training with physical gradients (from {doc}`physgrad`), we even needed to provide a full inverse solver. As shown there, this has advantages, but differentiates PGs from DP and HIGs. Thus, the HIGs share more similarities with, e.g., {doc}`diffphys-code-sol` and {doc}`diffphys-control`, than with the example {doc}`physgrad-code`.
In contrast, for training with physical gradients (from {doc}`physgrad`), we even needed to provide a full inverse solver. As shown there, this has advantages, but differentiates PGs from DP and HIGs. Thus, the HIGs share more similarities with, e.g., {doc}`diffphys-code-sol` and {doc}`diffphys-code-control`, than with the example {doc}`physgrad-code`.
This is a good time to give a specific code example of how to train physical NNs with HIGs: we'll look at a classic case, a system of coupled oscillators.

View File

@ -8,7 +8,7 @@
"source": [
"# Controlling Burgers' Equation with Reinforcement Learning\n",
"\n",
"In the following, we will target inverse problems with Burgers equation as a testbed for reinforcement learning (RL). The setup is similar to the inverse problems previously targeted with differentiable physics (DP) training (cf. {doc}`diffphys-control`), and hence we'll also directly compare to these approaches below. Similar to before, Burgers equation is simple but non-linear with interesting dynamics, and hence a good starting point for RL experiments. In the following, the goal is to train a control force estimator network that should predict the forces needed to generate smooth transitions between two given states. \n",
"In the following, we will target inverse problems with Burgers equation as a testbed for reinforcement learning (RL). The setup is similar to the inverse problems previously targeted with differentiable physics (DP) training (cf. {doc}`diffphys-code-control`), and hence we'll also directly compare to these approaches below. Similar to before, Burgers equation is simple but non-linear with interesting dynamics, and hence a good starting point for RL experiments. In the following, the goal is to train a control force estimator network that should predict the forces needed to generate smooth transitions between two given states. \n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/reinflearn-code.ipynb)\n",
"\n"
]
@ -40,7 +40,7 @@
"This example uses the reinforcement learning framework [stable_baselines3](https://github.com/DLR-RM/stable-baselines3) with [PPO](https://arxiv.org/abs/1707.06347v2) as reinforcement learning algorithm.\n",
"For the physical simulation, version 1.5.1 of the differentiable PDE solver [\u03a6<sub>Flow</sub>](https://github.com/tum-pbs/PhiFlow) is used. \n",
"\n",
"After the RL training is completed, we'll additionally compare it to a differentiable physics approach using a \"control force estimator\" (CFE) network from {doc}`diffphys-control` (as introduced by {cite}`holl2019pdecontrol`)."
"After the RL training is completed, we'll additionally compare it to a differentiable physics approach using a \"control force estimator\" (CFE) network from {doc}`diffphys-code-control` (as introduced by {cite}`holl2019pdecontrol`)."
]
},
{
@ -196,7 +196,7 @@
"source": [
"## Training via reinforcement learning\n",
"\n",
"Next we set up the RL environment. The PPO approach uses a dedicated value estimator network (the \"critic\") to predict the sum of rewards generated from a certain state. These predicted rewards are then used to update a policy network (the \"actor\") which, analogously to the CFE network of {doc}`diffphys-control`, predicts the forces to control the simulation."
"Next we set up the RL environment. The PPO approach uses a dedicated value estimator network (the \"critic\") to predict the sum of rewards generated from a certain state. These predicted rewards are then used to update a policy network (the \"actor\") which, analogously to the CFE network of {doc}`diffphys-code-control`, predicts the forces to control the simulation."
]
},
{
@ -462,7 +462,7 @@
"source": [
"## Differentiable physics training\n",
"\n",
"To classify the results of the reinforcement learning method, we now compare them to an approach using differentiable physics training. In contrast to the full approach from {doc}`diffphys-control` which includes a second _OP_ network, we aim for a direct control here. The OP network represents a separate \"physics-predictor\", which is omitted here for fairness when comparing with the RL version.\n",
"To classify the results of the reinforcement learning method, we now compare them to an approach using differentiable physics training. In contrast to the full approach from {doc}`diffphys-code-control` which includes a second _OP_ network, we aim for a direct control here. The OP network represents a separate \"physics-predictor\", which is omitted here for fairness when comparing with the RL version.\n",
"\n",
"The DP approach has access to the gradient data provided by the differentiable solver, making it possible to trace the loss over multiple time steps and enabling the model to comprehend long term effects of generated forces better. The reinforcement learning algorithm, on the other hand, is not limited by training set size like the DP algorithm, as new training samples are generated on policy. However, this also introduces additional simulation overhead during training, which can increase the time needed for convergence. "
]

View File

@ -1,7 +1,7 @@
Introduction to Reinforcement Learning
=======================
Deep reinforcement learning, referred to as just _reinforcement learning_ (RL) from now on, is a class of methods in the larger field of deep learning that lets an artificial intelligence agent explore the interactions with a surrounding environment. While doing this, the agent receives reward signals for its actions and tries to discern which actions contribute to higher rewards, to adapt its behavior accordingly. RL has been very successful at playing games such as Go {cite}`silver2017mastering`, and it bears promise for engineering applications such as robotics.
Deep reinforcement learning, which we'll just call _reinforcement learning_ (RL) from now on, is a class of methods in the larger field of deep learning that lets an artificial intelligence agent explore the interactions with a surrounding environment. While doing this, the agent receives reward signals for its actions and tries to discern which actions contribute to higher rewards, to adapt its behavior accordingly. RL has been very successful at playing games such as Go {cite}`silver2017mastering`, and it bears promise for engineering applications such as robotics.
The setup for RL generally consists of two parts: the environment and the agent. The environment receives actions $a$ from the agent while supplying it with observations in the form of states $s$, and rewards $r$. The observations represent the fraction of the information from the respective environment state that the agent is able to perceive. The rewards are given by a predefined function, usually tailored to the environment, and might contain, e.g., a game score, a penalty for wrong actions, or a bounty for successfully finished tasks.
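This interaction can be summarized in a few lines of gym-style pseudo-Python, where `env` and `agent` are placeholders for the environment and the learned policy:

```python
obs = env.reset()                                 # initial observation of s
done = False
while not done:
    action = agent.predict(obs)                   # the agent chooses an action a
    obs, reward, done, info = env.step(action)    # env returns new s and r
```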