diff --git a/diffphys-code-burgers.ipynb b/diffphys-code-burgers.ipynb index 9b5ba45..b6ffb9f 100644 --- a/diffphys-code-burgers.ipynb +++ b/diffphys-code-burgers.ipynb @@ -180,9 +180,9 @@ "source": [ "## Optimization \n", "\n", - "Based on the gradient, we can now take a step in the opposite direction to bring the loss down (instead of increasing it). We're using a learning rate `LR=5` for this step. We're also re-evaluating the loss for the updated state to check how we did. \n", + "Equipped with the gradient, we can run a gradient descent optimization. Below, we're using a learning rate of `LR=5`, and we're re-evaluating the loss for the updated state to track convergence. \n", "\n", - "In the following code block, we're additionally saving all these gradients in a list called `grads`, such that we can visualize them later on. (Normally, we could discard each gradient after performing an update step for `velocity`.)\n" + "In the following code block, we're additionally saving the gradients in a list called `grads`, such that we can visualize them later on. For a regular optimization, we could of course discard the gradient after performing an update of the velocity.\n" ] }, { @@ -268,7 +268,7 @@ "source": [ "This seems to be going in the right direction! It's definitely not perfect, but we've only computed 5 GD update steps so far. The two peaks with a positive velocity on the left side of the shock and the negative peak on the right side are starting to show.\n", "\n", - "This is a good indicator that the backpropagation of gradients through all of our 16 simulated steps is behaving correctly, and that it's driving the solution in the right direction. This hints at how powerful this setup is: the gradient that we obtain from each of the simulation steps (and each operation within them) can easily be chained together into more complex sequences. 
In the example above, we're backpropagating through all 16 steps of the simulation, and we could easily enlarge this \"look-ahead\" of the optimization with minor changes to the code.\n", + "This is a good indicator that the backpropagation of gradients through all of our 16 simulated steps is behaving correctly, and that it's driving the solution in the right direction. The graph above only hints at how powerful the setup is: the gradient that we obtain from each of the simulation steps (and each operation within them) can easily be chained together into more complex sequences. In the example above, we're backpropagating through all 16 steps of the simulation, and we could easily enlarge this \"look-ahead\" of the optimization with minor changes to the code.\n", "\n" ] }, @@ -404,7 +404,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Naturally, this is a tougher task: the optimization receives direct feedback what the state at $t=0.5$ should look like, but due to the non-linear model equation, we typically have a large number of solutions that exactly or numerically very closely satisfy the constraints. Hence, our minimizer not necessarily finds the exact state we started from. However, it's still quite close in this Burgers scenario.\n", + "Naturally, this is a tougher task: the optimization receives direct feedback about what the state at $t=0.5$ should look like, but due to the non-linear model equation, we typically have a large number of solutions that exactly or numerically very closely satisfy the constraints. Hence, our minimizer does not necessarily find the exact state we started from. 
However, it's still quite close in this Burgers scenario.\n", "\n", "Before measuring the overall error of the reconstruction, let's visualize the full evolution of our system over time as this also yields the solution in the form of a numpy array that we can compare to the other versions:" ] @@ -506,7 +506,7 @@ "source": [ "It's quite clearly visible here that the PINN solution (in the middle) recovers the overall shape of the solution, hence the temporal constraints are at least partially fulfilled. However, it doesn't manage to capture the amplitudes of the GT solution very well.\n", "\n", - "The reconstruction from the optimization with a differentiable solver (at the bottom) is much closer to the ground truth thanks to an improved flow of gradients over the whole course of the sequence. In addition, it can leverage the grid-based discretization for both forwards as well as backwards passes, and in this way provide a more accurate signal to the unknown initial state. It is nonetheless visible, that the reconstruction lacks certain \"sharper\" features of the GT version, e.g., visible in the bottom left corner of the solution image.\n", + "The reconstruction from the optimization with a differentiable solver (at the bottom) is much closer to the ground truth thanks to an improved flow of gradients over the whole course of the sequence. In addition, it can leverage the grid-based discretization for both forward as well as backward passes, and in this way provide a more accurate signal to the unknown initial state. 
It is nonetheless visible that the reconstruction lacks certain \"sharper\" features of the GT version, e.g., visible in the bottom left corner of the solution image.\n", "\n", "Let's quantify these errors over the whole sequence:" ] }, @@ -556,7 +556,7 @@ "\n", "This difference also shows clearly in the jointly visualized image at the bottom: the magnitudes of the errors of the DP reconstruction are much closer to zero, as indicated by the purple color above.\n", "\n", - "A simple direct reconstruction problem like this one is always a good initial test for a DP solver, e.g., before moving to more complex setups like coupling it with an NN. If the direct optimization does not converge, there's probably still something fundamentally wrong, and there's no point involving an NN. \n", + "A simple direct reconstruction problem like this one is always a good initial test for a DP solver. It can be tested independently before moving on to more complex setups, e.g., coupling it with an NN. If the direct optimization does not converge, there's probably still something fundamentally wrong, and there's no point involving an NN. \n", "\n", "Now we have a first example to show similarities and differences of the two approaches. In the next section, we'll present a discussion of the findings so far, before moving to more complex cases in the following chapter.\n", "\n", diff --git a/diffphys-discuss.md b/diffphys-discuss.md index 48c867f..d8f0bad 100644 --- a/diffphys-discuss.md +++ b/diffphys-discuss.md @@ -1,9 +1,9 @@ Discussion ======================= -The training via differentiable physics as described so far allows us +To summarize, the training via differentiable physics (DP) as described so far allows us to integrate full numerical simulations into the training of deep neural networks. -As a consequence, this let's the networks learn to interact with these simulations. +As a consequence, this lets the networks learn to _interact_ with these simulations. 
While we've only hinted at what could be achieved via DP approaches it is nonetheless a good time to discuss some additional properties, and summarize the pros and cons. @@ -14,11 +14,14 @@ additional properties, and summarize the pros and cons. ## Time steps and iterations -When using differentiable physics (DP) approaches for learning application, -there is a large amount of flexibility w.r.t. the combination of DP and NN building blocks. +When using DP approaches for learning applications, +there is a lot of flexibility w.r.t. the combination of DP and NN building blocks. +As some of the differences are subtle, the following section will go into more detail. -Just as a reminder, this is the previously shown _overview_ figure to illustrate the combination -of NNs and DP operators. Here, these operators look like a loss term: they typically don't have weights, +**XXX** + +To recap, this is the previous figure illustrating NNs with DP operators. +Here, these operators look like a loss term: they typically don't have weights, and only provide a gradient that influences the optimization of the NN weights: ```{figure} resources/diffphys-shortened.jpg @@ -96,4 +99,4 @@ To summarize the pros and cons of training NNs via DP: Here, the last negative point (regarding heavy machinery) is bound to strongly improve in a fairly short amount of time. However, for now it's important to keep in mind that not every simulator is suitable for DP training out of the box. Hence, in this book we'll focus on examples using phiflow, which was designed for interfacing with deep learning frameworks. Next we can target more some complex scenarios to showcase what can be achieved with differentiable physics. -This will also illustrate how the right selection of a numerical methods for a DP operator yields improvements in terms of training accuracy. +This will also illustrate how the right selection of numerical methods for a DP operator yields improvements in terms of training accuracy. 
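(Editorial aside, not part of the patch.) The notebook hunks above describe the gradient descent loop in prose: learning rate `LR=5`, five GD update steps, and gradients collected in a list `grads`. A minimal self-contained sketch of that loop is shown below, using plain numpy with a toy advection step and finite-difference gradients as stand-ins for phiflow's Burgers solver and its autodiff backpropagation; all names and functions here are illustrative, not the notebook's actual API:

```python
import numpy as np

# Toy stand-in for the notebook's phiflow Burgers solver: advance a 1D
# velocity field for a fixed number of steps (illustrative only).
def simulate(velocity, steps=16, dt=0.03125):
    v = velocity.copy()
    for _ in range(steps):
        v = v - dt * v * np.gradient(v)  # simplified advection-style update
    return v

def loss_and_grad(velocity, target, eps=1e-4):
    # Mean squared error at the final time step; the gradient is estimated
    # with finite differences here, standing in for backpropagation
    # through all 16 simulation steps.
    base = np.mean((simulate(velocity) - target) ** 2)
    grad = np.zeros_like(velocity)
    for i in range(velocity.size):
        pert = velocity.copy()
        pert[i] += eps
        grad[i] = (np.mean((simulate(pert) - target) ** 2) - base) / eps
    return base, grad

LR = 5.0                               # learning rate, as in the notebook text
x = np.linspace(-1.0, 1.0, 32)
target = simulate(-np.sin(np.pi * x))  # synthetic "ground truth" end state
velocity = np.zeros_like(x)            # unknown initial state to optimize

losses, grads = [], []                 # keep gradients around for visualization
for _ in range(5):                     # 5 GD update steps, as in the notebook
    loss, grad = loss_and_grad(velocity, target)
    losses.append(loss)
    grads.append(grad)
    velocity = velocity - LR * grad    # gradient descent update
```

With a mean-based loss, `LR=5` yields a stable contraction of the residual in this toy setup; the stored `grads` can then be plotted step by step, as the notebook does.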
diff --git a/diffphys.md b/diffphys.md index c26afdc..c92200a 100644 --- a/diffphys.md +++ b/diffphys.md @@ -362,9 +362,9 @@ $$ \begin{aligned} \frac{ \partial d(t^e - \Delta t) }{ \partial \mathbf{u}} \frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) } \frac{ \partial L }{ \partial d(t^e)} - + \\ + \\ & - \cdots + \\ + + \ \cdots \ + \\ & \Big( \frac{ \partial d(t^0) }{ \partial \mathbf{u}} \cdots \frac{ \partial d(t^e - \Delta t) }{ \partial d(t^e - 2 \Delta t) }
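(Editorial aside, not part of the patch.) The hunk above only reflows the chained chain-rule terms of $\partial L / \partial \mathbf{u}$. The structure itself, per-step Jacobians chained backwards with one contribution summed in at every step, can be sanity-checked on a tiny scalar toy simulator; this is not the book's solver and all functions are illustrative, but the backpropagated gradient should match a finite-difference estimate:

```python
# Toy scalar "simulation": d_{n+1} = step(d_n, u), loss L = d_N^2.
# dL/du is the sum over steps of (d step/d u) times the chained
# (d step/d d) Jacobians of all later steps -- the structure of the
# equation in the hunk above.
def step(d, u, dt=0.1):
    return d + dt * (u - d * d)

def dstep_dd(d, dt=0.1):   # partial of step w.r.t. the state d
    return 1.0 - 2.0 * dt * d

def dstep_du(dt=0.1):      # partial of step w.r.t. the parameter u
    return dt

def loss_of_u(u, d0=0.0, steps=16):
    d = d0
    for _ in range(steps):
        d = step(d, u)
    return d * d           # L(d(t^e))

def grad_backprop(u, d0=0.0, steps=16):
    ds = [d0]              # forward pass, storing all intermediate states
    for _ in range(steps):
        ds.append(step(ds[-1], u))
    adj = 2.0 * ds[-1]     # dL/dd at the final state
    g = 0.0
    for n in reversed(range(steps)):
        g += dstep_du() * adj      # contribution of step n's u-dependence
        adj *= dstep_dd(ds[n])     # chain one more Jacobian backwards
    return g

u = 0.7
g = grad_backprop(u)
eps = 1e-6
fd = (loss_of_u(u + eps) - loss_of_u(u - eps)) / (2 * eps)  # finite-diff check
```

The backward loop mirrors the equation: the adjoint `adj` accumulates the product of $\partial d(t^{n+1}) / \partial d(t^n)$ factors, and each step adds its own $\partial d / \partial \mathbf{u}$ term.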