numerous smaller fixes in physgrad chapter

This commit is contained in:
parent 6427365d55
commit 429baed362

intro.md (11 changes)
@@ -7,7 +7,7 @@ name: pbdl-logo-large
---
```

Welcome to the _Physics-based Deep Learning Book_ (v0.1) 👋
Welcome to the _Physics-based Deep Learning Book_ (v0.2) 👋

**TL;DR**:
This document contains a practical and comprehensive introduction to everything
@@ -15,11 +15,18 @@ related to deep learning in the context of physical simulations.
As much as possible, all topics come with hands-on code examples in the
form of Jupyter notebooks to quickly get started.
Beyond standard _supervised_ learning from data, we'll look at _physical loss_ constraints,
more tightly coupled learning algorithms with _differentiable simulations_, as well as
more tightly coupled learning algorithms with _differentiable simulations_,
training algorithms tailored to physics problems,
as well as
reinforcement learning and uncertainty modeling.
We live in exciting times: these methods have a huge potential to fundamentally
change what computer simulations can achieve.

```{note}
_What's new in v0.2?_
For readers familiar with v0.1 of this text, the brand new chapter on improved learning methods for physics problems is highly recommended; it starts with {doc}`physgrad`.
```

---

## Coming up

@@ -107,6 +107,10 @@ The key aspects that we will address in the following are:
- explain how to use deep learning techniques to solve PDE problems,
- how to combine them with **existing knowledge** of physics,
- without **discarding** our knowledge about numerical methods.

At the same time, it's worth noting what we won't be covering:
- introductions to deep learning and numerical simulations,
- nor are we aiming for a broad survey of research articles in this area.
```

The resulting methods have a huge potential to improve
File diff suppressed because one or more lines are too long
@@ -21,17 +21,14 @@
"Specifically, we'll use the following $\\mathbf{y}$ and $L$:\n",
"\n",
"$\\quad \\mathbf{y}(\\mathbf{x}) = \\mathbf{y}(x_0,x_1) = \\begin{bmatrix} x_0 \\\\ x_1^2 \\end{bmatrix}$, \n",
"\n",
"i.e. $\\mathbf{y}$ only squares the second component of its input, and\n",
"\n",
"$\\quad L(\\mathbf{y}) = |\\mathbf{y}|^2 = y_0^2 + y_1^2 \\ $ \n",
"$L(\\mathbf{y}) = |\\mathbf{y}|^2 = y_0^2 + y_1^2 \\ $ \n",
"represents a simple squared $L^2$ loss.\n",
"\n",
"As starting point for some example optimizations we'll use \n",
"$\\mathbf{x} = \\begin{bmatrix} \n",
" 3 \\\\ 3\n",
"\\end{bmatrix}$ as initial guess for solving the following simple minimization problem:\n",
"\n",
"$\\quad \\text{arg min}_{\\mathbf{x}} \\ L(\\mathbf{x}).$\n",
"\\end{bmatrix}$ as initial guess for solving the following simple minimization problem: $\\text{arg min}_{\\mathbf{x}} \\ L(\\mathbf{x}).$\n",
"\n",
"For us as humans it's quite obvious that $[0 \\ 0]^T$ is the right answer, but let's see how quickly the different optimization algorithms discussed in the previous section can find that solution. And while $\\mathbf{y}$ is a very simple function, it is nonlinear due to its $x_1^2$.\n",
"\n",
@@ -57,7 +54,7 @@
"height: 150px\n",
"name: three-spaces\n",
"---\n",
"We're targeting inverse problems to retrieve an entry in $\\mathbf x$ from a loss computed in terms of output from a physics simulator $\\mathbf y$. Hence in a forward pass, we transformm from $\\mathbf x$ to $\\mathbf y$, and then compute a loss $L$. The backwards pass transforms back to $\\mathbf x$. Thus, the accuracy in terms of $\\mathbf x$ is the most crucial one, but we can likewise track progress of an optimization in terms of $\\mathbf y$ and $L$.\n",
"We're targeting inverse problems to retrieve an entry in $\\mathbf x$ from a loss computed in terms of output from a physics simulator $\\mathbf y$. Hence in a forward pass, we transform from $\\mathbf x$ to $\\mathbf y$, and then compute a loss $L$. The backwards pass transforms back to $\\mathbf x$. Thus, the accuracy in terms of $\\mathbf x$ is the most crucial one, but we can likewise track progress of an optimization in terms of $\\mathbf y$ and $L$.\n",
"```\n",
"\n"
]
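To make the comparison in this cell concrete, here is a small, self-contained sketch of plain gradient descent and a Newton step for this toy setup ($\mathbf{y}(x_0,x_1)=[x_0, x_1^2]^T$, $L=|\mathbf{y}|^2$, start at $[3\ 3]^T$). It is not the notebook's own code, just a NumPy approximation of the updates discussed here; the step size and iteration count are arbitrary.

```python
import numpy as np

def L_grad(x):                        # dL/dx for L = x0^2 + x1^4
    return np.array([2.0 * x[0], 4.0 * x[1] ** 3])

def L_hess(x):                        # Hessian of L
    return np.array([[2.0, 0.0], [0.0, 12.0 * x[1] ** 2]])

x_gd = x_newton = np.array([3.0, 3.0])
for _ in range(10):
    x_gd = x_gd - 0.01 * L_grad(x_gd)                                           # gradient descent
    x_newton = x_newton - np.linalg.solve(L_hess(x_newton), L_grad(x_newton))   # Newton's method

print("GD:    ", x_gd)      # progress along x_0 is slow for a step size that keeps x_1 stable
print("Newton:", x_newton)  # x_0 is solved in one step, x_1 shrinks geometrically
```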
@@ -562,7 +559,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This illustrates that the inverse simulator variant, PG in red, does even better than Newton's method in orange. It yields a trajectory that is better aligned with the ideal _diagonal_ trajectory, and its final state is closer to the origin. A key ingredient here is the inverse function for $\\mathbf{y}$, which provided higher order terms than the second-order approximation for Newton's method. Despite the simplicity of the problem, Newton's method has problems finding the right search direction. For the inverse simulator update, on the other hand, the higher order information yields an improved direction for the optimization.\n",
"This illustrates that the inverse simulator variant, PG in red, does even better than Newton's method in orange. It yields a trajectory that is better aligned with the ideal _diagonal_ trajectory, and its final state is closer to the origin. A key ingredient here is the inverse function for $\\mathbf{y}$, which provided higher order terms than the second-order approximation for Newton's method. This improves the scale-invariance of the optimization. Despite the simplicity of the problem, Newton's method has problems finding the right search direction. For the inverse simulator update, on the other hand, the higher order information yields an improved direction for the optimization.\n",
"\n",
"This difference also shows in the first update step for each method: below we measure how well it is aligned with the diagonal."
]
@@ -768,7 +765,7 @@
"The main takeaways of this section are the following.\n",
"* GD easily yields \"unbalanced\" updates, and gets stuck.\n",
"* Newton's method does better, but is far from optimal.\n",
"* the higher-order information of the invese simulator outperform both, even if it is applied only partially (we still used Newton's method for $L$ above).\n",
"* the higher-order information of the inverse simulator outperforms both, even if it is applied only partially (we still used Newton's method for $L$ above).\n",
"* Also, the methods (and in general the choice of optimizer) strongly affect progress in latent spaces, as shown for $\\mathbf{y}$ above.\n",
" \n",
"In the next sections we can build on these observations to use PGs for training NNs via invertible physical models."
@@ -1,7 +1,7 @@
Discussion of Improved Gradients
=======================

At this point it's a good time to take another step back, and assess the different methods of the previous chapters. For deep learning applications, we can broadly distinguish three approaches: the _regular_ differentiable physics (DP) training, the training with half-inverse gradients (HIGs), and using the scale-invariant physics updates (SIPs). Unfortunately, we can't simply discard two of them, and focus on a single approach for all future endeavours. However, discussing the pros and cons sheds light on some fundamental aspects of physics-based deep learning.
At this point it's a good time to take another step back, and assess the different methods introduced so far. For deep learning applications, we can broadly distinguish three approaches: the _regular_ differentiable physics (DP) training, the training with half-inverse gradients (HIGs), and using the scale-invariant physics updates (SIPs). Unfortunately, we can't simply discard two of them, and focus on a single approach for all future endeavours. However, discussing the pros and cons sheds light on some fundamental aspects of physics-based deep learning.

![Divider](resources/divider7.jpg)

@@ -9,7 +9,7 @@ At this point it's a good time to take another step back, and assess the differe

First and foremost, a central motivation for improved updates is the need to address the scaling issues of the learning problems. This is not a completely new problem: numerous deep learning algorithms were proposed to address these for training NNs. However, the combination of NNs with physical simulations brings new challenges that at the same time provide new angles to tackle this problem. On the negative side, we have additional, highly non-linear operators from the PDE models. On the positive side, these operators typically do not have free parameters during learning, and thus can be treated with different, tailored methods.

This is exactly where HIGs and SIPs come in: instead of treating the physical simulation like the rest of the NNs (this is the DP approach), they show how much can be achieved with custom inverse solvers (SIPs) or a custom numerical inversion (HIGs).
This is exactly where HIGs and SIPs come in: instead of treating the physical simulation like the rest of the NNs (this is the DP approach), they show how much can be achieved with custom inverse solvers (SIPs) or a custom numerical inversion (HIGs). Both methods make important steps towards _scale-invariant_ training.

## Computational Resources

@@ -40,7 +40,7 @@ Even when the inversion is only done for the physics simulation component (as wi

❌ Con SIP:
- Require inverse simulators (at least local ones).
- Less wide-spread availability than, e.g., differentiable physics simulators.
- Only makes the physics component scale-invariant.

---

@@ -56,10 +56,10 @@ The HIGs on the other hand, go back to first order information in the form of Ja

---

However, in both cases, the resulting models can give a performance that we simply can't obtain by, e.g., training longer with a simpler DP or supervised approach. So, if we plan to evaluate these models often, e.g., shipping them in an application, this increased one-time cost can pay off in the long run.
In both cases, the resulting neural networks can yield a performance that we simply can't obtain by, e.g., training longer with a simpler DP or supervised approach. So, if we plan to evaluate these models often, e.g., shipping them in an application, this increased one-time cost will pay off in the long run.

This concludes the chapter on improved learning methods for physics-based NNs.
It's clearly an active topic of research, with plenty of room for new methods, but the algorithms here already
indicate the potential of tailored learning algorithms for physical problems.
This also concludes the focus on numerical simulations as DL components. In the next chapter, we'll instead
focus on a different statistical viewpoint, the inclusion of uncertainty.
focus on a different statistical viewpoint, namely the inclusion of uncertainty.
File diff suppressed because one or more lines are too long
@@ -18,7 +18,7 @@ As a drawback, HIGs:
- require an SVD for a large Jacobian matrix,
- and are based on first-order information (similar to regular gradients).

Howver, in contrast to regular gradients, they use the full Jacobian matrix. So as we'll see below, they typically outperform regular SGD and Adam significantly.
However, in contrast to regular gradients, they use the full Jacobian matrix. So as we'll see below, they typically outperform regular GD and Adam significantly.

```
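To give a concrete picture of what an SVD-based (half-)inversion of a stacked Jacobian can look like, here is a small NumPy sketch. It only illustrates the construction hinted at here, with per-sample Jacobians concatenated into one matrix, singular values raised to the power $-1/2$, and small singular values truncated; it is not the reference HIG implementation, and all names, shapes and thresholds are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample Jacobians dy_i/dtheta for a mini-batch of 4 samples,
# each mapping 10 network parameters to a 2D simulator output.
jacobians = [rng.normal(size=(2, 10)) for _ in range(4)]
grad_y = rng.normal(size=(4 * 2,))        # stacked dL/dy entries for the batch

J = np.concatenate(jacobians, axis=0)     # full batch Jacobian, shape (8, 10)
U, s, Vt = np.linalg.svd(J, full_matrices=False)
s_half_inv = np.where(s > 1e-6, s ** -0.5, 0.0)   # truncate tiny singular values
J_half_inv = Vt.T @ np.diag(s_half_inv) @ U.T     # "half-inverted" Jacobian
delta_theta = -1e-2 * J_half_inv @ grad_y         # network parameter update
print(delta_theta.shape)                          # (10,)
```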

@@ -73,7 +73,7 @@ A visual overview of the different spaces involved in HIG training. Most importa

## Constructing the Jacobian

The formulation above hides one important aspect of HIGs: the search direction we compute not only jointly takes into account the scaling of neural network and physics, but can also incorporate information from all the samples in a mini-batch. This has the advantage of finding the optimal direction (in an $L^2$ sense) to minimize the loss, instead of averaging directions as done with SGD or Adam.
The formulation above hides one important aspect of HIGs: the search direction we compute not only jointly takes into account the scaling of neural network and physics, but can also incorporate information from all the samples in a mini-batch. This has the advantage of finding the optimal direction (in an $L^2$ sense) to minimize the loss, instead of averaging directions as done with GD or Adam.

To achieve this, the Jacobian matrix for $\partial y / \partial \theta$ is concatenated from the individual Jacobians of each sample in a mini-batch. Let $x_i,y_i$ denote input and output of sample $i$ in a mini-batch, respectively, then the final Jacobian is constructed via all the
$\frac{\partial y_i}{\partial \theta}\big\vert_{x_i}$ as
@@ -119,7 +119,7 @@ We'll use a small neural network with a single hidden layer consisting of 7 neur

## Well-conditioned

Let's first look at the well-conditioned case with $\lambda=1$. In the following image, we'll compare Adam as the most popular SGD-representative, Gauss-Newton (GN) as "classical" method, and the HIGs. These methods are evaluated w.r.t. three aspects: naturally, it's interesting to see how the loss evolves. In addition, we'll consider the distribution of neuron activations from the resulting neural network states (more on that below). Finally, it's also interesting to observe how the optimization influences the resulting target states (in $y$ space) produced by the neural network. Note that the $y$-space graph below shows only a single, but fairly representative, $x,y$ pair. The other two show quantities from a larger set of validation inputs.
Let's first look at the well-conditioned case with $\lambda=1$. In the following image, we'll compare Adam as the most popular GD-representative, Gauss-Newton (GN) as a "classical" method, and the HIGs. These methods are evaluated w.r.t. three aspects: naturally, it's interesting to see how the loss evolves. In addition, we'll consider the distribution of neuron activations from the resulting neural network states (more on that below). Finally, it's also interesting to observe how the optimization influences the resulting target states (in $y$ space) produced by the neural network. Note that the $y$-space graph below shows only a single, but fairly representative, $x,y$ pair. The other two show quantities from a larger set of validation inputs.

```{figure} resources/physgrad-hig-toy-example-good.jpg
---
@@ -161,7 +161,7 @@ The third graph on the right side of figure {numref}`hig-toy-example-bad` shows

## Summary of Half-Inverse Gradients

Note that for all examples so far, we've improved upon the _differentiable physics_ (DP) training from the previous chapters. I.e., we've focused on combinations of neural networks and PDE solving operators. The latter need to be differentiable for training with regular SGD, as well as for HIG-based training.
Note that for all examples so far, we've improved upon the _differentiable physics_ (DP) training from the previous chapters. I.e., we've focused on combinations of neural networks and PDE solving operators. The latter need to be differentiable for training with regular GD, as well as for HIG-based training.

In contrast, for training with SIPs (from {doc}`physgrad-nn`), we even needed to provide a full inverse solver. As shown there, this has advantages, but differentiates SIPs from DP and HIGs. Thus, the HIGs share more similarities with, e.g., {doc}`diffphys-code-sol` and {doc}`diffphys-code-control`, than with the example {doc}`physgrad-code`.
@@ -14,7 +14,7 @@ In contrast to the previous sections and {doc}`overview-equations`, we are targe
This gives the following minimization problem with $i$ denoting the indices of a mini-batch:

$$
\text{arg min}_\theta \sum_{i} \frac 1 2 \| \mathcal P\big(f(y^*_i ; \theta)\big) - y^*_i \|_2^2
\text{arg min}_\theta \sum_{i} \frac 1 2 | \mathcal P\big(f(y^*_i ; \theta)\big) - y^*_i |_2^2
$$ (eq:unsupervised-training)

@@ -45,7 +45,7 @@ To update the weights $\theta$ of the NN $f$, we perform the following update st

* Given a set of inputs $y^*$, evaluate the forward pass to compute the NN prediction $x = f(y^*; \theta)$
* Compute $y$ via a forward simulation ($y = \mathcal P(x)$) and invoke the (local) inverse simulator $\mathcal P^{-1}(y; x)$ to obtain the step $\Delta x_{\text{PG}} = \mathcal P^{-1} (y + \eta \Delta y; x)$ with $\Delta y = y^* - y$
* Evaluate the network loss, e.g., $L = \frac 1 2 || x - \tilde x ||_2^2$ with $\tilde x = x+\Delta x_{\text{PG}}$, and perform a Newton step treating $\tilde x$ as a constant
* Evaluate the network loss, e.g., $L = \frac 1 2 | x - \tilde x |_2^2$ with $\tilde x = x+\Delta x_{\text{PG}}$, and perform a Newton step treating $\tilde x$ as a constant
* Use GD (or a GD-based optimizer like Adam) to propagate the change in $x$ to the network weights $\theta$ with a learning rate $\eta_{\text{NN}}$ (a short sketch of this loop follows below)

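The following PyTorch sketch spells out these update steps for a stand-in problem. The simulator `P`, its local inverse `P_inv`, the network architecture, and all step sizes are placeholders invented for illustration; this is not the book's SIP implementation. Since the proxy loss is a plain squared $L^2$ term, the Newton step in $x$ reduces to the difference $\tilde x - x$, so the sketch simply backpropagates that loss with Adam.

```python
import torch

# Stand-in network f: y* -> x, and its optimizer (learning rate eta_NN).
f = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

def P(x):                              # toy forward "simulator"
    return torch.stack([torch.sin(x[:, 0]), x[:, 1] ** 2], dim=-1)

def P_inv(y, x_guess):                 # toy local inverse around x_guess
    x0 = torch.asin(torch.clamp(y[:, 0], -1.0, 1.0))
    x1 = torch.sign(x_guess[:, 1]) * torch.sqrt(torch.clamp(y[:, 1], min=0.0))
    return torch.stack([x0, x1], dim=-1)

def sip_step(y_star, eta=0.5):
    x = f(y_star)                              # NN prediction
    y = P(x)                                   # forward simulation
    dy = y_star - y
    x_tilde = P_inv(y + eta * dy, x).detach()  # inverse-simulator step, treated as constant
    loss = 0.5 * ((x - x_tilde) ** 2).sum()    # proxy loss in x space
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

y_batch = torch.rand(16, 2)                    # made-up targets y*
for _ in range(10):
    sip_step(y_batch)
```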
@@ -118,9 +118,9 @@ This typically makes the learning task more difficult, as we repeatedly backprop
Let's illustrate the convergence behavior of SIP training and how it depends on characteristics of $\mathcal P$ with an example {cite}`holl2021pg`.
We consider the synthetic two-dimensional function
%$$\mathcal P(x) = \left(\frac{\sin(\hat x_1)}{\xi}, \xi \cdot \hat x_2 \right) \quad \text{with} \quad \hat x = R_\phi \cdot x$$
$$\mathcal P(x) = \left(\sin(\hat x_1) / \xi, \ \hat x_2 \cdot \xi \right) \quad \text{with} \quad \hat x = \gamma \cdot R_\phi \cdot x , $$
$$\mathcal P(x) = \left(\sin(\hat x_1) / \xi, \ \hat x_2 \cdot \xi \right) \quad \text{with} \quad \hat x = R_\phi \cdot x , $$
%
where $R_\phi \in \mathrm{SO}(2)$ denotes a rotation matrix and $\gamma > 0$.
where $R_\phi \in \mathrm{SO}(2)$ denotes a rotation matrix.
The parameters $\xi$ and $\phi$ allow us to continuously change the characteristics of the system.
The value of $\xi$ determines the conditioning of $\mathcal P$ with large $\xi$ representing ill-conditioned problems, while $\phi$ describes the coupling of $x_1$ and $x_2$. When $\phi=0$, the off-diagonal elements of the Hessian vanish and the problem factors into two independent problems.
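As a quick reference, this test function fits in a few lines of NumPy. This is only a throwaway re-implementation of the formula above for experimentation; the parameter values are arbitrary.

```python
import numpy as np

def make_P(xi, phi):
    """P(x) = (sin(x_hat_1)/xi, x_hat_2*xi) with x_hat = R_phi x."""
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    def P(x):
        xh = R @ np.asarray(x, dtype=float)
        return np.array([np.sin(xh[0]) / xi, xh[1] * xi])
    return P

P_ill = make_P(xi=10.0, phi=np.pi / 4)   # large xi: ill-conditioned; phi couples x_1 and x_2
print(P_ill([0.3, 0.3]))
```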
physgrad.md (67 changes)

@@ -1,4 +1,4 @@
Towards Gradient Inversion
Scale-Invariance and Inversion
=======================

In the following we will question some fundamental aspects of the formulations so far, namely the update step computed via gradients.
@@ -10,7 +10,9 @@ Not too surprising after this introduction: A central insight of the following c
It turns out that both supervised and DP gradients have their pros and cons, and leave room for custom methods that are aware of the physics operators.
In particular, we'll show how scaling problems of DP gradients affect NN training (as outlined in {cite}`holl2021pg`),
and revisit the problems of multi-modal solutions.
Finally, we'll explain several alternatives to prevent these issues. It turns out that a key property that is missing in regular gradients is a proper _inversion_ of the Jacobian matrix.
Finally, we'll explain several alternatives to prevent these issues.

% It turns out that a key property that is missing in regular gradients is a proper _inversion_ of the Jacobian matrix.

```{admonition} A preview of this chapter
@@ -18,13 +20,46 @@ Finally, we'll explain several alternatives to prevent these issues. It turns ou

Below, we'll proceed in the following steps:
- Show how the properties of different optimizers and the associated scaling issues can negatively affect NN training.
- Identify what is missing in our training runs with GD or Adam so far. Spoiler: it is a proper _inversion_ of the Jacobian matrix.
- We'll explain two alternatives to prevent these problems: an analytical full-, and a numerical half-inversion scheme.
- Identify the problem with our GD or Adam training runs so far. Spoiler: they're missing an _inversion_ process to make the training scale-invariant.
- We'll then explain two alternatives to alleviate these problems: an analytical full-, and a numerical half-inversion scheme.

```

% note, re-introduce multi-modality at some point...

## The crux of the matter

Before diving into the details of different optimizers, the following paragraphs should provide some intuition for why this is important. As mentioned above, all methods discussed so far have used gradients, and the main reason for moving towards different updates is that they have some fundamental scaling issues in multi-dimensional settings.

For 1D problems, this can easily be "fixed" by choosing a good learning rate, but interestingly, as soon
as we go to 2D, things become more tricky. Let's consider a very simple toy "physics" function in two dimensions, which simply applies an exponent $\alpha$ to the second component. Afterwards we're computing an $L^2$ "loss" of the result:

$$ \mathcal P(x_1,x_2) =
\begin{bmatrix}
x_1 \\
x_2^{~\alpha}
\end{bmatrix} \text{ with } L(\mathcal P) = |\mathcal P|^2
$$

For $\alpha=1$ everything is very simple: we're faced with a radially symmetric loss landscape, and $x_1$ and $x_2$ behave in the same way. The gradient $\nabla_x = (\partial L / \partial x)^T$ is perpendicular to the isolevels of the loss landscape, and hence an update with $-\eta \nabla_x$ points directly to the minimum at 0. This is the setting we're dealing with in classical deep learning scenarios, like most supervised learning cases or classification problems. This example is visualized on the left of the following figure.

```{figure} resources/physgrad-scaling.jpg
---
height: 200px
name: physgrad-scaling
---
Loss landscapes in $x$ for different $\alpha$ of the 2D example problem, with an example update step $- \nabla_x$ shown in green for each case.
```

However, within this book we're targeting _physical_ learning problems, and hence we have physical functions integrated into the learning process, as discussed at length for differentiable physics approaches. This is fundamentally different! The physics functions will pretty much always introduce a scaling of the different components. In our toy problem we can mimic this by choosing different values for $\alpha$, as shown in the middle and right graphs of the figure above.

For larger $\alpha$, the loss landscape away from the minimum steepens along $x_2$. As a consequence, the gradients grow along this direction. If we don't want our optimization to blow up, we'll need to choose a smaller learning rate $\eta$, reducing progress along $x_1$. The gradient of course stays perpendicular to the isolevels of the loss. In this example we'll move quickly along $x_2$ until we're close to the $x_1$ axis, and then only very slowly creep left towards the minimum. Even worse, as we'll show below, regular updates actually apply the square of the scaling!
And in settings with many dimensions, it will be extremely difficult to find a good learning rate.
Thus, to make proper progress, we somehow need to account for the different scaling of the components of multi-dimensional functions. This requires some form of _inversion_, as we'll outline in detail below.
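To see this scaling in numbers, here is a tiny NumPy check of the gradient of $L$ for the toy function above, evaluated at a fixed point for several values of $\alpha$. It is only a quick sanity check rather than code from the accompanying notebooks; the evaluation point is arbitrary.

```python
import numpy as np

def grad_L(x, alpha):
    """Gradient of L = x_1^2 + x_2^(2*alpha) for the 2D toy problem."""
    return np.array([2.0 * x[0], 2.0 * alpha * x[1] ** (2 * alpha - 1)])

x = np.array([2.0, 2.0])
for alpha in (1, 2, 3):
    g = grad_L(x, alpha)
    print(f"alpha={alpha}: grad={g}, ratio |dL/dx2| / |dL/dx1| = {abs(g[1] / g[0]):.1f}")
# The x_2 component quickly dominates, so a single learning rate eta cannot
# make good progress along both x_1 and x_2 at the same time.
```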
Note that inversion, naturally, does not mean negation ($g^{-1} \ne -g$ 🙄). A negated gradient would definitely move in the wrong direction. We need an update that still points towards a decreasing loss, but accounts for differently scaled dimensions. Hence, a central aim in the following will be _scale-invariance_.

Definition of *scale-invariance*: a scale-invariant optimization for a given function yields the same result for different parametrizations (i.e. scalings) of the function.
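A one-dimensional check makes this definition concrete: for $L(x) = (c\,x)^2$, a plain GD step scales with $c^2$, whereas a Newton step (gradient divided by the second derivative) is independent of $c$. This is an illustrative calculation only; the numbers are arbitrary.

```python
def steps(x, c, eta=0.1):
    """GD and Newton steps for L(x) = (c*x)^2."""
    grad = 2.0 * c ** 2 * x      # dL/dx
    hess = 2.0 * c ** 2          # d2L/dx2
    return -eta * grad, -grad / hess

for c in (1.0, 10.0):
    gd, newton = steps(x=3.0, c=c)
    print(f"c={c}: GD step {gd}, Newton step {newton}")
# GD: -0.6 vs -60.0 (scales with c^2); Newton: -3.0 in both cases (scale-invariant).
```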
![Divider](resources/divider3.jpg)
@@ -32,7 +67,7 @@ Below, we'll proceed in the following steps:

## Traditional optimization methods

As before, let $L(x)$ be a scalar loss function, subject to minimization. The goal is to compute a step in terms of the input parameters $x$ , denoted by $\Delta x$. Below, we'll compute different versions of $\Delta x$ that will be distuingished by a subscript.
We'll now evaluate and discuss how different optimizers perform in comparison. As before, let $L(x)$ be a scalar loss function, subject to minimization. The goal is to compute a step in terms of the input parameters $x$, denoted by $\Delta x$. Below, we'll compute different versions of $\Delta x$ that will be distinguished by a subscript.

All NNs of the previous chapters were trained with gradient descent (GD) via backpropagation. GD with backprop was also employed for the PDE solver (_simulator_) $\mathcal P$, resulting in the DP training approach.
When we simplify the setting, and leave out the NN for a moment, this gives the minimization problem
@@ -79,6 +114,7 @@ where $\eta$ is the scalar learning rate.
The Jacobian $\frac{\partial L}{\partial x}$ describes how the loss reacts to small changes of the input.
Surprisingly, this very widely used update has a number of undesirable properties that we'll highlight in the following. Note that we've naturally applied this update in supervised settings such as {doc}`supervised-airfoils`, but we've also used it in the differentiable physics approaches. E.g., in {doc}`diffphys-code-sol` we've computed the derivative of the fluid solver. In the latter case, we've still only updated the NN parameters, but the fluid solver Jacobian was part of equation {eq}`GD-update`, as shown in {eq}`loss-deriv`.

We'll jointly evaluate GD and several other methods with respect to a range of categories: their handling of units, function sensitivity, and behavior near optima. While these topics are related, they illustrate differences and similarities of the approaches.

**Units** 📏
@@ -92,9 +128,8 @@ One could argue that units aren't very important for the parameters of NNs, but

**Function sensitivity** 🔍

GD has also inherent problems when functions are not _normalized_.
This can be illustrated with a very simple example:
consider the function $L(x) = c \cdot x$.
As illustrated above, GD also has inherent problems when functions are not _normalized_.
Consider the function $L(x) = c \cdot x$.
Then the parameter updates of GD scale with $c$, i.e. $\Delta x_{\text{GD}} = -\eta \cdot c$, and
$L(x+\Delta x_{\text{GD}})$ will even have terms on the order of $c^2$.
If $L$ is normalized via $c=1$, everything's fine. But in practice, we'll often
@@ -113,7 +148,7 @@ For insensitive functions where _large changes_ in the input don't change the ou
Such sensitivity problems can occur easily in complex functions such as deep neural networks where the layers are typically not fully normalized.
Normalization in combination with correct setting of the learning rate $\eta$ can be used to counteract this behavior in NNs to some extent, but these tools are not available when optimizing physics simulations.
Applying normalization to a simulation anywhere but after the last solver step would destroy the state of the simulation.
Adjusting the learning rate is also difficult in practice, e.g. when simulation parameters at different time steps are optimized simultaneously or when the magnitude of the simulation output varies w.r.t. the initial state.
Adjusting the learning rate is also difficult in practice, e.g., when simulation parameters at different time steps are optimized simultaneously or when the magnitude of the simulation output varies w.r.t. the initial state.

**Convergence near optimum** 💎
@@ -205,7 +240,7 @@ are still a very active research topic, and hence many extensions have been prop
## Inverse gradients

As a first step towards fixing the aforementioned issues,
we'll consider what we'll call _inverse_ gradients (IGs).
we'll consider what we'll call _inverse_ gradients (IGs). These methods actually use an inverse of the Jacobian, but as we always have a scalar loss at the end of the computational chain, this results in a gradient vector.
Unfortunately, they come with their own set of problems, which is why they only represent an intermediate step (we'll revisit them in a more practical form later on).

Instead of $L$ (which is scalar), let's consider optimization problems for a generic, potentially non-scalar function $y(x)$.
@@ -221,7 +256,7 @@ Here, the Jacobian $\frac{\partial x}{\partial y}$, which is similar to the inve
The crucial step is the inversion, which of course requires the Jacobian matrix to be invertible. This is a problem somewhat similar to the inversion of the Hessian, and we'll revisit this issue below. However, if we can invert the Jacobian, this has some very nice properties.

Note that instead of using a learning rate, here the step size is determined by the desired increase or decrease of the value of the output, $\Delta y$. Thus, we need to choose a $\Delta y$ instead of an $\eta$, but it effectively has the same role: it controls the step size of the optimization.
It the simplest case, we can compute it as a step towards the ground truth via $\Delta y = \eta (y^* - y)$.
In the simplest case, we can compute it as a step towards the ground truth via $\Delta y = \eta ~ (y^* - y)$.
This $\Delta y$ will show up frequently in the following equations, and make them look quite different to the ones above at first sight.
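As a small numerical illustration of such an IG step, consider again the toy function $\mathbf{y}(x_0,x_1)=[x_0, x_1^2]^T$ from the notebook above, with $\Delta y = \eta\,(y^*-y)$ and target $y^*=0$. The snippet below is just a sketch with an arbitrary $\eta$, not code from the book's notebooks.

```python
import numpy as np

def y_fn(x):                                  # toy "physics": y = (x0, x1^2)
    return np.array([x[0], x[1] ** 2])

def ig_step(x, y_target, eta=0.5):
    J = np.array([[1.0, 0.0],
                  [0.0, 2.0 * x[1]]])         # dy/dx at x
    dy = eta * (y_target - y_fn(x))           # step in y space
    return np.linalg.solve(J, dy)             # dx_IG = (dy/dx)^-1 dy

x = np.array([3.0, 3.0])
print(ig_step(x, y_target=np.zeros(2)))       # a much more balanced step than plain GD gives here
```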
@@ -267,7 +302,7 @@ Thus, we now consider the fact that inverse gradients are linearizations of inve

## Inverse simulators

So far we've discussed the problems of existing methods, and a common theme among the methods that do better, Newton and IGs, is that the regular gradient is not sufficient. We somehow need to address it's problems with some form of _inversion_. Before going into details of NN training and numerical methods to perform this inversion, we will consider one additional "special" case that will further illustrate the need for inversion: if we can make use of an _inverse simulator_, this likewise addresses many of the inherent issues of GD. It actually represents the ideal setting for computing update steps for the physics simulation part.
So far we've discussed the problems of existing methods, and a common theme among the methods that do better, Newton and IGs, is that the regular gradient is not sufficient. We somehow need to address its problems with some form of _inversion_ to arrive at scale invariance. Before going into details of NN training and numerical methods to perform this inversion, we will consider one additional "special" case that will further illustrate the need for inversion: if we can make use of an _inverse simulator_, this likewise addresses many of the inherent issues of GD. It actually represents the ideal setting for computing update steps for the physics simulation part.

Let $y = \mathcal P(x)$ be a forward simulation, and $\mathcal P^{-1}(y)=x$ denote its inverse.
In contrast to the inversion of Jacobian or Hessian matrices from before, $\mathcal P^{-1}$ denotes a full inverse of all functions of $\mathcal P$.
@@ -347,9 +382,9 @@ apply the fundamental theorem of calculus to rewrite the ratio $\Delta x_{\text{
% where we've integrated over a trajectory in $x$, and
% focused on 1D for simplicity. Likewise, by integrating over $z$ we can obtain:

$\begin{aligned}
$$\begin{aligned}
\frac{\Delta x_{\text{PG}}}{\Delta y} = \frac{\int_{y_0}^{y_0+\Delta y} \frac{\partial x}{\partial y} \, dy}{\Delta y}
\end{aligned}$
\end{aligned}$$

Here the expression inside the integral is the local gradient, and we assume it exists at all points between $y_0$ and $y_0+\Delta y$.
The local gradients are averaged along the path connecting the state before the update with the state after the update.
@@ -377,7 +412,7 @@ More formally, $\lim_{y \rightarrow y_0} \frac{\mathcal P^{-1}(y; x_0) - P^{-1}(
Local inverse functions can exist, even when a global inverse does not.

Non-injective functions can be inverted, for example, by choosing the closest $x$ to $x_0$ such that $\mathcal P(x) = y$.
As an example, consider $\mathcal P(x) = x^2$. It doesn't have a global inverse as two solutions ($\pm$) exist for each $y$. However, we can easily construct a local inverse by choosing the closer one of the two solutions, the positive $x$ in this example.
As an example, consider $\mathcal P(x) = x^2$. It doesn't have a global inverse as two solutions ($\pm$) exist for each $y$. However, we can easily construct a local inverse by choosing the solution closest to an initial guess.

For differentiable functions, a local inverse is guaranteed to exist by the inverse function theorem as long as the Jacobian is non-singular.
That is because the inverse Jacobian $\frac{\partial x}{\partial y}$ itself is a local inverse function, albeit, being only first-order, not the most accurate one.
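The $\mathcal P(x)=x^2$ case fits in a few lines of Python; the helper name and the branch-selection rule via an initial guess are just for illustration.

```python
import numpy as np

def local_inverse_square(y, x_guess):
    """Local inverse of P(x) = x**2: return the root closest to x_guess."""
    r = np.sqrt(max(y, 0.0))
    return r if abs(r - x_guess) <= abs(-r - x_guess) else -r

print(local_inverse_square(4.0, x_guess=-1.5))   # -2.0, the branch near the guess
print(local_inverse_square(4.0, x_guess=+0.5))   # +2.0
```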
@@ -423,5 +458,5 @@ In the worst case, we can therefore fall back to the regular gradient.

Also, we have turned the step w.r.t. $L$ into a step in $y$ space: $\Delta y$.
However, this does not prescribe a unique way to compute $\Delta y$ since the derivative $\frac{\partial y}{\partial L}$ as the right-inverse of the row-vector $\frac{\partial L}{\partial y}$ puts almost no restrictions on $\Delta y$.
Instead, we use a Newton step from equation {eq}`quasi-newton-update` to determine $\Delta y$ where $\eta$ controls the step size of the optimization steps. We will explain this in more detail in connection with the introduction of NNs in the next section.
Instead, we use a Newton step from equation {eq}`quasi-newton-update` to determine $\Delta y$ where $\eta$ controls the step size of the optimization steps. We will explain this in more detail in connection with the introduction of NNs after the following code example.
BIN resources/physgrad-scaling.jpg (new binary file, not shown; 86 KiB)