PG text update

This commit is contained in:
NT 2021-03-26 10:12:16 +08:00
parent 5bf263a60a
commit 145fa437c1
3 changed files with 115 additions and 40 deletions


@ -44,8 +44,7 @@ read chapters 6 to 9 of the [Deep Learning book](https://www.deeplearningbook.or
especially the sections about [MLPs](https://www.deeplearningbook.org/contents/mlp.html)
and "Conv-Nets", i.e. [CNNs](https://www.deeplearningbook.org/contents/convnets.html).
```{admonition} Note: Classification vs Regression
:class: tip
The classic ML distinction between _classification_ and _regression_ problems is not so important here:
we only deal with _regression_ problems in the following.

View File

@ -1,23 +1,19 @@
Physical Gradients and NNs
=======================
The discussion in the previous two sections already hints at physical gradients (PGs) being a powerful tool for optimization. However, we've actually cheated a bit in the previous code example {doc}`physgrad-comparison` and used PGs in a way that will be explained in more detail below.
...
## Training via Physical Gradients
By default, PGs would be restricted to functions with square Jacobians. Hence we wouldn't be able to directly use them in optimizations or learning problems, which typically have scalar objective functions.
In this section, we will first show how PGs can be integrated into the optimization pipeline to optimize scalar objectives.
## Physical Gradients and Loss Functions
As before, we consider a scalar objective function $L(z)$ that depends on the result of an invertible simulator $z = \mathcal P(x)$. In {doc}`physgrad` we've outlined the inverse gradient (IG) update $\Delta x = \frac{\partial x}{\partial L} \cdot \Delta L$, where $\Delta L$ denotes a step to take in terms of the loss.
By applying the chain rule and substituting the IG $\frac{\partial x}{\partial L}$ for the PG, we obtain
$$
\begin{aligned}
\Delta x
&= \frac{\partial x}{\partial L} \cdot \Delta L
\\
@ -28,16 +24,108 @@ $\begin{aligned}
&= \mathcal P^{-1}_{(x_0,z_0)}(z_0 + \Delta z) - x_0 + \mathcal O(\Delta z^2)
.
\end{aligned}
$$
This equation has turned the step w.r.t. $L$ into a step in $z$ space: $\Delta z$.
However, it does not prescribe a unique way to compute $\Delta z$ since the derivative $\frac{\partial z}{\partial L}$ as the right-inverse of the row-vector $\frac{\partial L}{\partial z}$ puts almost no restrictions on $\Delta z$.
Instead, we use a Newton step (equation {eq}`quasi-newton-update`) to determine $\Delta z$ where $\eta$ controls the step size of the optimization steps.
Here an obvious question is: doesn't this leave us with the disadvantage of having to compute the inverse Hessian, as discussed before?
Luckily, unlike with regular Newton or quasi-Newton methods, where the Hessian of the full system is required, here, the Hessian is needed only for $L(z)$. Even better, for many typical $L$ its computation can be completely forgone.
E.g., consider the case $L(z) = \frac 1 2 || z^\textrm{predicted} - z^\textrm{target}||_2^2$ which is the most common supervised objective function.
Here $\frac{\partial L}{\partial z} = z^\textrm{predicted} - z^\textrm{target}$ and $\frac{\partial^2 L}{\partial z^2} = 1$.
Using equation {eq}`quasi-newton-update`, we get $\Delta z = \eta \cdot (z^\textrm{target} - z^\textrm{predicted})$ which can be computed without evaluating the Hessian.
Once $\Delta z$ is determined, the gradient can be backpropagated to earlier time steps using the inverse simulator $\mathcal P^{-1}$. We've already used this combination of a Newton step for the loss and PGs for the PDE in {doc}`physgrad-comparison`.
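To make this concrete, here is a minimal Python sketch of one such update step. The cubic simulator, its inverse, and all values are illustrative assumptions, not part of the text above; since the toy inverse is exact, it doubles as the local inverse $\mathcal P^{-1}_{(x_0,z_0)}$ here.

```python
import numpy as np

# Toy invertible simulator P(x) = x^3 with exact inverse P_inv(z) = cbrt(z).
def P(x):      return x**3
def P_inv(z):  return np.cbrt(z)

x = np.array([0.5, 2.0])           # current guess for the simulator input
z_target = np.array([8.0, 1.0])    # desired simulator output
eta = 0.3                          # step size of the Newton step in z

z = P(x)                           # forward simulation
dz = eta * (z_target - z)          # Newton step for L = 1/2 ||z - z_target||^2
dx = P_inv(z + dz) - x             # PG: map the step back into x space
x = x + dx                         # updated input, to be iterated as needed
```

Note that the Newton step only requires the residual $z^\textrm{target} - z$, while all higher-order information about the physics enters through the evaluation of $\mathcal P^{-1}$.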
## NN Training
The previous step gives us an update for the input of the discretized PDE $\mathcal P(x)$, i.e. a $\Delta x$. If $x$ was an output of an NN, we can then use established DL algorithms to backpropagate the desired change to the weights of the network.
We have a large collection of powerful methodologies for training neural networks at our disposal,
so it is crucial that we can continue using them for training the NN components.
On the other hand, due to the problems of GD for physical simulations (as outlined in {doc}`physgrad`),
we aim for using PGs to accurately optimize through the simulation.
Consider the following setup:
A neural network makes a prediction $x = \mathrm{NN}(a \,;\, \theta)$ about a physical state based on some input $a$ and the network weights $\theta$.
The prediction is passed to a physics simulation that computes a later state $z = \mathcal P(x)$, and hence
the objective $L(z)$ depends on the result of the simulation.
```{admonition} Combined training algorithm
:class: tip
To train the weights $\theta$ of the NN, we then perform the following updates:
* Evaluate $\Delta z$ via a Newton step as outlined above
* Compute the PG $\Delta x = \mathcal P^{-1}_{(x, z)}(z + \Delta z) - x$ using an inverse simulator
* Use GD or a GD-based optimizer to compute the updates to the network weights, $\Delta\theta = \eta_\textrm{NN} \cdot \frac{\partial x}{\partial\theta} \cdot \Delta x$
```
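Under the same toy assumptions as above (cubic simulator, and a single linear layer with a hand-written gradient standing in for the NN), a rough sketch of this combined loop could look as follows; it is meant to show the structure, not a definitive implementation:

```python
import numpy as np

def P(x):      return x**3          # toy invertible simulator, as before
def P_inv(z):  return np.cbrt(z)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (2, 2))    # network weights theta (one linear layer)
eta, eta_nn = 0.5, 0.05             # Newton step size and NN learning rate

for step in range(1000):
    a = rng.normal(size=2)          # network input
    z_target = a**3                 # supervised target in z space (so x* = a)
    x = W @ a                       # NN prediction x = NN(a; theta)
    z = P(x)                        # forward simulation
    dz = eta * (z_target - z)       # Newton step for the L2 loss
    dx = P_inv(z + dz) - x          # PG via the inverse simulator
    W += eta_nn * np.outer(dx, a)   # GD step on the weights
```

The last line is a plain GD step: the Newton and PG stages only changed the target that the network output is pulled towards.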
The combined optimization algorithm depends on both the **learning rate** $\eta_\textrm{NN}$ for the network as well as the step size $\eta$ from above, which factors into $\Delta z$.
To first order, the effective learning rate of the network weights is $\eta_\textrm{eff} = \eta \cdot \eta_\textrm{NN}$.
We recommend setting $\eta$ as large as the accuracy of the inverse simulator allows, before choosing $\eta_\textrm{NN} = \eta_\textrm{eff} / \eta$ to achieve the target network learning rate.
This allows for nonlinearities of the simulator to be maximally helpful in adjusting the optimization direction.
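For example, to realize a desired effective rate (values purely illustrative):

```python
eta_eff = 0.01          # desired effective learning rate for the weights
eta = 0.5               # as large as the inverse simulator's accuracy allows
eta_nn = eta_eff / eta  # resulting network learning rate, here 0.02
```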
**Note:**
For simple objectives like a loss of the form $L=|z - z^*|^2$, this procedure can be easily integrated into a GD autodiff pipeline by replacing only the gradient of the simulator.
This gives an effective objective function for the network
$$
L_\mathrm{NN} = \frac 1 2 | x - \mathcal P_{(x,z)}^{-1}(z + \Delta z) |^2
$$
where $\mathcal P_{(x,z)}^{-1}(z + \Delta z)$ is treated as a constant.
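A minimal self-contained sketch of this proxy objective in an autodiff pipeline, using PyTorch purely as an example framework; the network size, step sizes, and the cubic toy simulator are again illustrative assumptions:

```python
import torch

net = torch.nn.Sequential(          # small MLP standing in for NN(a; theta)
    torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
eta = 0.5                           # Newton step size in z

for step in range(1000):
    a = torch.randn(8, 2)           # batch of network inputs
    z_target = a**3
    x = net(a)                      # NN prediction
    with torch.no_grad():           # PG target is treated as a constant
        z = x**3                    # forward simulation P(x) = x^3
        dz = eta * (z_target - z)   # Newton step for the L2 loss
        x_pg = torch.sign(z + dz) * (z + dz).abs() ** (1.0 / 3.0)  # inverse sim
    loss = 0.5 * ((x - x_pg)**2).sum()   # effective objective L_NN
    opt.zero_grad(); loss.backward(); opt.step()
```

The `torch.no_grad()` block is what makes $\mathcal P_{(x,z)}^{-1}(z + \Delta z)$ a constant, so the autodiff pass only differentiates the network itself.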
## Iterations and Time Dependence
The above procedure describes the optimization of neural networks that make a single prediction.
This is suitable for scenarios such as reconstructing the state of a system at $t_0$ given its state at a later time $t_e > t_0$, or estimating an optimal initial state to match certain conditions at $t_e$.
However, our method can also be applied to more complex setups involving multiple objectives and multiple network interactions at different times.
Such scenarios arise, e.g., in control tasks, where a network induces small forces at every time step to reach a certain physical state at $t_e$. They also occur in correction tasks, where a network tries to improve the simulation quality by performing corrections at every time step.
In these scenarios, the process above (Newton step for loss, PG step for physics, GD for the NN) is iteratively repeated, e.g., over the course of different time steps, leading to a series of additive terms in $L$.
This typically makes the learning task more difficult, as we repeatedly backpropagate through the iterations of the physical solver and the NN, but the PG learning algorithm above extends to these cases just like regular GD training.
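As a sketch of this repeated backpropagation through time, again with the toy cubic simulator as a stand-in for one solver step (illustrative only):

```python
import numpy as np

def P(x):      return x**3          # toy "time step" of the solver
def P_inv(z):  return np.cbrt(z)

n_steps = 3
states = [np.array([1.2])]
for _ in range(n_steps):            # forward pass through time
    states.append(P(states[-1]))

dz = 0.5 * (np.array([0.5]) - states[-1])   # Newton step at the final time
for t in reversed(range(n_steps)):  # one inverse-simulator pass per time step
    dz = P_inv(states[t + 1] + dz) - states[t]
print("update for the initial state:", dz)
```

Network updates at intermediate times would then be driven by the intermediate $\Delta$ quantities, analogous to the single-step case.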
## Time Reversal
The inverse function of a simulator is typically the time-reversed physical process.
In some cases, simply inverting the time axis of the forward simulator, $t \rightarrow -t$, can yield an adequate global inverse simulator.
%
Unless the simulator destroys information in practice, e.g., due to accumulated numerical errors or stiff linear systems, this straightforward approach is often a good starting point for an inverse simulation, or for formulating a _local_ inverse simulation.
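As a small illustration of why this works to first order, consider a forward Euler step of the decay equation $\frac{dx}{dt} = -\lambda x$, with the time axis negated for the inverse (a toy setup with illustrative values):

```python
lam, dt = 0.5, 0.1

def step_forward(x): return x + dt * (-lam * x)   # forward Euler, t -> t + dt
def step_reverse(z): return z - dt * (-lam * z)   # time-reversed step, t -> t - dt

x0 = 1.0
x_rec = step_reverse(step_forward(x0))
print(x0 - x_rec)   # 0.0025 = (lam * dt)**2, an O(dt^2) inversion error
```

The reconstruction error scales with $\mathcal O(\Delta t^2)$, so for well-resolved simulations the time-reversed solver can serve as a reasonable inverse.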
---
## A Learning Toolbox
Taking a step back, what we have here is a flexible "toolbox" for propagating update steps
through different parts of a system to be optimized. An important takeaway message is that
the regular gradients we are working with for training NNs are not the best choice when PDEs are
involved. In these situations we can get much better information about how to direct the
optimization than the localized first-order information that regular gradients provide.
Above we've motivated a combination of inverse simulations, Newton steps, and regular gradients.
In general, it's a good idea to consider separately for each piece that makes up a learning
task what information we can get out of it for training an NN. The approach explained so far
gives us a _toolbox_ to concatenate update steps coming from the different sources, and due
to the very active research in this area we'll surely discover new and improved ways to compute
these updates.
```{figure} resources/placeholder.png
---
height: 220px
name: pg-training
---
TODO, visual overview of toolbox, combinations
```
In the next sections we'll show examples of training physics-based NNs
with invertible simulations. (These will follow soon, stay tuned.)


@ -15,7 +15,7 @@ In the former case, the simulator is only required at training time, while in th
Below, we'll proceed in the following steps:
- we'll first show the problems with regular gradient descent, especially for functions that combine small and large scales,
- a central insight will be that an _inverse gradient_ is a lot more meaningful than the regular one,
- finally, we'll show how to use inverse functions (and especially inverse PDE solvers) to compute a very accurate update that includes higher-order terms.
```
@ -144,7 +144,7 @@ This behavior stems from the fact that the Hessian of a function composition car
Consider a function composition $L(z(x))$, with $L$ as above, and an additional function $z(x)$.
Then the Hessian $\frac{d^2L}{dx^2} = \frac{\partial^2L}{\partial z^2} \left( \frac{\partial z}{\partial x} \right)^2 + \frac{\partial L}{\partial z} \cdot \frac{\partial^2 z}{\partial x^2}$ depends on the square of the inner gradient $\frac{\partial z}{\partial x}$.
This means that the Hessian is influenced by the _later_ derivatives of a backpropagation pass,
and as a consequence, the update of any latent space is unknown during the computation of the gradients.
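This composition Hessian is easy to verify symbolically, e.g. with sympy (a quick illustrative check, not part of the text's toolchain):

```python
import sympy as sp

x = sp.symbols('x')
z = sp.Function('z')(x)
L = sp.Function('L')(z)
# Prints the chain-rule form  L''(z) * z'(x)**2 + L'(z) * z''(x)
print(sp.diff(L, x, 2))
```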
% chain of function evaluations: Hessian of an outer function is influenced by inner ones; inversion corrects and yields quantity similar to IG, but nonetheless influenced by "later" derivatives
@ -167,8 +167,10 @@ While this may not seem like a big restriction, note that many common neural net
%
Related to this is the problem that higher-order derivatives tend to change more quickly when traversing the parameter space, making them more prone to high-frequency noise in the loss landscape.
```{note}
_Quasi-Newton Methods_
are still a very active research topic, and hence many extensions have been proposed that can alleviate some of these problems in certain settings. E.g., the memory requirement problem can be sidestepped by storing only lower-dimensional vectors that can be used to approximate the Hessian. However, these difficulties illustrate the problems that often arise when applying methods like BFGS.
```
%\nt{In contrast to these classic algorithms, we will show how to leverage invertible physical models to efficiently compute physical update steps. In certain scenarios, such as simple loss functions, computing the inverse gradient via the inverse Hessian will also provide a useful building block for our final algorithm.}
%, and how to they can be used to improve the training of neural networks.
@ -181,7 +183,7 @@ Quasi-Newton methods are a very active research topic, and hence many extensions
## Derivation of Physical Gradients
As a first step towards _physical_ gradients, we introduce _inverse_ gradients (IGs),
which naturally solve many of the aforementioned problems.
Instead of $L$ (which is scalar), let's consider the function $z(x)$. We define the update
@ -342,17 +344,3 @@ The physical gradient instead allows us to more accurately backpropagate through
Before moving on to including PGs in NN training processes, the next section will illustrate the differences between these approaches with a practical example.