The discussion in the previous two sections already hints at the inversion of gradients being an important step for optimization and learning.
We will now integrate the update step $\Delta x_{\text{PG}}$ into NN training, and give details of the two-step process of combining an inverse simulator with a Newton step for the loss, which was already used in the code from {doc}`physgrad-comparison`.
As hinted at in the IG section of {doc}`physgrad`, we're focusing on NN solutions of _inverse problems_ below. That means we have $y = \mathcal P(x)$, and our goal is to train an NN representation $f$ such that $f(y;\theta)=x$. This is a slightly more constrained setting than what we've considered for differentiable physics (DP) training. Also, as we're targeting optimization algorithms now, we won't explicitly denote DP approaches: all of the following variants involve physics simulators, and the gradient descent (GD) version as well as its variants (such as Adam) use DP training.
```{note}
Important to keep in mind:
In contrast to the previous sections and {doc}`overview-equations`, we are targeting inverse problems, and hence $y$ is the input to the network: $f(y;\theta)$. Correspondingly, it outputs $x$, and the ground truth solutions are denoted by $x^*$.
```
%By default, PGs would be restricted to functions with square Jacobians. Hence we wouldn't be able to directly use them in optimizations or learning problems, which typically have scalar objective functions.
<!-- In this section, we will first show how PGs can be integrated into the optimization pipeline to optimize scalar objectives.
As before, we consider a scalar objective function $L(y)$ that depends on the result of an invertible simulator $y = \mathcal P(x)$. In {doc}`physgrad` we've outlined the inverse gradient (IG) update $\Delta x = \frac{\partial x}{\partial L} \cdot \Delta L$, where $\Delta L$ denotes a step to take in terms of the loss.
This equation has turned the step w.r.t. $L$ into a step in $y$ space: $\Delta y$.
However, it does not prescribe a unique way to compute $\Delta y$ since the derivative $\frac{\partial y}{\partial L}$ as the right-inverse of the row-vector $\frac{\partial L}{\partial y}$ puts almost no restrictions on $\Delta y$.
Instead, we use a Newton step (equation {eq}`quasi-newton-update`) to determine $\Delta y$ where $\eta$ controls the step size of the optimization steps. -->
To integrate the update step from equation {eq}`PG-def` into the training process for an NN, we consider three components: the NN itself, the physics simulator, and the loss function.
To join these three pieces together, we use the following algorithm. As introduced by Holl et al. {cite}`holl2021pg`, we'll denote this training process as _scale-invariant physics_ (SIP) training.
% gives us an update for the input of the discretized PDE $\mathcal P^{-1}(x)$, i.e. a $\Delta x$. If $x$ was an output of an NN, we can then use established DL algorithms to backpropagate the desired change to the weights of the network.
% Consider the following setup: A neural network $f()$ makes a prediction $x = f(a \,;\, \theta)$ about a physical state based on some input $a$ and the network weights $\theta$. The prediction is passed to a physics simulation that computes a later state $y = \mathcal P(x)$, and hence the objective $L(y)$ depends on the result of the simulation.
To update the weights $\theta$ of the NN $f$, we perform the following update step (a code sketch of this procedure follows below):
* Given a set of inputs $y^*$, evaluate the forward pass to compute the NN prediction $x = f(y^*; \theta)$
* Compute $y$ via a forward simulation ($y = \mathcal P(x)$) and invoke the (local) inverse simulator $\mathcal P^{-1}(y; x)$ to obtain the step $\Delta x_{\text{PG}} = \mathcal P^{-1} (y + \eta \Delta y; x) - x$ with $\Delta y = y^* - y$
* Evaluate the network loss, e.g., $L = \frac 1 2 || x - \tilde x ||_2^2$ with $\tilde x = x+\Delta x_{\text{PG}}$, and perform a Newton step treating $\tilde x$ as a constant
* Use GD (or a GD-based optimizer like Adam) to propagate the change in $x$ to the network weights $\theta$ with a learning rate $\eta_{\text{NN}}$
% * Compute the scale-invariant update $\Delta x_{\text{PG}} = \mathcal P^{-1}(y + \Delta y; x_0) - x$ using an inverse simulator
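The list above maps directly to code. Below is a minimal sketch of one SIP update in PyTorch; the callables `f` (the network), `P` (the forward simulator), and `P_inv` (the local inverse simulator) are placeholder names assumed to be provided, not an API prescribed by the text:

```python
import torch

def sip_update(f, P, P_inv, y_star, opt, eta=1.0):
    """One SIP training step, following the four bullets above."""
    x = f(y_star)                            # NN prediction x = f(y*; theta)
    with torch.no_grad():                    # physics part carries no gradients
        y = P(x)                             # forward simulation y = P(x)
        dy = y_star - y                      # Newton step for the L2 loss
        x_tilde = P_inv(y + eta * dy, x)     # x~ = x + dx_PG from the local inverse
    loss = 0.5 * ((x - x_tilde) ** 2).sum()  # proxy loss, x_tilde is constant
    opt.zero_grad()
    loss.backward()                          # GD/Adam propagates the change in x to theta
    opt.step()
    return float(loss)
```

Note that computing `x_tilde` inside `torch.no_grad()` is what realizes "treating $\tilde x$ as a constant" from the third bullet.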
This combined optimization algorithm depends on both the learning rate $\eta_\textrm{NN}$ for the network as well as the step size $\eta$ from above, which factors into $\Delta y$.
We recommend setting $\eta$ as large as the accuracy of the inverse simulator allows. In many cases $\eta=1$ is possible, otherwise $\eta_\textrm{NN}$ should be adjusted accordingly.
This algorithm combines an inverse simulator, which provides accurate, higher-order updates, with traditional training schemes for NN representations. This is an attractive property, as the large collection of powerful methodologies for training NNs remains fully applicable. The treatment of the loss function as "glue" between NN and physics components plays a central role here.
In the above algorithm, we have assumed an $L^2$ loss and, without further explanation, introduced a Newton step to propagate the inverse simulator step to the NN. Below, we explain and justify this treatment in more detail.
%Here an obvious questions is: Doesn't this leave us with the disadvantage of having to compute the inverse Hessian, as discussed before?
The central reason for introducing a Newton step is the improved accuracy for the loss derivative.
Unlike with regular Newton or the quasi-Newton methods from equation {eq}`quasi-newton-update`, we do not need the Hessian of the full system.
Instead, the Hessian is only needed for $L(y)$.
This makes Newton's method attractive again.
Even better, for many typical $L$ its computation can be completely forgone.
E.g., consider the most common supervised objective function, $L(y) = \frac 1 2 || y - y^* ||_2^2$, as already put to use above, where $y$ denotes the predicted and $y^*$ the target value.
We then have $\frac{\partial L}{\partial y} = y - y^*$ and $\frac{\partial^2 L}{\partial y^2} = 1$.
Using equation {eq}`quasi-newton-update`, we get $\Delta y = \eta \cdot (y^* - y)$ which can be computed without evaluating the Hessian.
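Spelled out via the quasi-Newton update {eq}`quasi-newton-update`, this is the one-line computation

$$
\Delta y = -\eta \left( \frac{\partial^2 L}{\partial y^2} \right)^{-1} \frac{\partial L}{\partial y} = -\eta \, (y - y^*) = \eta \, (y^* - y) .
$$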
Once $\Delta y$ is determined, the gradient can be backpropagated to earlier time steps using the inverse simulator $\mathcal P^{-1}$. We've already used this combination of a Newton step for the loss and an inverse simulator for the PDE in {doc}`physgrad-comparison`.
The loss here acts as a _proxy_ to embed the update from the inverse simulator into the network training pipeline.
It is not to be confused with a traditional supervised loss in $x$ space.
Due to the dependency of $\mathcal P^{-1}$ on the prediction $y$, it does not average multiple modes of solutions in $x$.
To demonstrate this, consider the case where GD is used as the solver for the inverse simulation.
Then the total loss is purely defined in $y$ space, reducing to a regular first-order optimization.
Hence, the proxy loss function simply connects the computational graphs of inverse physics and NN for backpropagation.
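This role can be made explicit: with $\tilde x = x + \Delta x_{\text{PG}}$ treated as a constant, the proxy loss from above yields

$$
\frac{\partial L}{\partial x} = x - \tilde x = -\Delta x_{\text{PG}} ,
$$

so backpropagation passes exactly the inverse-simulator step on to the network weights, scaled by $\eta_\textrm{NN}$ in the GD update.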
The above procedure describes the optimization of neural networks that make a single prediction.
This is suitable for scenarios where we reconstruct the state of a system at a time $t_0$ given the state at a later time $t_e > t_0$, or where we estimate an optimal initial state to match certain conditions at $t_e$.
However, our method can also be applied to more complex setups involving multiple objectives and multiple network interactions at different times.
Such scenarios arise, e.g., in control tasks, where a network induces small forces at every time step to reach a certain physical state at $t_e$, and in correction tasks, where a network improves the simulation quality by performing corrections at every time step.
In these scenarios, the process above (Newton step for loss, PG step for physics, GD for the NN) is iteratively repeated, e.g., over the course of different time steps, leading to a series of additive terms in $L$.
This typically makes the learning task more difficult, as we repeatedly backpropagate through the iterations of the physical solver and the NN, but the PG learning algorithm above extends to these cases just like a regular GD training.
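As an illustration, here is a hedged sketch of such an iterated setup for a correction task, reusing the placeholder callables `f`, `P`, and `P_inv` from above (all three are assumed to be differentiable PyTorch functions, and the per-step reference states `refs` are likewise an assumption of this sketch):

```python
def sip_correction_rollout(f, P, P_inv, s0, refs, opt, eta=1.0):
    """One SIP update over a rollout with one objective per time step."""
    s, loss = s0, 0.0
    for s_ref in refs:                        # objectives at different times
        x = s + f(s)                          # NN correction at this step
        y = P(x)                              # simulator advances one step
        with torch.no_grad():                 # inverse step, no gradients
            dy = s_ref - y                    # Newton step for the L2 loss
            x_tilde = P_inv(y + eta * dy, x)  # local inverse simulator
        loss = loss + 0.5 * ((x - x_tilde) ** 2).sum()  # additive terms in L
        s = y                     # keep the graph: backprop later passes
                                  # through P at every iteration
    opt.zero_grad()
    loss.backward()               # through all solver and NN iterations
    opt.step()
    return float(loss)
```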
The inverse function of a simulator is typically the time-reversed physical process.
In some cases, simply inverting the time axis of the forward simulator, $t \rightarrow -t$, can yield an adequate global inverse simulator.
%
Unless the simulator destroys information in practice, e.g., due to accumulated numerical errors or stiff linear systems, this straightforward approach is often a good starting point for an inverse simulation, or for formulating a _local_ inverse simulation.
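As a toy illustration of this time-reversal idea (a constructed example, not taken from the text above): for a forward Euler step of the trivial ODE $\dot x = v$ with constant $v$, running the update with $t \rightarrow -t$ yields an exact global inverse:

```python
import numpy as np

def P(x, v=1.0, dt=0.1):
    """Toy forward step: explicit Euler update for dx/dt = v."""
    return x + v * dt

def P_inv(y, v=1.0, dt=0.1):
    """Time-reversed step (t -> -t); exactly undoes P in this toy case."""
    return y - v * dt

x = np.array([0.3, 0.7])
assert np.allclose(P_inv(P(x)), x)  # holds here; real solvers that destroy
                                    # information only admit local inverses
```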