started SIP training part

This commit is contained in:
NT 2022-03-17 13:29:28 +08:00
parent 627bd1f94f
commit e5b3776c3b
3 changed files with 115 additions and 64 deletions

View File

@ -4,9 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Physical Gradients for X\n",
"# Practical Example with SIP Training\n",
"\n",
"placeholder only, insert more complex PG example \n",
"placeholder only, insert more complex SIP example \n",
"\n",
"..."
]

View File

@ -1,10 +1,22 @@
Scale Invariant Physics Training
=======================
The discussion in the previous two sections already hints at the inversion of gradients being an important step for optimization and learning.
We will now integrate the update step $\Delta x_{\text{PG}}$ into NN training, and give details of the two-way process of inverse simulator and Newton step for the loss that was already used in the code of {doc}`physgrad-comparison`.
As hinted at in the IG section of {doc}`physgrad`, we're focusing on NN solutions of _inverse problems_ below. That means we have $y = \mathcal P(x)$, and our goal is to train an NN representation $f$ such that $f(y;\theta)=x$. This is a slightly more constrained setting than what we've considered for differentiable physics (DP) training. Also, as we're targeting optimization algorithms now, we won't explicitly denote DP approaches: all of the following variants involve physics simulators, and the gradient descent (GD) version as well as its variants (such as Adam) use DP training.
```{note}
Important to keep in mind:
In contrast to the previous sections and {doc}`overview-equations`, we are targeting inverse problems, and hence $y$ is the input to the network: $f(y;\theta)$. Correspondingly, it outputs $x$, and the ground truth solutions are denoted by $x^*$.
```
%By default, PGs would be restricted to functions with square Jacobians. Hence we wouldn't be able to directly use them in optimizations or learning problems, which typically have scalar objective functions.
%xxx really? just in-out relationships? xxx
<!-- In this section, we will first show how PGs can be integrated into the optimization pipeline to optimize scalar objectives.
## Physical Gradients and loss functions
@ -28,58 +40,76 @@ $$
This equation has turned the step w.r.t. $L$ into a step in $y$ space: $\Delta y$.
However, it does not prescribe a unique way to compute $\Delta y$ since the derivative $\frac{\partial y}{\partial L}$ as the right-inverse of the row-vector $\frac{\partial L}{\partial y}$ puts almost no restrictions on $\Delta y$.
Instead, we use a Newton step (equation {eq}`quasi-newton-update`) to determine $\Delta y$ where $\eta$ controls the step size of the optimization steps. -->
## NN training
We have a large collection of powerful methodologies for training neural networks at our disposal,
so it is crucial that we can continue using them for training the NN components.
On the other hand, due to the problems of GD for physical simulations (as outlined in {doc}`physgrad`),
we aim to use PGs to accurately optimize through the simulation.
To integrate the update step from equation {eq}`PG-def` into the training process for an NN, we consider three components: the NN itself, the physics simulator, and the loss function.
To join these three pieces together, we use the following algorithm. As introduced by Holl et al. {cite}`holl2021pg`, we'll denote this training process as _scale-invariant physics_ (SIP) training.
% gives us an update for the input of the discretized PDE $\mathcal P^{-1}(x)$, i.e. a $\Delta x$. If $x$ was an output of an NN, we can then use established DL algorithms to backpropagate the desired change to the weights of the network.
% Consider the following setup: A neural network $f()$ makes a prediction $x = f(a \,;\, \theta)$ about a physical state based on some input $a$ and the network weights $\theta$. The prediction is passed to a physics simulation that computes a later state $y = \mathcal P(x)$, and hence the objective $L(y)$ depends on the result of the simulation.
```{admonition} Scale-Invariant Physics (SIP) Training
:class: tip
To update the weights $\theta$ of the NN $f$, we perform the following update step:
* Given a set of inputs $y^*$, evaluate the forward pass to compute the NN prediction $x = f(y^*; \theta)$
* Compute $y$ via a forward simulation ($y = \mathcal P(x)$) and invoke the (local) inverse simulator $\mathcal P^{-1}(y; x)$ to obtain the step $\Delta x_{\text{PG}} = \mathcal P^{-1} (y + \eta \Delta y; x) - x$ with $\Delta y = y^* - y$
* Evaluate the network loss, e.g., $L = \frac 1 2 || x - \tilde x ||_2^2$ with $\tilde x = x+\Delta x_{\text{PG}}$, and perform a Newton step treating $\tilde x$ as a constant
* Use GD (or a GD-based optimizer like Adam) to propagate the change in $x$ to the network weights $\theta$ with a learning rate $\eta_{\text{NN}}$
```
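To make this more concrete, here is a minimal sketch of one SIP update in PyTorch. The toy simulator $\mathcal P(x)=x^3$, its analytic inverse, the tiny network, and names such as `sip_step` are our own illustrative assumptions, not part of the algorithm above; a real application would substitute its own (local) inverse simulator.

```python
import torch

# Toy invertible simulator P(x) = x^3 with analytic inverse (illustrative only).
def P(x):
    return x ** 3

def P_inv(y):
    return torch.sign(y) * torch.abs(y) ** (1.0 / 3.0)

f = torch.nn.Sequential(  # small NN representation f(y*; theta) -> x
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
optimizer = torch.optim.Adam(f.parameters(), lr=1e-3)  # eta_NN
eta = 1.0  # step size in y space

def sip_step(y_star):
    x = f(y_star)                       # forward pass: NN prediction
    y = P(x)                            # forward simulation
    dy = y_star - y                     # Newton step for L = 1/2 ||y - y*||^2
    with torch.no_grad():               # inverse-simulator output is a constant
        x_tilde = P_inv(y + eta * dy)   # x + Delta x_PG
    loss = 0.5 * ((x - x_tilde) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()                     # propagate the change in x to theta ...
    optimizer.step()                    # ... via a GD-based optimizer
    return loss.item()

for i in range(1000):                   # train on targets y* = P(x*), x* in [-1,1]
    x_star = torch.rand(64, 1) * 2 - 1
    sip_step(P(x_star))
```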
% xxx TODO, make clear, we're solving the inverse problem $f(y; \theta)=x$
% * Compute the scale-invariant update $\Delta x_{\text{PG}} = \mathcal P^{-1}(y + \Delta y; x_0) - x$ using an inverse simulator
This combined optimization algorithm depends on both the learning rate $\eta_\textrm{NN}$ for the network and the step size $\eta$ from above, which factors into $\Delta y$.
To first order, the effective learning rate of the network weights is $\eta_\textrm{eff} = \eta \cdot \eta_\textrm{NN}$.
We recommend setting $\eta$ as large as the accuracy of the inverse simulator allows. In many cases $\eta=1$ is possible, otherwise $\eta_\textrm{NN}$ should be adjusted accordingly.
This allows for nonlinearities of the simulator to be maximally helpful in adjusting the optimization direction.
This algorithm combines the accurate, higher-order updates from the inverse simulator with traditional training schemes for NN representations. This is an attractive property, as the large collection of powerful methodologies for training NNs stays relevant in this way. The treatment of the loss function as "glue" between NN and physics components plays a central role here.
**Note:**
For simple objectives like a loss of the form $L=|y - y^*|^2$, this procedure can be easily integrated into a GD-based autodiff pipeline by replacing only the gradient of the simulator.
This gives an effective objective function for the network
$$
L_\mathrm{NN} = \frac 1 2 | x - \mathcal P^{-1}(y + \Delta y; x) |^2
$$

where $\mathcal P^{-1}(y + \Delta y; x)$ is treated as a constant.
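One way to realize this replacement in practice is a custom autodiff function that runs the regular simulation forward but returns the inverse-simulator update in the backward pass. The sketch below is only one possible realization under our own assumptions (the toy $\mathcal P(x)=x^3$, $\eta=1$, an $L^2$ loss in $y$, and the hypothetical name `SIPPhysics`):

```python
import torch

def P(x):        # toy simulator, stand-in for a real PDE solve (assumption)
    return x ** 3

def P_inv(y):    # its analytic inverse (assumption)
    return torch.sign(y) * torch.abs(y) ** (1.0 / 3.0)

class SIPPhysics(torch.autograd.Function):
    """Forward: regular simulation. Backward: the simulator gradient is
    replaced by the update from the inverse simulator."""

    @staticmethod
    def forward(ctx, x):
        y = P(x)
        ctx.save_for_backward(x, y)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        x, y = ctx.saved_tensors
        dy = -grad_y            # for L = 1/2 ||y - y*||^2: grad_y = y - y*
        x_new = P_inv(y + dy)
        return -(x_new - x)     # acts as dL/dx, so a GD step moves x by Delta x_PG

x = torch.tensor([0.5], requires_grad=True)
y_star = torch.tensor([8.0])
loss = 0.5 * ((SIPPhysics.apply(x) - y_star) ** 2).sum()
loss.backward()
print(x.grad)                   # -(P_inv(8.0) - 0.5) = -1.5
```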
## Loss functions
In the above algorithm, we have assumed an $L^2$ loss, and without further explanation introduced a Newton step to propagate the inverse simulator step to the NN. Below, we explain and justify this treatment in more detail.
%Here an obvious questions is: Doesn't this leave us with the disadvantage of having to compute the inverse Hessian, as discussed before?
The central reason for introducing a Newton step is the improved accuracy for the loss derivative.
Unlike with regular Newton or the quasi-Newton methods from equation {eq}`quasi-newton-update`, we do not need the Hessian of the full system.
Instead, the Hessian is only needed for $L(y)$.
This makes Newton's method attractive again.
Even better, for many typical $L$ its computation can be completely forgone.
E.g., consider the most common supervised objective function, $L(y) = \frac 1 2 | y - y^*|_2^2$, as already put to use above. Here $y$ denotes the predicted value, and $y^*$ the target.
We then have $\frac{\partial L}{\partial y} = y - y^*$ and $\frac{\partial^2 L}{\partial y^2} = 1$.
Using equation {eq}`quasi-newton-update`, we get $\Delta y = \eta \cdot (y^* - y)$ which can be computed without evaluating the Hessian.
Once $\Delta y$ is determined, the gradient can be backpropagated to earlier time steps using the inverse simulator $\mathcal P^{-1}$. We've already used this combination of a Newton step for the loss and an inverse simulator for the PDE in {doc}`physgrad-comparison`.
The loss here acts as a _proxy_ to embed the update from the inverse simulator into the network training pipeline.
It is not to be confused with a traditional supervised loss in $x$ space.
Due to the dependency of $\mathcal P^{-1}$ on the prediction $y$, it does not average multiple modes of solutions in $x$.
To demonstrate this, consider the case that GD is being used as solver for the inverse simulation.
Then the total loss is purely defined in $y$ space, reducing to a regular first-order optimization.
Hence, the proxy loss function simply connects the computational graphs of inverse physics and NN for backpropagation.
***xxx continue ***
## Iterations and time dependence
@ -105,6 +135,8 @@ Unless the simulator destroys information in practice, e.g., due to accumulated
## A learning toolbox
***rather discuss similarities with supervised?***
Taking a step back, what we have here is a flexible "toolbox" for propagating update steps
through different parts of a system to be optimized. An important takeaway message is that
the regular gradients we are working with for training NNs are not the best choice when PDEs are
@ -130,3 +162,11 @@ TODO, visual overview of toolbox , combinations
Details of PGs and additional examples can be found in the corresponding paper {cite}`holl2021pg`.
In the next sections we'll show examples of training physics-based NNs
with invertible simulations. (These will follow soon, stay tuned.)
---
## Inverse simulator updates in action
TODO example from SIP ICML paper?

View File

@ -34,21 +34,6 @@ XXX notes, open issues XXX
%- 2 remedies coming up:
% 1) Treating network and simulator as separate systems instead of a single black box, we'll derive different and improved update steps that replaces the gradient of the simulator. As this gradient is closely related to a regular gradient, but computed via physical model equations, we refer to this update (proposed by Holl et al. {cite}`holl2021pg`) as the _physical gradient_ (PG).
% [toolbox, but requires perfect inversion]
% 2) Treating them jointly, -> HIGs
% [analytical, more practical approach]
%```{admonition} Looking ahead
%:class: tip
%Below, we'll proceed in the following steps:
%- we'll first show the problems with regular gradient descent, especially for functions that combine small and large scales,
%- a central insight will be that an _inverse gradient_ is a lot more meaningful than the regular one,
%- finally, we'll show how to use inverse functions (and especially inverse PDE solvers) to compute a very accurate update that includes higher-order terms.
%```
![Divider](resources/divider3.jpg)
@ -227,8 +212,8 @@ As a first step towards fixing the aforementioned issues,
we'll consider what we'll call _inverse_ gradients (IGs).
Unfortunately, they come with their own set of problems, which is why they only represent an intermediate step (we'll revisit them in a more practical form later on).
Instead of $L$ (which is scalar), let's consider optimization problems for a generic, potentially non-scalar function $y(x)$.
This will typically be the physical simulator $\mathcal P$ later on, but to keep things general and readable, we'll call it $y$ for now. This setup implies an inverse problem: for $y = \mathcal P(x)$ we want to find an $x$ given a target $y^*$.
We define the update
$$
@ -300,7 +285,7 @@ which, somewhat surprisingly, is not a minimization problem anymore if we consid
Now, instead of evaluating $\mathcal P^{-1}$ once to obtain the solution, we can iteratively update a current approximation of the solution $x_0$ with an update that we'll call $\Delta x_{\text{PG}}$ when employing the inverse physical simulator.
It also turns out to be a good idea to employ a _local_ inverse that is conditioned on an initial guess for the solution $x$. We'll denote this local inverse with $\mathcal P^{-1}(y^*; x)$. As there are potentially very different $x$-space locations that result in very similar $y^*$, we'd like to find the one closest to the current guess. This is important for obtaining well-behaved solutions in multi-modal settings, where we'd like to avoid a solution manifold that consists of a set of very scattered points.
Equipped with these changes, we can formulate an optimization problem where a current state of the optimization $x_0$, with $y_0 = \mathcal P(x_0)$, is updated with
@ -309,7 +294,7 @@ $$
$$ (PG-def)
Here the step in $y$-space, $\Delta y$, is either the full distance $y^*-y_0$ or a part of it, in line with the learning rate from above and the $y$-step used for IGs.
When applying the update $\Delta x_{\text{PG}} = \mathcal P^{-1}(y_0 + \Delta y; x_0) - x_0$, it will produce $\mathcal P(x_0 + \Delta x_{\text{PG}}) = y_0 + \Delta y$ exactly, despite $\mathcal P$ being a potentially highly nonlinear function.
When rewriting this update in the typical gradient format, $\frac{\Delta x_{\text{PG}}}{\Delta y}$ replaces the gradient from the IG update {eq}`IG-def` above, and gives $\Delta x_{\text{PG}}$.
This expression yields a first iterative method that makes use of $\mathcal P^{-1}$, and as such leverages all its information, such as higher-order terms.
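The following small numerical check illustrates this exactness; the choice $\mathcal P(x) = e^x$ with its global inverse $\log$ is purely for illustration:

```python
import numpy as np

P, P_inv = np.exp, np.log        # toy simulator and its exact inverse

x0 = 1.0
y0 = P(x0)                       # ~2.718
dy = 2.0                         # a large, clearly non-infinitesimal y-step
dx_pg = P_inv(y0 + dy) - x0      # update from equation (PG-def)

print(P(x0 + dx_pg))             # exactly y0 + dy = ~4.718
print(y0 + P(x0) * dx_pg)        # linearized (GD-style) estimate: ~4.217, already off
```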
@ -352,16 +337,10 @@ As long as the physical process does _not destroy_ information, the Jacobian is
In fact, it is believed that information in our universe cannot be destroyed, so any physical process could in theory be inverted as long as we have perfect knowledge of the state.
Hence, it's not unreasonable to expect that $\mathcal P^{-1}$ can be formulated in many settings.
% We now show that these terms can help produce more stable updates than the IG alone, provided that $\mathcal P_{(x_0,z_0)}^{-1}$ is a sufficiently good approximation of the true inverse.
% Let $\mathcal P^{-1}(z)$ be the true inverse function to $\mathcal P(x)$, assuming that $\mathcal P$ is fully invertible.
### Fundamental theorem of calculus
To more clearly illustrate the advantages in non-linear settings, we
apply the fundamental theorem of calculus to rewrite the ratio $\Delta x_{\text{PG}} / \Delta y$ from above. This gives,
@ -387,7 +366,7 @@ This effectively amounts to _smoothing the objective landscape_ of an optimizati
The equations naturally generalize to higher dimensions by replacing the integral with a path integral along any differentiable path connecting $x_0$ and $x_0 + \Delta x_{\text{PG}}$ and replacing the local gradient by the local gradient in the direction of the path.
### Global and local inverse simulators
Let $\mathcal P$ be a function with a square Jacobian and $y = \mathcal P(x)$.
A global inverse function $\mathcal P^{-1}$ is defined only for bijective $\mathcal P$.
@ -401,7 +380,7 @@ For the generic $\mathcal P^{-1}$ to exist $\mathcal P$ would need to be globall
By contrast, a _local inverse_ only needs to exist and be accurate in the vicinity of $(x_0, y_0)$.
If a global inverse $\mathcal P^{-1}(y)$ exists, the local inverse approximates it and matches it exactly as $y \rightarrow y_0$.
More formally, $\lim_{y \rightarrow y_0} \frac{\mathcal P^{-1}(y; x_0) - \mathcal P^{-1}(y)}{|y - y_0|} = 0$.
Local inverse functions can exist, even when a global inverse does not.
Non-injective functions can be inverted, for example, by choosing the closest $x$ to $x_0$ such that $\mathcal P(x) = y$.
@ -410,4 +389,36 @@ That is because the inverse Jacobian $\frac{\partial x}{\partial y}$ itself is a
Even when the Jacobian is singular (because the function is not injective, chaotic or noisy), we can usually find good local inverse functions.
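As a hedged, minimal example, consider the non-injective $\mathcal P(x) = x^2$: conditioning the inverse on the sign of the current guess $x_0$ picks, among the two preimages, the one closest to $x_0$ (the function name below is our own):

```python
import numpy as np

def P(x):
    return x ** 2                   # non-injective: P(-x) = P(x)

def P_inv_local(y, x0):
    """Local inverse conditioned on the guess x0: of the two preimages
    +sqrt(y) and -sqrt(y), return the one on x0's side of the origin."""
    return np.sign(x0) * np.sqrt(y)

x0, y_target = -3.0, 4.0
print(P_inv_local(y_target, x0))    # -2.0, not +2.0: stays close to x0
```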
### Integrating a loss function
Since introducing IGs, we've only considered a simulator with an output $y$. Now we can re-introduce the loss function $L$.
As before, we consider minimization problems with a scalar objective function $L(y)$ that depends on the result of an invertible simulator $y = \mathcal P(x)$.
%In {doc}`physgrad`
In equation {eq}`IG-def` we've introduced the inverse gradient (IG) update, which gives $\Delta x = \frac{\partial x}{\partial L} \cdot \Delta L$ when the loss function is included.
Here, $\Delta L$ denotes a step to take in terms of the loss.
By applying the chain rule and substituting the update from the inverse physics simulator of equation {eq}`PG-def` for the IG, we obtain, up to first order:
$$
\begin{aligned}
\Delta x_{\text{PG}}
&= \frac{\partial x}{\partial L} \cdot \Delta L
\\
&= \frac{\partial x}{\partial y} \left( \frac{\partial y}{\partial L} \cdot \Delta L \right)
\\
&= \frac{\partial x}{\partial y} \cdot \Delta y
\\
&= \mathcal P^{-1}(y_0 + \Delta y; x_0) - x_0 + \mathcal O(\Delta y^2)
\end{aligned}
$$
These equations show that equation {eq}`PG-def` is equal to the IG from the section above up to first order, but contains nonlinear terms, i.e.
$ \Delta x_{\text{PG}} / \Delta y = \frac{\partial x}{\partial y} + \mathcal O(\Delta y^2) $.
The accuracy of the update depends on the fidelity of the inverse function $\mathcal P^{-1}$.
We can define an upper limit to the error of the local inverse using the local gradient $\frac{\partial x}{\partial y}$.
In the worst case, we can therefore fall back to the regular gradient.
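This first-order agreement is easy to verify numerically; below we compare the ratio $\Delta x_{\text{PG}} / \Delta y$ with the IG $\frac{\partial x}{\partial y} = 1/\mathcal P'(x_0)$ for a toy $\mathcal P(x) = x^3$ (again purely illustrative):

```python
import numpy as np

P, P_inv = lambda x: x ** 3, np.cbrt
x0 = 2.0
y0 = P(x0)
ig = 1.0 / (3.0 * x0 ** 2)          # inverse gradient dx/dy = 1/P'(x0) = 1/12

for dy in [1.0, 0.1, 0.01]:
    dx_pg = P_inv(y0 + dy) - x0     # update from equation (PG-def)
    print(dy, dx_pg / dy, ig)       # ratio -> IG as dy -> 0, as predicted
```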
Also, we have turned the step w.r.t. $L$ into a step in $y$ space: $\Delta y$.
However, this does not prescribe a unique way to compute $\Delta y$ since the derivative $\frac{\partial y}{\partial L}$ as the right-inverse of the row-vector $\frac{\partial L}{\partial y}$ puts almost no restrictions on $\Delta y$.
Instead, we use a Newton step (equation {eq}`quasi-newton-update`) to determine $\Delta y$ where $\eta$ controls the step size of the optimization steps. We will explain this in more detail in connection with the introduction of NNs in the next section.