update physgrad intro
parent e6a02ccb29
commit ac4faf08bb
@ -23,7 +23,7 @@ This gives a minimization problem to find $f(x;\theta)$ such that $e$ is minimized
In the simplest case, we can use an $L^2$ error, giving

$$
\text{arg min}_{\theta} \| f(x;\theta) - y^* \|_2^2 .
$$ (learn-l2)

We typically optimize, i.e. _train_,
physgrad.md
@ -22,25 +22,24 @@ Finally, we'll explain several alternatives to prevent these problems. It turns
Below, we'll proceed in the following steps:
- Show how scaling issues and multi-modality can negatively affect NN training.
- Spoiler: What was missing in our training runs with GD or Adam so far is a proper _inversion_ of the Jacobian matrix.
- We'll explain two alternatives to prevent these problems: an analytical full-, and a numerical half-inversion.

```


XXX notes, open issues XXX
- GD - is "diff. phys." , rename? add supervised before?
- comparison notebook: add legends to plot
- double check func-comp w QNewton, "later" derivatives of backprop means what?



%- 2 remedies coming up:
% 1) Treating network and simulator as separate systems instead of a single black box, we'll derive different and improved update steps that replace the gradient of the simulator. As this gradient is closely related to a regular gradient, but computed via physical model equations, we refer to this update (proposed by Holl et al. {cite}`holl2021pg`) as the _physical gradient_ (PG).
% [toolbox, but requires perfect inversion]
% 2) Treating them jointly -> HIGs
% [analytical, more practical approach]


XXX PG physgrad chapter notes from dec 23 XXX
- GD - is "diff. phys." , rename? add supervised before?
- comparison notebook: add legends to plot
- summary "tightest possible" bad -> rather, illustrates what ideal direction can do
- double check func-comp w QNewton, "later" derivatives of backprop means what?
- remove IGs?


%```{admonition} Looking ahead
%:class: tip
%Below, we'll proceed in the following steps:
@ -57,7 +56,11 @@ XXX PG physgrad chapter notes from dec 23 XXX

As before, let $L(x)$ be a scalar loss function, subject to minimization. The goal is to compute a step in terms of the input parameters $x$, denoted by $\Delta x$. The different versions of $\Delta x$ will be denoted by a subscript.

All NNs of the previous chapters were trained with gradient descent (GD) via backpropagation. GD with backprop was also employed for the PDE solver (_simulator_) $\mathcal P$.
% , with an evaluation chain $L(\mathcal P(x))$.
When we simplify the setting, and leave out the NN for a moment, this gives the minimization problem
$\text{arg min}_{x} L(x)$ with $L(x) = \frac 1 2 \| \mathcal P(x) - y^* \|_2^2$.
As a central quantity, we have the composite gradient
$(\partial L / \partial x)^T$ of the loss function $L$:

$$
@ -280,52 +283,80 @@ Thus, we now consider the fact that inverse gradients are linearizations of inve

---

## Inverse simulators

### Inverse simulators
So far we've discussed the problems of existing methods, and a common theme among the methods that do better, Newton and IGs, is that the regular gradient is not sufficient. We somehow need to address its problems with some form of _inversion_. Before going into details of NN training and numerical methods to perform this inversion, we will consider one additional "special" case that will further illustrate the need for inversion: if we can make use of an _inverse simulator_, this likewise addresses many of the inherent issues of GD. It actually represents the ideal setting for computing update steps for the physics simulation part.

Let $y = \mathcal P(x)$ be a forward simulation, and $\mathcal P^{-1}(y)=x$ denote its inverse.
In contrast to the inversion of Jacobian or Hessian matrices from before, $\mathcal P^{-1}$ denotes a full inverse of all functions of $\mathcal P$.
Employing the inverse solver in the minimization problem above yields

$$
\text{arg min}_{x} \frac 1 2 \| x - \mathcal P^{-1}(y^*) \|_2^2 ,
$$ (pg-inverse-problem)

which, somewhat surprisingly, is not a minimization problem anymore if we consider single cases with one $x,y^*$ pair. We basically just need to solve the inverse problem by evaluating $\mathcal P^{-1}(y^*)$ to obtain $x$. As we plan to bring back NNs and more complex scenarios soon, let's assume that we are still dealing with a collection of $y^*$ targets, and non-obvious solutions $x$. One example could be that we're looking for an $x$ that yields multiple $y^*$ targets with minimal distortions in terms of $L^2$.
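
To make this concrete, here is a minimal sketch with a hypothetical, invertible 1D "simulator" (the cubic $\mathcal P$ and all names below are illustrative assumptions, not the solvers used later): a single evaluation of $\mathcal P^{-1}$ replaces the iterative minimization that GD would need.

```python
import numpy as np

# Hypothetical 1D simulator and its analytical inverse (illustrative only).
def P(x):
    return x**3           # forward simulation y = P(x)

def P_inv(y):
    return np.cbrt(y)     # full inverse x = P^-1(y)

y_star = 8.0              # target y*

# Solving the inverse problem directly: one evaluation, no minimization needed.
x_direct = P_inv(y_star)
print(x_direct, P(x_direct))                # 2.0 , 8.0

# For comparison, GD on L(x) = 1/2 |P(x)-y*|^2 needs many small steps.
x, eta = 0.5, 1e-3
for _ in range(100):
    x -= eta * (P(x) - y_star) * 3 * x**2   # dL/dx = (P(x)-y*) * P'(x)
print(x, P(x))                              # approaches x=2 only gradually
```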

Now, instead of evaluating $\mathcal P^{-1}$ once to obtain the solution, we can iteratively update a current approximation of the solution $x_0$ with an update that we'll call $\Delta x_{\text{PG}}$ when employing the inverse physical simulator.

It also turns out to be a good idea to employ a _local_ inverse that is conditioned on an initial guess for the solution $x$. We'll denote this local inverse with $\mathcal P^{-1}(y^*; x)$. As there are potentially large regions in $x$-space that satisfy reaching $y^*$, we'd like to find the one closest to the current guess. This is important to obtain well-behaved solutions in multi-modal settings, where we'd like to avoid a solution manifold that consists of a set of very scattered points.

Equipped with these changes, we can formulate an optimization problem where a current state of the optimization $x_0$, with $y_0 = \mathcal P(x_0)$, is updated with

$$
\frac{\Delta x_{\text{PG}} }{\Delta y} \equiv \big( \mathcal P^{-1} (y_0 + \Delta y; x_0) - x_0 \big) / \Delta y .
$$ (PG-def)

Here the step in $y$-space, $\Delta y$, is either the full distance $y^*-y_0$ or a part of it, in line with the learning rate from above, or the $y$-step used for IGs.
When applying the update $\Delta x_{\text{PG}} = \mathcal P^{-1}(y_0 + \Delta y) - x_0$ it will produce $\mathcal P(x_0 + \Delta x) = y_0 + \Delta y$ exactly, despite $\mathcal P$ being a potentially highly nonlinear function.
When rewriting this update in the typical gradient format, $\frac{\Delta x_{\text{PG}}}{\Delta y}$ replaces the gradient from the IG update above {eq}`IG-def`, and gives $\Delta x$.

This expression yields a first iterative method that makes use of $\mathcal P^{-1}$, and as such leverages all its information, such as higher-order terms.
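
As a sketch of such an iterative method (the exponential $\mathcal P$ and all names below are illustrative assumptions, not the setup used in later sections), we can take partial steps $\Delta y = \eta\,(y^* - y_0)$ and apply the update defined above:

```python
import numpy as np

# Hypothetical, strongly nonlinear 1D simulator with an analytical inverse
# (illustrative assumption only).
def P(x):
    return np.exp(x)

def P_inv(y, x0=None):    # x0 would condition a local inverse; unused for a bijective P
    return np.log(y)

y_star = 10.0   # target
x = -2.0        # initial guess x_0
eta = 0.5       # step size in y-space, akin to a learning rate

for i in range(10):
    y = P(x)
    dy = eta * (y_star - y)          # partial step Delta y
    dx_pg = P_inv(y + dy, x) - x     # inverse-simulator update
    # Defining property: P(x + dx_pg) reproduces y + dy exactly,
    # despite P being highly nonlinear.
    assert np.isclose(P(x + dx_pg), y + dy)
    x = x + dx_pg

print(x, P(x))   # converges towards log(10) and y* = 10
```

Each step reproduces $y_0+\Delta y$ exactly, so the remaining distance to $y^*$ shrinks by a factor of $(1-\eta)$ per iteration.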

## Summary

The update obtained with a regular gradient descent method has surprising shortcomings.
Classical, inversion-based methods like IGs and Newton's method remove some of these shortcomings,
with the somewhat theoretical construct of the update from inverse simulators ($\Delta x_{\text{PG}}$)
retaining the most higher-order information.
As such, it is interesting to consider it as an "ideal" setting for improved (inverted) update steps.

In contrast to the second- and first-order approximations from Newton's method and IGs, it can potentially take highly nonlinear effects into account. This comes at the cost of requiring an expression and discretization for a local inverse solver, but the main goal of the following sections is to illustrate how much we can gain from including all the higher-order information. Note that all three methods successfully include a rescaling of the search direction via inversion, in contrast to the previously discussed GD training. All of these methods represent different forms of differentiable physics, though.

Before moving on to including improved updates in NN training processes, we will discuss some additional theoretical aspects,
and then illustrate the differences between these approaches with a practical example.



```{note}
The following sections will provide an in-depth look ("deep dive") into
optimizations with inverse solvers. If you're interested in practical examples
and connections to NNs, feel free to skip ahead to {doc}`physgrad-code` or
{doc}`physgrad-nn`, respectively.
```

![Divider](resources/divider3.jpg)


## Deep Dive into Inverse simulators

We'll now derive and discuss the $\Delta x_{\text{PG}}$ update in more detail.
Physical processes can be described as a trajectory in state space where each point represents one possible configuration of the system.
A simulator typically takes one such state space vector and computes a new one at another time.
The Jacobian of the simulator is, therefore, necessarily square.
%
As long as the physical process does _not destroy_ information, the Jacobian is non-singular.
In fact, it is believed that information in our universe cannot be destroyed, so any physical process could in theory be inverted as long as we have perfect knowledge of the state.
Hence, it's not unreasonable to expect that $\mathcal P^{-1}$ can be formulated in many settings.
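
This square-Jacobian structure is easy to check for a toy time-stepping simulator. The sketch below (the damped-oscillator step and all names are made-up assumptions for illustration) computes the Jacobian of a state-to-state map and verifies that it is square and non-singular:

```python
import numpy as np

def simulator_step(s, dt=0.1):
    """One explicit time step for a damped oscillator, mapping a state
    s = (position, velocity) to the next state (illustrative assumption)."""
    x, v = s
    return np.array([x + dt * v, v - dt * (x + 0.2 * v)])

def jacobian(f, s, eps=1e-6):
    """Finite-difference Jacobian of a state-to-state map."""
    n = len(s)
    J = np.zeros((n, n))
    for i in range(n):
        ds = np.zeros(n)
        ds[i] = eps
        J[:, i] = (f(s + ds) - f(s - ds)) / (2 * eps)
    return J

J = jacobian(simulator_step, np.array([1.0, 0.0]))
print(J.shape)            # (2, 2): square, state in -> state out
print(np.linalg.det(J))   # non-zero: this step does not destroy information
```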

XXX

??? While evaluating the IGs directly can be done through matrix inversion or taking the derivative of an inverse simulator, ???

We now consider a somewhat theoretical construct: what can we do if we have access to an

XXX

... what happens if we use the inverse simulator directly in backpropagation.
Let $y = \mathcal P(x)$ be a forward simulation, and $\mathcal P^{-1}(y)=x$ its inverse (we assume it exists for now, but below we'll relax that assumption).
Equipped with the inverse, we now define an update that we'll call the **physical gradient** (PG) {cite}`holl2021pg` in the following as

$$
\frac{\Delta x_{\text{PG}} }{\Delta y} \equiv \big( \mathcal P^{-1} (y_0 + \Delta y) - x_0 \big) / \Delta y
$$ (PG-def)


% Original: \begin{equation} \label{eq:pg-def} \frac{\Delta x}{\Delta z} \equiv \mathcal P_{(x_0,z_0)}^{-1} (z_0 + \Delta z) - x_0 = \frac{\partial x}{\partial z} + \mathcal O(\Delta z^2)

% add ? $ / \Delta z $ on the right!? the above only gives $\Delta x$, see below

Note that equation {eq}`PG-def` is equal to the IG from the section above up to first order, but contains nonlinear terms, i.e.
$ \Delta x_{\text{PG}} / \Delta y = \frac{\partial x}{\partial y} + \mathcal O(\Delta y^2) $.
%
The accuracy of the update depends on the fidelity of the inverse function $\mathcal P^{-1}$.
We can define an upper limit to the error of the local inverse using the local gradient $\frac{\partial x}{\partial y}$.
In the worst case, we can therefore fall back to the regular gradient.
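
One possible practical reading of this fallback, as a heavily simplified sketch (the function names, the fidelity check via one extra forward evaluation, and the plain GD fallback are illustrative assumptions, not a prescription from the text):

```python
import numpy as np

def inverse_or_gd_step(P, P_inv_approx, grad_L, x0, y_target, eta=0.5, tol=1e-3):
    """Illustrative sketch: attempt an update via an approximate (local) inverse
    simulator; if it does not reproduce y0 + dy faithfully, fall back to a plain
    gradient step on L(x) = 1/2 |P(x) - y_target|^2.
    P, P_inv_approx and grad_L are assumed to be user-supplied callables."""
    y0 = P(x0)
    dy = eta * (y_target - y0)                          # partial step in y-space
    dx = P_inv_approx(y0 + dy, x0) - x0                 # candidate inverse-simulator update
    if np.linalg.norm(P(x0 + dx) - (y0 + dy)) < tol:    # how faithful is the local inverse?
        return dx
    return -eta * grad_L(x0)                            # worst case: regular gradient step
```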

% We now show that these terms can help produce more stable updates than the IG alone, provided that $\mathcal P_{(x_0,z_0)}^{-1}$ is a sufficiently good approximation of the true inverse.
% Let $\mathcal P^{-1}(z)$ be the true inverse function to $\mathcal P(x)$, assuming that $\mathcal P$ is fully invertible.

The intuition for why the PG update is a good one is that when
applying the update $\Delta x_{\text{PG}} = \mathcal P^{-1}(y_0 + \Delta y) - x_0$ it will produce $\mathcal P(x_0 + \Delta x) = y_0 + \Delta y$ exactly, despite $\mathcal P$ being a potentially highly nonlinear function.
When rewriting this update in the typical gradient format, $\frac{\Delta x_{\text{PG}}}{\Delta y}$ replaces the gradient from the IG update above, and gives $\Delta x$.


**Fundamental theorem of calculus**

To more clearly illustrate the advantages in non-linear settings, we
@ -352,45 +383,27 @@ This effectively amounts to _smoothing the objective landscape_ of an optimization
The equations naturally generalize to higher dimensions by replacing the integral with a path integral along any differentiable path connecting $x_0$ and $x_0 + \Delta x_{\text{PG}}$ and replacing the local gradient by the local gradient in the direction of the path.


![Divider](resources/divider2.jpg)



**Global and local inverse simulators**

Let $\mathcal P$ be a function with a square Jacobian and $y = \mathcal P(x)$.
A global inverse function $\mathcal P^{-1}$ is defined only for bijective $\mathcal P$.
If the inverse exists, it can find $x$ for any $y$ such that $y = \mathcal P(x)$.

Instead of using this "perfect" inverse $\mathcal P^{-1}$ directly, we'll in practice often use a local inverse
$\mathcal P_{(x_0,y_0)}^{-1}(y; x_0)$, which is conditioned on the point $x_0$, and correspondingly on
$y_0=\mathcal P(x_0)$.
This local inverse is easier to obtain, as it only needs to exist near a given $y_0$, and not for all $y$.
For the generic $\mathcal P^{-1}$ to exist, $\mathcal P$ would need to be globally invertible.

By contrast, a _local inverse_ only needs to exist and be accurate in the vicinity of $(x_0, y_0)$.
If a global inverse $\mathcal P^{-1}(y)$ exists, the local inverse approximates it and matches it exactly as $y \rightarrow y_0$.
More formally, $\lim_{y \rightarrow y_0} \frac{\mathcal P^{-1}_{(x_0, y_0)}(y) - \mathcal P^{-1}(y)}{|y - y_0|} = 0$.
Local inverse functions can exist, even when a global inverse does not.
Non-injective functions can be inverted, for example, by choosing the closest $x$ to $x_0$ such that $\mathcal P(x) = y$.
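
A minimal sketch of this choice for the non-injective example $\mathcal P(x)=x^2$ (purely illustrative; the function names are made up):

```python
import numpy as np

def P(x):
    return x**2                      # non-injective: P(x) = P(-x)

def P_inv_local(y, x0):
    """Local inverse of P at (x0, P(x0)): among all pre-images of y,
    return the one closest to the current guess x0 (illustrative sketch)."""
    roots = np.array([np.sqrt(y), -np.sqrt(y)])
    return roots[np.argmin(np.abs(roots - x0))]

# The branch is selected by the conditioning point x0:
print(P_inv_local(4.0, x0=1.5))      # ->  2.0
print(P_inv_local(4.0, x0=-0.1))     # -> -2.0
```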

With the local inverse, the PG is defined as

$$
\frac{\Delta x_{\text{PG}}}{\Delta y} \equiv \big( \mathcal P_{(x_0,y_0)}^{-1} (y_0 + \Delta y) - x_0 \big) / \Delta y
$$ (local-PG-def)

For differentiable functions, a local inverse is guaranteed to exist by the inverse function theorem as long as the Jacobian is non-singular.
That is because the inverse Jacobian $\frac{\partial x}{\partial y}$ itself is a local inverse function, albeit, being only first-order, not the most accurate one.
Even when the Jacobian is singular (because the function is not injective, chaotic or noisy), we can usually find good local inverse functions.
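
When no closed-form inverse is at hand, such a local inverse can be approximated numerically, e.g. with a Newton iteration on $\mathcal P(x)=y$ that starts from the conditioning point $x_0$. The sketch below is an assumption-laden illustration (the names and the Newton-based construction are not from the text) and reuses the non-injective $\mathcal P(x)=x^2$ from above:

```python
import numpy as np

def local_inverse(P, dP_dx, y, x0, steps=20):
    """Approximate the local inverse P^-1_{(x0,y0)}(y) by a Newton iteration on
    P(x) - y = 0, initialized at the conditioning point x0 (illustrative sketch)."""
    x = x0
    for _ in range(steps):
        x = x - (P(x) - y) / dP_dx(x)   # Newton step using the local Jacobian
    return x

# Same non-injective example as above, but without a closed-form square root:
P     = lambda x: x**2
dP_dx = lambda x: 2.0 * x

print(local_inverse(P, dP_dx, y=4.0, x0=1.5))    # ->  2.0 (branch near x0)
print(local_inverse(P, dP_dx, y=4.0, x0=-1.2))   # -> -2.0
```

Starting the iteration at $x_0$ is what selects the branch of the inverse, in line with the "closest $x$ to $x_0$" criterion above.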


---


## Summary

The update obtained with a regular gradient descent method has surprising shortcomings.
The physical gradient instead allows us to more accurately backpropagate through nonlinear functions, provided that we have access to good inverse functions.

Before moving on to including PGs in NN training processes, the next section will illustrate the differences between these approaches with a practical example.