first version of HIGs
This commit is contained in:
parent 5ae6a742ec
commit e7665da0c9

_toc.yml (10 lines changed)
@@ -1,7 +1,7 @@
format: jb-book
root: intro.md
parts:
-- caption: Introduction
+- caption: Introduction bla
  chapters:
  - file: intro-teaser.ipynb
  - file: overview.md
@@ -31,6 +31,14 @@ parts:
  - file: diffphys-code-sol.ipynb
  - file: diffphys-control.ipynb
  - file: diffphys-outlook.md
- caption: Improved Gradients
  chapters:
  - file: physgrad.md
  - file: physgrad-comparison.ipynb
  - file: physgrad-nn.md
  - file: physgrad-hig.md
  - file: physgrad-hig-code.ipynb
  - file: physgrad-discuss.md
- caption: Reinforcement Learning
  chapters:
  - file: reinflearn-intro.md
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Optimizations with Physical Gradients\n",
+"# Simple Example with Physical Gradients\n",
"\n",
"The previous section has made many comments about the advantages and disadvantages of different optimization methods. Below we'll show with a practical example how much difference these properties actually make.\n",
"\n",

physgrad-hig-code.ipynb (657 lines, new file)
File diff suppressed because one or more lines are too long

physgrad-hig.md (88 lines, new file)
@@ -0,0 +1,88 @@
Half-Inverse Gradients
=======================

The physical gradients (PGs) illustrated the importance of _inverting_ the direction of the update step (in addition to making use of higher-order terms). We'll now turn to an alternative for achieving this inversion, the _Half-Inverse Gradients_ (HIGs) {cite}`schnell2022hig`. They come with their own set of pros and cons, and thus provide an interesting alternative for computing improved update steps for physics-based deep learning tasks.

More specifically, unlike the PGs, they do not require an analytical inverse solver, and they jointly invert the neural network part as well as the physical model. As a drawback, they require an SVD of a large Jacobian matrix.


```{admonition} Preview: HIGs versus PGs
:class: tip

More specifically, unlike the PGs, the HIGs
- do not require an analytical inverse solver, and
- jointly invert the neural network part as well as the physical model.

As a drawback, HIGs
- require an SVD of a large Jacobian matrix, and
- are based on first-order information, like regular gradients.

In contrast to regular gradients, however, they use the full Jacobian matrix. As we'll see below, they typically outperform regular SGD and Adam significantly.

```

## Derivation

As mentioned during the derivation of PGs in {eq}`quasi-newton-update`, the update for regular Newton steps
uses the inverse Hessian matrix. If we rewrite this update for the network weights $\theta$ and neglect the mixed derivative terms, we arrive at the _Gauss-Newton_ method:

% \Delta \theta_{GN} = -\eta {\partial x}.
$$
\Delta \theta_{\mathrm{GN}}
= - \eta \Bigg( \bigg(\frac{\partial z}{\partial \theta}\bigg)^{\top} \cdot \bigg(\frac{\partial z}{\partial \theta}\bigg) \Bigg)^{-1} \cdot
\bigg(\frac{\partial z}{\partial \theta}\bigg)^{\top} \cdot \bigg(\frac{\partial L}{\partial z}\bigg)^{\top} .
$$ (gauss-newton-update-full)

For a full-rank Jacobian $\partial z / \partial \theta$, the transposed Jacobian cancels out, and the equation simplifies to

$$
\Delta \theta_{\mathrm{GN}}
= - \eta \bigg(\frac{\partial z}{\partial \theta}\bigg)^{-1} \cdot
\bigg(\frac{\partial L}{\partial z}\bigg)^{\top} .
$$ (gauss-newton-update)

This looks much simpler, but still leaves us with a Jacobian matrix to invert. This Jacobian is typically non-square and has small singular values, which is why, even when equipped with a pseudo-inverse, Gauss-Newton methods are not used for practical deep learning problems.
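For intuition, the simplified update {eq}`gauss-newton-update` amounts to applying a pseudo-inverse of the Jacobian to the loss gradient. Below is a minimal NumPy sketch of such a step (not taken from the accompanying notebook; `J` and `grad_z` are illustrative placeholders for $\partial z / \partial \theta$ and $(\partial L / \partial z)^\top$):

```python
import numpy as np

def gauss_newton_step(J, grad_z, eta=1.0):
    """Sketch of eq. (gauss-newton-update): delta_theta = -eta * pinv(J) @ grad_z."""
    # J: Jacobian dz/dtheta with shape (dim_z, n_params)
    # grad_z: loss gradient dL/dz with shape (dim_z,)
    return -eta * np.linalg.pinv(J) @ grad_z
```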

HIGs alleviate these difficulties by employing a partial inversion of the form

$$
\Delta \theta_{\mathrm{HIG}} = - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1/2} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{\top} ,
$$ (hig-update)

where the exponent $^{-1/2}$ denotes the half-inverse, which is computed via an SVD. I.e., for a matrix $A$ with the singular value decomposition $A = U \Lambda V^\top$,
we compute its half-inverse as $A^{-1/2} = V \Lambda^{-1/2} U^\top$, where $\Lambda$ contains the singular values.
During this step we can also take care of numerical noise in the form of small singular values: all entries
of $\Lambda$ smaller than a threshold $\tau$ are set to zero.

```{note} Truncation

It might seem attractive at first to clamp small singular values to a lower bound $\tau$, instead of discarding them by setting them to zero. However, the singular vectors corresponding to these small singular values are exactly the ones that are potentially unreliable: a singular value clamped to a small $\tau$ yields a large contribution of $\tau^{-1/2}$ during the inversion, so these singular vectors would cause problems when clamping. Hence, it's a much better idea to discard their content by setting their singular values to zero.

```
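To make the half-inversion concrete, here is a small sketch using NumPy's SVD; the function name and the default threshold are illustrative choices and not taken from the accompanying notebook:

```python
import numpy as np

def half_inverse(A, tau=1e-6):
    """Truncated half-inverse A^{-1/2} = V Lambda^{-1/2} U^T via an SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ np.diag(s) @ Vt
    s_half_inv = np.zeros_like(s)
    keep = s > tau
    # Discard (rather than clamp) small singular values, as discussed in the note above.
    s_half_inv[keep] = 1.0 / np.sqrt(s[keep])
    return Vt.T @ np.diag(s_half_inv) @ U.T

# HIG update sketch, cf. eq. (hig-update):
#   delta_theta = -eta * half_inverse(J) @ grad_y
# with J the (stacked) Jacobian dy/dtheta and grad_y the loss gradient (dL/dy)^T.
```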

In practice, the Jacobian in {eq}`hig-update` is evaluated for a mini-batch of size $b$: the per-sample Jacobians of the outputs $y_i$ with respect to the network weights $\theta$, each evaluated at the corresponding batch input $x_i$, are stacked vertically,

$$
\frac{\partial y}{\partial \theta} := \left(
\begin{array}{c}
\frac{\partial y_1}{\partial \theta}\big\vert_{x_1}\\
\frac{\partial y_2}{\partial \theta}\big\vert_{x_2}\\
\vdots\\
\frac{\partial y_b}{\partial \theta}\big\vert_{x_b}\\
\end{array}
\right)
$$
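As a small illustration of this stacking (a sketch only; `per_sample_jacobian` is a hypothetical helper, e.g. provided by an autodiff framework, that returns $\partial y_i / \partial \theta$ for a single input):

```python
import numpy as np

def stacked_jacobian(per_sample_jacobian, xs):
    """Stack the per-sample Jacobians dy_i/dtheta, evaluated at x_1,...,x_b, vertically."""
    # Each Jacobian has shape (dim_y, n_params); the result has shape (b * dim_y, n_params).
    return np.concatenate([per_sample_jacobian(x) for x in xs], axis=0)
```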

%

background? motivated by Adam, ...

%We've kept the $\eta$ in here for consistency, but in practice $\eta=1$ is used for Gauss-Newton

%PGs higher order, custom inverse, chain PDE & NN together

%HIG more generic, numerical inversion, joint physics & NN

physgrad.md (50 lines changed)
@@ -3,15 +3,45 @@ Physical Gradients

**Note, this chapter is very preliminary - probably not for the first version of the book. move after RL, before BNNs?**

The next chapter will questions some fundamental aspects of the formulations so far -- namely the gradients -- and aim for an even tighter integration of physics and learning.
The approaches explained previously all integrate physical models into deep learning algorithms.
Either as a physics-informed (PI) loss function or via differentiable physics (DP) operators embedded into the network.
In the PI case, the simulator is only required at training time, while for DP approaches, it also employed at inference time, it actually enables an end-to-end training of NNs and numerical solvers. Both employ first order derivatives to drive optimizations and learning processes, and we haven't questioned at all whether this is the best choice so far.
The next chapter will question some fundamental aspects of the formulations so far, namely the update step computed via gradients.
To re-cap, the approaches explained in the previous chapters either dealt with pure data, integrated the physical model as a physical loss term or included it via differentiable physics (DP) operators embedded into the network.
Supervised training with physical data is straight-forward.
The latter two methods share similarities, but in the loss term case, the evaluations are only required at training time. For DP approaches, the solver itself is also employed at inference time, which enables an end-to-end training of NNs and numerical solvers. All three approaches employ _first-order_ derivatives to drive optimizations and learning processes, the latter two also using them for the physical model terms.
This is a natural choice from a deep learning perspective, but we haven't questioned at all whether this is actually a good choice.

Not too surprising after this introduction: A central insight of the following chapter will be that regular gradients are often a _sub-optimal choice_ for learning problems involving physical quantities.
It turns out that both supervised and DP gradients have their pros and cons. In the following, we'll analyze this in more detail. In particular, we'll illustrate how the multi-modal problems (as hinted at in {doc}`intro-teaser`) negatively influence NNs. Then we'll show how scaling problems of DP gradients affect NN training. Finally, we'll explain several alternatives to prevent these problems. It turns out that a key property that is missing in regular gradients is a proper _inversion_ of the Jacobian matrix.


```{admonition} A preview of this chapter
:class: tip

Below, we'll proceed in the following steps:
- We'll illustrate how the multi-modal problems (as hinted at in {doc}`intro-teaser`) negatively influence NNs
- Then we'll show how scaling problems of DP gradients affect NN training.
- Finally we'll explain several alternatives to prevent these problems.
- It turns out that a key property that is missing in regular gradients is a proper _inversion_ of the Jacobian matrix.

```

%- 2 remedies coming up:
% 1) Treating network and simulator as separate systems instead of a single black box, we'll derive different and improved update steps that replaces the gradient of the simulator. As this gradient is closely related to a regular gradient, but computed via physical model equations, we refer to this update (proposed by Holl et al. {cite}`holl2021pg`) as the {\em physical gradient} (PG).
% [toolbox, but requires perfect inversion]
% 2) Treating them jointly, -> HIGs
% [analytical, more practical approach]



XXX PG physgrad chapter notes from dec 23 XXX
- recap formulation P(x)=z , L() ... etc. rename z to y?
- intro after dL/dx bad, Newton? discussion is repetitive
[older commment - more intro to quasi newton?]
- GD - is "diff. phys." , rename? add supervised before?
comparison:
- why z, rename to y?
- add legends to plot
- summary "tighest possible" bad -> rather, illustrates what ideal direction can do

A central insight the following chapter is that regular gradients are often a _sub-optimal choice_ for learning problems involving physical quantities.
Treating network and simulator as separate systems instead of a single black box, we'll derive a different update step that replaces the gradient of the simulator.
As this gradient is closely related to a regular gradient, but computed via physical model equations,
we refer to this update (proposed by Holl et al. {cite}`holl2021pg`) as the {\em physical gradient} (PG).

```{admonition} Looking ahead
:class: tip
@@ -117,7 +147,7 @@ stabilize the training. On the other hand, it also makes the learning process di
Quasi-Newton methods, such as BFGS and its variants, evaluate the gradient $\frac{\partial L}{\partial x}$ and Hessian $\frac{\partial^2 L}{\partial x^2}$ to solve a system of linear equations. The resulting update can be written as

$$
-\Delta x = \eta \cdot \left( \frac{\partial^2 L}{\partial x^2} \right)^{-1} \frac{\partial L}{\partial x}.
+\Delta x = -\eta \cdot \left( \frac{\partial^2 L}{\partial x^2} \right)^{-1} \frac{\partial L}{\partial x}.
$$ (quasi-newton-update)

where $\eta$, the scalar step size, takes the place of GD's learning rate and is typically determined via a line search.
@@ -252,7 +282,7 @@ In fact, it is believed that information in our universe cannot be destroyed so

While evaluating the IGs directly can be done through matrix inversion or taking the derivative of an inverse simulator, we now consider what happens if we use the inverse simulator directly in backpropagation.
Let $z = \mathcal P(x)$ be a forward simulation, and $\mathcal P^{-1}(z)=x$ its inverse (we assume it exists for now, but below we'll relax that assumption).
-Equipped with the inverse we now define an update that we'll call the **physical gradient** (PG) in the following as
+Equipped with the inverse we now define an update that we'll call the **physical gradient** (PG) {cite}`holl2021pg` in the following as

% Original: \begin{equation} \label{eq:pg-def} \frac{\Delta x}{\Delta z} \equiv \mathcal P_{(x_0,z_0)}^{-1} (z_0 + \Delta z) - x_0 = \frac{\partial x}{\partial z} + \mathcal O(\Delta z^2)

@@ -13,6 +13,14 @@
@STRING{NeurIPS = "Advances in Neural Information Processing Systems"}


@inproceedings{schnell2022hig,
  title={Half-Inverse Gradients for Physical Deep Learning},
  author={Schnell, Patrick and Holl, Philipp and Thuerey, Nils},
  booktitle={arXiv:21xx.yyyyy},
  year={2021},
  url={https://ge.in.tum.de/publications/},
}

@inproceedings{chen2021highacc,
  title={Towards high-accuracy deep learning inference of compressible turbulent flows over aerofoils},
  author={Chen, Li-Wei and Thuerey, Nils},

@@ -103,7 +111,7 @@
}

@inproceedings{holl2021pg,
-  title={Physical Gradients},
+  title={Physical Gradients for Deep Learning},
  author={Holl, Philipp and Koltun, Vladlen and Thuerey, Nils},
  booktitle={arXiv:2109.15048},
  year={2021},