updated discussion

NT 2022-04-18 13:30:24 +02:00
parent aa4f67eec7
commit 4e4016f705
2 changed files with 29 additions and 75 deletions


Discussion
=======================
At this point it's a good time to take another step back, and assess the different methods of the previous chapters. For deep learning applications, we can broadly distinguish three approaches: the _regular_ differentiable physics (DP) training, the training with half-inverse gradients (HIGs), and using the scale-invariant physics updates (SIPs). Unfortunately, we can't simply discard two of them, and focus on a single approach for all future endeavours. However, discussing the pros and cons sheds light on some fundamental aspects of physics-based deep learning.
![Divider](resources/divider7.jpg)
## Addressing scaling issues
First and foremost, a central motivation for improved updates is the need to address the scaling issues of the learning problems. This is not a completely new problem: numerous deep learning algorithms were proposed to address these for training NNs. However, the combination of NNs with physical simulations brings new challenges that at the same time provide new angles to tackle this problem. On the negative side, we have additional, highly non-linear operators from the PDE models. On the positive side, these operators typically do not have free parameters during learning, and thus can be treated with different, tailored methods.
This is exactly where HIGs and SIPs come in: instead of treating the physical simulation like the rest of the NNs (this is the DP approach), they show how much can be achieved with custom inverse solvers (SIPs) or a custom numerical inversion (HIGs).
## Computational Resources
Both cases usually lead to more complicated and resource intensive training. However, assuming that we can re-use a trained model many times after the training has been completed, there are many areas of application where this can quickly pay off: the trained NNs, despite being identical in runtime to those obtained from other training methods, often achieve significantly improved accuracies. Achieving similar levels of accuracy with regular Adam and DP-based training can be completely infeasible.
When such a trained NN is used, e.g., as a surrogate model for an inverse problem, it might be executed a large number of times, and the improved accuracy can save correspondingly large amounts of computational resources in such a follow-up stage.
A good potential example is shape optimization for the drag reduction of bodies immersed in a fluid {cite}`chen2021numerical`.
## A learning toolbox
***re-integrate?***
Taking a step back, what we have here is a flexible "toolbox" for propagating update steps
through different parts of a system to be optimized. An important takeaway message is that
the regular gradients we are working with for training NNs are not the best choice when PDEs are
involved. In these situations we can get much better information about how to direct the
optimization than the localized first-order information that regular gradients provide.
Above we've motivated a combination of inverse simulations, Newton steps, and regular gradients.
In general, it's a good idea to consider separately for each piece that makes up a learning
task what information we can get out of it for training an NN. The approach explained so far
gives us a _toolbox_ to concatenate update steps coming from the different sources, and due
to the very active research in this area we'll surely discover new and improved ways to compute
these updates.
***re-integrate?***
![Divider](resources/divider1.jpg)
## Summary
To summarize, this chapter demonstrated the importance of the inversion.
An important takeaway message is that
the regular gradients from NN training are not the best choice when PDEs are
involved. In these situations we can get much better information about how to direct the
optimization than the localized first-order information that regular gradients provide.
Even when the inversion is only done for the physics simulation component (as with SIPs), it can substantially improve the learning process. The custom inverse solvers allow us to employ higher-order information in the training.
✅ Pro SIP:
- Very accurate "gradient" information for physical simulations.
- Often strongly improved convergence and model performance.
❌ Con SIP:
- Requires inverse simulators (at least local ones).
- Less wide-spread availability than, e.g., differentiable physics simulators.
---
The HIGs on the other hand, go back to first order information in the form of Jacobians. They show how useful the inversion can be even without any higher order terms. At the same time, they make use of a combined inversion of NN and physics, taking into account all samples of a mini-batch to compute an optimal first-order direction.
✅ Pro HIG:
- Robustly addresses scaling issues, jointly for physical models and NN.
- Improved convergence and model performance.
❌ Con HIG:
- Requires an SVD for a potentially large Jacobian matrix.
- This can be costly in terms of runtime and memory.
---
xxx TODO, connect to uncert. chapter xxx
% DP basic, generic,
% PGs higher order, custom inverse , chain PDE & NN together
% HIG more generic, numerical inversion , joint physics & NN
%In a way, the learning via physical gradients provide the tightest possible coupling of physics and NNs: the full non-linear process of the PDE model directly steers the optimization of the NN.
%PG old: Naturally, this comes at a cost - invertible simulators are more difficult to build (and less common) than the first-order gradients from deep learning and adjoint optimizations. Nonetheless, if they're available, invertible simulators can speed up convergence, and yield models that have an inherently better performance.
However, in both cases, the resulting models can give a performance that we simply can't obtain by, e.g., training longer with a simpler DP or supervised approach. So, if we plan to evaluate these models often, e.g., shipping them in an application, this increased one-time cost can pay off in the long run.


Half-Inverse Gradients
=======================
The scale-invariant physics updates (SIPs) of the previous chapters illustrated the importance of _inverting_ the direction of the update step (in addition to making use of higher order terms). We'll now turn to an alternative for achieving the inversion, the so-called _Half-Inverse Gradients_ (HIGs) {cite}`schnell2022hig`. They come with their own set of pros and cons, and thus provide an interesting alternative for computing improved update steps for physics-based deep learning tasks.
Unlike the SIPs, they do not require an analytical inverse solver. The HIGs jointly invert the neural network part as well as the physical model. As a drawback, they require an SVD for a large Jacobian matrix.
```{admonition} Preview: HIGs versus SIPs (and versus Adam)
:class: tip
More specifically, the HIGs:
- do not require an analytical inverse solver (in contrast to SIPs),
- and they jointly invert the neural network part as well as the physical model.
As a drawback, HIGs:
- require an SVD for a large Jacobian matrix,
## Derivation
As mentioned during the derivation of inverse simulator updates in {eq}`quasi-newton-update`, the update for regular Newton steps uses the inverse Hessian matrix. If we rewrite its update for the case of an $L^2$ loss, we arrive at the _Gauss-Newton_ (GN) method:
$$
\Delta \theta_{\mathrm{GN}}
= - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1} \cdot
\bigg(\frac{\partial L}{\partial y}\bigg)^{\top} .
$$ (gauss-newton-update)
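As a quick check of this form (a brief sketch, assuming an $L^2$ loss $L = \frac{1}{2} \vert y(\theta) - \hat{y} \vert^2$ and a square, invertible Jacobian): the chain rule and the Gauss-Newton approximation of the Hessian give

$$
\bigg(\frac{\partial L}{\partial \theta}\bigg)^{\top} = \bigg(\frac{\partial y}{\partial \theta}\bigg)^{\top} \bigg(\frac{\partial L}{\partial y}\bigg)^{\top} , \qquad
\frac{\partial^2 L}{\partial \theta^2} \approx \bigg(\frac{\partial y}{\partial \theta}\bigg)^{\top} \frac{\partial y}{\partial \theta} ,
$$

and inserting both into the Newton step yields

$$
\Delta \theta \approx - \eta \cdot \bigg( \bigg(\frac{\partial y}{\partial \theta}\bigg)^{\top} \frac{\partial y}{\partial \theta} \bigg)^{-1} \bigg(\frac{\partial y}{\partial \theta}\bigg)^{\top} \bigg(\frac{\partial L}{\partial y}\bigg)^{\top}
= - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1} \bigg(\frac{\partial L}{\partial y}\bigg)^{\top} ,
$$

i.e., exactly the GN update above.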
This looks much simpler, but still leaves us with a Jacobian matrix to invert. This Jacobian is typically non-square and has small singular values, which cause problems during the inversion: naively applying methods like Gauss-Newton can quickly lead to exploding updates. However, as we're dealing with cases where we have a physics solver in the training loop, the small singular values are often relevant for the physics. Hence, we don't want to simply discard these parts of the learning signal, but rather preserve as many of them as possible.
This motivates the HIG update, which employs a partial and truncated inversion of the form
$$
\Delta \theta_{\mathrm{HIG}} = - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1/2} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{\top} ,
$$
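To make the role of the $-1/2$ exponent a bit more tangible, here is a minimal NumPy sketch (purely illustrative; the helper name `exponentiated_pinv`, the truncation threshold, and all constants are made up, and the batch- and network-stacking of Jacobians used by the actual HIG update is omitted). It applies the SVD-based construction $V \Lambda^{k} U^{\top}$ to a small, ill-conditioned toy Jacobian and compares the resulting update magnitudes for the regular gradient, a Gauss-Newton-like full inversion ($k=-1$), and a HIG-like half-inversion ($k=-1/2$):

```python
import numpy as np

def exponentiated_pinv(J, exponent=-0.5, rcond=1e-6):
    """Build V diag(s^exponent) U^T from the SVD J = U diag(s) V^T.

    exponent=-1 gives a (pseudo-)inverse as used by Gauss-Newton,
    exponent=-0.5 the half-inversion of the HIGs. Singular values below
    rcond * max(s) are truncated to zero instead of being exponentiated.
    """
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    s_exp = np.where(s > rcond * s.max(), s**exponent, 0.0)
    return Vt.T @ np.diag(s_exp) @ U.T

# Construct a toy "NN + physics" Jacobian dy/dtheta with singular values
# spanning several orders of magnitude, i.e., an ill-conditioned case.
rng = np.random.default_rng(0)
U, _, Vt = np.linalg.svd(rng.normal(size=(4, 4)))
J = U @ np.diag([1.0, 1e-1, 1e-3, 1e-5]) @ Vt

dL_dy = rng.normal(size=4)   # (dL/dy)^T, loss gradient w.r.t. the outputs
eta = 1e-2

updates = {
    "gradient (J^T)":      -eta * J.T @ dL_dy,
    "Gauss-Newton (k=-1)": -eta * exponentiated_pinv(J, -1.0) @ dL_dy,
    "HIG-like (k=-1/2)":   -eta * exponentiated_pinv(J, -0.5) @ dL_dy,
}
for name, d in updates.items():
    print(f"{name:22s} |delta theta| = {np.linalg.norm(d):.3e}")
```

The fully inverted update is amplified by the reciprocal of the smallest retained singular value, while the half-inversion keeps those directions in the update at a much milder scale, which is exactly the compromise between regular gradients and Gauss-Newton motivated above.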
Finally, the third graph on the right shows the evolution in terms of a single input-output pair. The starting point from the initial network state is shown in light gray, while the ground truth target $\hat{y}$ is shown as a black dot. Most importantly, all three methods reach the black dot in the end. For this simple example, it's not overly impressive to see this. However, it's still interesting that both GN and HIG exhibit large jumps in the initial stages of the learning process (the first few segments leaving the gray dot). This is caused by the fairly bad initial state, and the inversion, which leads to significant changes of the NN state and its outputs. In contrast, the momentum terms of Adam reduce this jumpiness: the initial jumps in the light blue line are smaller than those of the other two.
Overall, the behavior of all three methods is largely in line with what we'd expect: while the loss surely could go down more, and some of the steps in $y$ momentarily seem to go in the wrong direction, all three methods cope quite well with this case. Not surprisingly, this picture will change when making things harder with a more ill-conditioned Jacobian resulting from a small $\lambda$.
## Ill-conditioned
Note that for all examples so far, we've improved upon the _differentiable physics_ (DP) training from the previous chapters. I.e., we've focused on combinations of neural networks and PDE solving operators. The latter need to be differentiable for training with regular SGD, as well as for HIG-based training.
In contrast, for training with SIPs (from {doc}`physgrad-nn`), we even needed to provide a full inverse solver. As shown there, this has advantages, but differentiates SIPs from DP and HIGs. Thus, the HIGs share more similarities with, e.g., {doc}`diffphys-code-sol` and {doc}`diffphys-code-control`, than with the example {doc}`physgrad-code`.
This is a good time to give a specific code example of how to train physical NNs with HIGs: we'll look at a classic case, a system of coupled oscillators.
## xxx TODO , merge into HIG example code later on xxx
As an example problem for the Half-Inverse Gradients (HIGs) we'll consider controlling a system of coupled oscillators. This is a classical problem in physics, and a good case to evaluate the HIGs due to its small size. We're using two mass points, and thus we'll only have four degrees of freedom for position and velocity of both points (compared to, e.g., the $32\times32\times2$ unknowns we'd get even for "only" a small fluid simulation with 32 cells along x and y). Nonetheless, the oscillators are a highly non-trivial case: we aim to apply a control signal such that the initial state is reached again after a chosen time interval. Here we'll use 96 steps of a fourth-order Runge-Kutta scheme, and hence the NN has to learn how to best "nudge" the two mass points over the course of all time steps, so that they end up at the desired position with the right velocity at the right time.
A system of $N$ coupled oscillators is described by the following Hamiltonian (TODO: replace by PDE):
$$
\mathcal{H}(x_i,p_i,t)=\sum_i \bigg( \frac{x_i^2}{2}+ \frac{p_i^2}{2} + \alpha \cdot (x_i-x_{i+1})^4+u(t) \cdot x_i \cdot c_i\bigg),
$$
This Hamiltonian provides the basis for the RK4 time integration.
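As a preview of that example code, here is a compact sketch of how the Hamiltonian above turns into a time integrator (the values of $\alpha$, the $c_i$, the step size, the initial state, and the zero control sequence below are placeholder choices, and the quartic coupling term is counted once for the single pair of mass points): Hamilton's equations give $\dot{x}_i = \partial \mathcal{H} / \partial p_i = p_i$ and $\dot{p}_i = - \partial \mathcal{H} / \partial x_i$, which are advanced with a classical fourth-order Runge-Kutta step.

```python
import numpy as np

ALPHA = 0.1                  # coupling strength (placeholder value)
C = np.array([1.0, -1.0])    # control coefficients c_i (placeholder values)

def dynamics(state, u):
    """Time derivatives (dx/dt, dp/dt) for two coupled oscillators.

    state = [x_0, x_1, p_0, p_1]; from the Hamiltonian we get
    dx_i/dt = dH/dp_i = p_i  and  dp_i/dt = -dH/dx_i.
    """
    x, p = state[:2], state[2:]
    coupling = 4.0 * ALPHA * (x[0] - x[1])**3
    dp = -np.array([x[0] + coupling + u * C[0],
                    x[1] - coupling + u * C[1]])
    return np.concatenate([p, dp])

def rk4_step(state, u, dt):
    """One classical RK4 step with a piecewise-constant control value u."""
    k1 = dynamics(state, u)
    k2 = dynamics(state + 0.5 * dt * k1, u)
    k3 = dynamics(state + 0.5 * dt * k2, u)
    k4 = dynamics(state + dt * k3, u)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Roll out the 96 steps mentioned above; in the learning task the control
# values u_t would come from the NN instead of this zero placeholder.
state = np.array([1.0, -0.5, 0.0, 0.0])   # initial positions and momenta
controls = np.zeros(96)
dt = 0.1
for u in controls:
    state = rk4_step(state, u, dt)
print("final state:", state)
```

In the training setup, the NN supplies the per-step controls, and the loss compares the state after the 96 steps to the initial state, so that the system returns to where it started.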
xxx