minor updates

NT 2022-04-20 08:58:36 +02:00
parent bc4718ffe9
commit e84633aaa2
3 changed files with 13 additions and 17 deletions


@@ -1,4 +1,4 @@
-Discussion
+Discussion of Improved Gradients
=======================
At this point it's a good time to take another step back, and assess the different methods of the previous chapters. For deep learning applications, we can broadly distinguish three approaches: the _regular_ differentiable physics (DP) training, the training with half-inverse gradients (HIGs), and using the scale-invariant physics updates (SIPs). Unfortunately, we can't simply discard two of them, and focus on a single approach for all future endeavours. However, discussing the pros and cons sheds light on some fundamental aspects of physics-based deep learning.


@@ -28,16 +28,16 @@ As mentioned during the derivation of inverse simulator updates in {eq}`quasi-ne
$$
\Delta \theta_{\mathrm{GN}}
-= - \eta \Bigg( \bigg(\frac{\partial z}{\partial \theta}\bigg)^{T} \cdot \bigg(\frac{\partial z}{\partial \theta}\bigg) \Bigg)^{-1} \cdot
-\bigg(\frac{\partial z}{\partial \theta}\bigg)^{T} \cdot \bigg(\frac{\partial L}{\partial z}\bigg)^{\top} .
+= - \eta \Bigg( \bigg(\frac{\partial y}{\partial \theta}\bigg)^{T} \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg) \Bigg)^{-1} \cdot
+\bigg(\frac{\partial y}{\partial \theta}\bigg)^{T} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{T} .
$$ (gauss-newton-update-full)
-For a full-rank Jacobian $\partial z / \partial \theta$, the transposed Jacobian cancels out, and the equation simplifies to
+For a full-rank Jacobian $\partial y / \partial \theta$, the transposed Jacobian cancels out, and the equation simplifies to
$$
\Delta \theta_{\mathrm{GN}}
-= - \eta \bigg(\frac{\partial z}{\partial \theta}\bigg) ^{-1} \cdot
-\bigg(\frac{\partial L}{\partial z}\bigg)^{\top} .
+= - \eta \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1} \cdot
+\bigg(\frac{\partial L}{\partial y}\bigg)^{T} .
$$ (gauss-newton-update)
This looks much simpler, but still leaves us with a Jacobian matrix to invert. This Jacobian is typically non-square, and has small singular values which cause problems during inversion. Naively applying methods like Gauss-Newton can quickly explode. However, as we're dealing with cases where we have a physics solver in the training loop, the small singular values are often relevant for the physics. Hence, we don't want to just discard these parts of the learning signal, but rather preserve as many of them as possible.
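
To make the structure of this update concrete, here is a minimal NumPy sketch of the naive Gauss-Newton step from {eq}`gauss-newton-update`; the function name, shapes and toy values are illustrative assumptions, not code from the book. It uses a pseudo-inverse in place of the Jacobian inverse for non-square Jacobians, and directly exhibits the sensitivity to small singular values mentioned above:

```python
import numpy as np

def gauss_newton_step(J, dL_dy, eta=1.0):
    # J:     Jacobian dy/dtheta with shape (n_y, n_theta), assumed full rank
    # dL_dy: loss gradient dL/dy with shape (n_y,)
    # The pseudo-inverse plays the role of (dy/dtheta)^{-1} for non-square J;
    # small singular values are inverted directly, which can blow up the update.
    return -eta * np.linalg.pinv(J) @ dL_dy

# toy Jacobian with one small singular value
J = np.array([[1.0, 0.0],
              [0.0, 1e-4]])
dL_dy = np.array([0.1, 0.1])
print(gauss_newton_step(J, dL_dy))  # second component is amplified by a factor 1e4
```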
@@ -45,11 +45,11 @@ This looks much simpler, but still leaves us with a Jacobian matrix to invert. T
This motivates the HIG update, which employs a partial and truncated inversion of the form
$$
-\Delta \theta_{\mathrm{HIG}} = - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1/2} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{\top} ,
+\Delta \theta_{\mathrm{HIG}} = - \eta \cdot \bigg(\frac{\partial y}{\partial \theta}\bigg)^{-1/2} \cdot \bigg(\frac{\partial L}{\partial y}\bigg)^{T} ,
$$ (hig-update)
where the square-root for $^{-1/2}$ is computed via an SVD, and denotes the half-inverse. I.e., for a matrix $A$,
-we compute its half-inverse via a singular value decomposition as $A^{-1/2} = V \Lambda^{-1/2} U^\top$, where $\Lambda$ contains the singular values.
+we compute its half-inverse via a singular value decomposition as $A^{-1/2} = V \Lambda^{-1/2} U^T$, where $\Lambda$ contains the singular values.
During this step we can also take care of numerical noise in the form of small singular values. All entries
of $\Lambda$ smaller than a threshold $\tau$ are set to zero.
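
As an illustration, the truncated half-inverse and the resulting HIG step can be sketched in a few lines of NumPy; the function names and the threshold value are placeholder assumptions, not the reference implementation of {cite}`schnell2022hig`:

```python
import numpy as np

def half_inverse(A, tau=1e-6):
    # A^{-1/2} = V Lambda^{-1/2} U^T via an SVD of A; singular values
    # below the threshold tau are treated as numerical noise and set to zero.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv_sqrt = np.zeros_like(s)
    keep = s > tau
    s_inv_sqrt[keep] = s[keep] ** -0.5
    return Vt.T @ (s_inv_sqrt[:, None] * U.T)

def hig_step(J, dL_dy, eta=1.0, tau=1e-6):
    # HIG update of {eq}`hig-update`: -eta * J^{-1/2} dL/dy
    return -eta * half_inverse(J, tau) @ dL_dy
```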
@@ -59,7 +59,7 @@ It might seem attractive at first to clamp singular values to a small value $\ta
```
-The use of a partial inversion via $^{-1/2}$ instead of a full inversion with $^{-1}$ helps preventing that small eigenvalues lead to overly large contributions in the update step. This is inspired by Adam, which normalizes the search direction via $J/(\sqrt(diag(J^{\top}J)))$ instead of inverting it via $J/(J^{\top}J)$, with $J$ being the diagonal of the Jacobian matrix. For Adam, this compromise is necessary due to the rough approximation via the diagonal. For HIGs, we use the full Jacobian, and hence can do a proper inversion. Nonetheless, as outlined in the original paper {cite}`schnell2022hig`, the half-inversion regularizes the inverse and provides substantial improvements for the learning, while reducing the chance of gradient explosions.
+The use of a partial inversion via $^{-1/2}$ instead of a full inversion with $^{-1}$ helps prevent small eigenvalues from leading to overly large contributions in the update step. This is inspired by Adam, which normalizes the search direction via $J/\sqrt{\mathrm{diag}(J^{T}J)}$ instead of inverting it via $J/(J^{T}J)$, with $J$ being the diagonal of the Jacobian matrix. For Adam, this compromise is necessary due to the rough approximation via the diagonal. For HIGs, we use the full Jacobian, and hence can do a proper inversion. Nonetheless, as outlined in the original paper {cite}`schnell2022hig`, the half-inversion regularizes the inverse and provides substantial improvements for learning, while reducing the chance of gradient explosions.
## Constructing the Jacobian


@@ -51,8 +51,6 @@ To update the weights $\theta$ of the NN $f$, we perform the following update st
-% xxx TODO, make clear, we're solving the inverse problem $f(y; \theta)=x$
-% * Compute the scale-invariant update $\Delta x_{\text{PG}} = \mathcal P^{-1}(y + \Delta y; x_0) - x$ using an inverse simulator
This combined optimization algorithm depends on both the learning rate $\eta_\textrm{NN}$ for the network and the step size $\eta$ from above, which factors into $\Delta y$.
To first order, the effective learning rate of the network weights is $\eta_\textrm{eff} = \eta \cdot \eta_\textrm{NN}$.
@@ -68,8 +66,6 @@ This algorithm combines the inverse simulator to compute accurate, higher-order
In the above algorithm, we have assumed an $L^2$ loss, and without further explanation introduced a Newton step to propagate the inverse simulator step to the NN. Below, we explain and justify this treatment in more detail.
-%Here an obvious questions is: Doesn't this leave us with the disadvantage of having to compute the inverse Hessian, as discussed before?
The central reason for introducing a Newton step is the improved accuracy for the loss derivative.
Unlike with regular Newton or the quasi-Newton methods from equation {eq}`quasi-newton-update`, we do not need the Hessian of the full system.
Instead, the Hessian is only needed for $L(y)$.
@@ -80,15 +76,15 @@ E.g., consider the most common supervised objective function, $L(y) = \frac 1 2
We then have $\frac{\partial L}{\partial y} = y - y^*$ and $\frac{\partial^2 L}{\partial y^2} = 1$.
Using equation {eq}`quasi-newton-update`, we get $\Delta y = \eta \cdot (y^* - y)$ which can be computed right away, without evaluating any additional Hessian matrices.
-Once $\Delta y$ is determined, the gradient can be backpropagated to earlier time steps using the inverse simulator $\mathcal P^{-1}$. We've already used this combination of a Newton step for the loss and an inverse simulator for the PDE in {doc}`physgrad-comparison`.
+Once $\Delta y$ is determined, the gradient can be backpropagated to $x$, e.g. an earlier time, using the inverse simulator $\mathcal P^{-1}$. We've already used this combination of a Newton step for the loss and an inverse simulator for the PDE in {doc}`physgrad-comparison`.
-The loss here acts as a _proxy_ to embed the update from the inverse simulator into the network training pipeline.
+The loss in $x$ here acts as a _proxy_ to embed the update from the inverse simulator into the network training pipeline.
It is not to be confused with a traditional supervised loss in $x$ space.
Due to the dependency of $\mathcal P^{-1}$ on the prediction $y$, it does not average multiple modes of solutions in $x$.
To demonstrate this, consider the case where GD is used as the solver for the inverse simulation.
Then the total loss is purely defined in $y$ space, reducing to a regular first-order optimization.
-Hence, the proxy loss function simply connects the computational graphs of inverse physics and NN for backpropagation.
+Hence, to summarize: with SIPs we employ a trivial Newton step for the loss in $y$, and a proxy $L^2$ loss in $x$ that connects the computational graphs of inverse physics and NN for backpropagation.
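
As a sketch of how these pieces fit together in a single training step, the following PyTorch-style code assumes a hypothetical forward simulator `P`, an inverse simulator `P_inv` with an initial-guess argument `x0`, and the $L^2$ loss in $y$ from above; it illustrates the flow of the proxy loss rather than a complete SIP implementation:

```python
import torch

def sip_training_step(net, P, P_inv, net_input, y_star, optimizer, eta=1.0):
    x = net(net_input)                   # NN prediction in x space
    with torch.no_grad():
        y = P(x)                         # forward simulation
        dy = eta * (y_star - y)          # trivial Newton step for the L^2 loss in y
        x_pg = P_inv(y + dy, x0=x)       # inverse simulator gives the target point in x
    # proxy L^2 loss in x: connects the inverse-physics update to the NN graph
    loss_proxy = 0.5 * torch.sum((x - x_pg) ** 2)
    optimizer.zero_grad()
    loss_proxy.backward()
    optimizer.step()                     # applies the NN learning rate eta_NN
    return loss_proxy.item()
```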
## Iterations and time dependence
@@ -127,7 +123,7 @@ name: physgrad-sin-loss
Next we train a fully-connected neural network to invert this problem via equation {eq}`eq:unsupervised-training`.
We'll compare SIP training using a saddle-free Newton solver to various state-of-the-art network optimizers.
For fairness, the best learning rate is selected independently for each optimizer.
-When choosing $\xi=0$ the problem is perfectly conditioned. In this case all network optimizers converge, with Adam having a slight advantage. This is shown in the left graph:
+When choosing $\xi=1$ the problem is perfectly conditioned. In this case all network optimizers converge, with Adam having a slight advantage. This is shown in the left graph:
```{figure} resources/physgrad-sin-time-graphs.png
---
height: 180px