update PG chapter, fixing typos

This commit is contained in:
NT 2021-06-27 16:49:32 +02:00
parent 5fb03ba615
commit e88a8c76c3
8 changed files with 59 additions and 37 deletions

View File

@ -42,17 +42,17 @@
- file: reinflearn-intro.md
- file: reinflearn-code.ipynb
# - part: Physical Gradients
# chapters:
# - file: physgrad.md
# - file: physgrad-comparison.ipynb
# - file: physgrad-nn.md
# - file: physgrad-discuss.md
- part: PBDL and Uncertainty
chapters:
- file: bayesian-intro.md
- file: bayesian-code.ipynb
- part: Physical Gradients
chapters:
- file: physgrad.md
- file: physgrad-comparison.ipynb
- file: physgrad-nn.md
- file: physgrad-discuss.md
- part: Fast Forward Topics
chapters:

View File

@ -3,7 +3,7 @@ Introduction to Posterior Inference
We should keep in mind that for all measurements, models, and discretizations we have uncertainties. For measurements, this typically appears in the form of measurement errors, while model equations usually encompass only parts of a system we're interested in, and for numerical simulations we inherently introduce discretization errors. So a very important question to ask here is how sure we can be that an answer we obtain is the correct one. From a statistics viewpoint, we'd like to know the probability distribution for the posterior, i.e., the different outcomes that are possible.
### Uncertainty
## Uncertainty
This admittedly becomes even more difficult in the context of machine learning:
we're typically facing the task of approximating complex and unknown functions.
@ -31,6 +31,19 @@ learn something fundamentally different here: a full probability distribution
instead of a point estimate. (All previous chapters "just" dealt with
learning such point estimates, and the tasks were still far from trivial.)
```{admonition} Aleatoric and Epistemic Uncertainty
:class: tip
Although we won't go into detail within the scope of this book, many works
distinguish two types of uncertainty which are important to mention here:
- _Aleatoric_ uncertainty denotes uncertainty within the data, e.g., noise in measurements.
- _Epistemic_ uncertainty, on the other hand, describes uncertainties within a model such as a trained neural network.
In the following we'll primarily target _epistemic_ uncertainty via posterior inference.
However, as a word of caution: if they appear together, the different kinds of uncertainties (the two types above are not exhaustive) are very difficult to disentangle in practice.
```
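To make the distinction a bit more concrete, here is a minimal sketch (plain NumPy, with a hypothetical ensemble of trained models standing in for real networks) of how epistemic uncertainty is commonly estimated in practice: compare the predictions of several models and measure their spread.

```python
import numpy as np

# A hypothetical stand-in for an ensemble of independently trained networks:
# each "model" here is just a linear function with a perturbed weight.
rng = np.random.default_rng(42)
weights = rng.normal(2.0, 0.1, size=10)
models = [lambda x, w=w: w * x for w in weights]

x_test = 3.0
preds = np.array([m(x_test) for m in models])

# The spread across models approximates the epistemic uncertainty,
# i.e., how much the prediction varies due to the model itself.
print(f"mean: {preds.mean():.3f}, epistemic std: {preds.std():.3f}")
```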
![Divider](resources/divider5.jpg)
## Introduction to Bayesian Neural Networks

View File

@ -484,7 +484,7 @@
"id": "ooqVxCPM8PXl"
},
"source": [
"It looks simple here, but this simulation setup is a powerful tool. The simulation could easily be extended to more complex cases or 3D, and it is already fully compatible with back-propagation pipelines of deep learning frameworks. \n",
"It looks simple here, but this simulation setup is a powerful tool. The simulation could easily be extended to more complex cases or 3D, and it is already fully compatible with backpropagation pipelines of deep learning frameworks. \n",
"\n",
"In the next chapters we'll show how to use these simulations for training NNs, and how to steer and modify them via trained NNs. This will illustrate how much we can improve the training process by having a solver in the loop, especially when the solver is _differentiable_. Before moving to these more complex training processes, we will cover a simpler supervised approach in the next chapter. This is very fundamental: even when aiming for advanced physics-based learning setups, a working supervised training is always the first step."
]
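As a small, hedged illustration of this compatibility (a toy JAX sketch, not the notebook's actual phiflow solver), the following differentiates a loss through one simulation step:

```python
import jax
import jax.numpy as jnp

# Toy stand-in for one differentiable simulation step (not the notebook's
# actual solver): a single explicit Euler update.
def simulate(state, dt=0.1):
    return state + dt * jnp.roll(state, 1)

def loss(initial_state, target):
    return jnp.sum((simulate(initial_state) - target) ** 2)

x0, target = jnp.ones(8), jnp.zeros(8)

# Since the step is written with differentiable ops, the gradient w.r.t.
# the initial state is available via backpropagation:
grad_x0 = jax.grad(loss)(x0, target)
```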

View File

@ -338,7 +338,7 @@
"\n",
"This is quite straightforward in JAX: we can call `jax.jacobian` two times, and then use the JAX version of `linalg.inv` to invert the resulting matrix.\n",
"\n",
"For the optimization with Newton's method we'll use a larger step size of $\\eta =1/3$. For this example and the following one, we've chosen the stepsize such that the magnitude of the first update step is roughly the same as the one of GD. In this way, we can compare the trajectories of all three methods relative to each other. Note that this is by no means meant to illustrate or compare the stability of the methods here. Stability and upper limits for $\\eta$ are separate topics. Here we're focusing on convergence properties.\n",
"For the optimization with Newton's method we'll use a larger step size of $\\eta =1/3$. For this example and the following one, we've chosen the step size such that the magnitude of the first update step is roughly the same as the one of GD. In this way, we can compare the trajectories of all three methods relative to each other. Note that this is by no means meant to illustrate or compare the stability of the methods here. Stability and upper limits for $\\eta$ are separate topics. Here we're focusing on convergence properties.\n",
"\n",
"In the next cell, we apply the Newton updates ten times starting from the same initial guess:"
]
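For reference, a minimal sketch of this construction with a hypothetical toy objective (the notebook's actual function differs): `jax.jacobian` called twice yields the Hessian, which `jnp.linalg.inv` inverts for the Newton update with $\eta = 1/3$:

```python
import jax
import jax.numpy as jnp

def L(x):  # hypothetical toy objective, not the notebook's function
    return jnp.sum(x**4 + x**2)

grad_L = jax.jacobian(L)        # first call: the gradient
hess_L = jax.jacobian(grad_L)   # second call: the Hessian

def newton_step(x, eta=1.0 / 3.0):
    # Newton update: x - eta * H^{-1} grad
    return x - eta * jnp.linalg.inv(hess_L(x)) @ grad_L(x)

x = jnp.array([1.0, -0.5])
for _ in range(10):
    x = newton_step(x)
print(x)  # approaches the minimum at the origin
```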
@ -610,7 +610,7 @@
"\n",
"To understand the behavior and differences of the methods here, it's important to keep in mind that we're not dealing with a black box that maps between $\\mathbf{x}$ and $L$, but rather there are spaces in between that matter. In our case, we only have a single $\\mathbf{z}$ space, but for DL settings, we might have a large number of latent spaces, over which we have a certain amount of control. We will return to NNs soon, but for now let's focus on $\\mathbf{z}$. \n",
"\n",
"A first thing to note is that for PG, we explicitly map from $L$ to $\\mathbf{z}$, and then continue with a mapping to $\\mathbf{x}$. Thus we already obtained the trajectory in $\\mathbf{z}$ space, and not conincidentally, we already stored it in the `historyPGz` list above.\n",
"A first thing to note is that for PG, we explicitly map from $L$ to $\\mathbf{z}$, and then continue with a mapping to $\\mathbf{x}$. Thus we already obtained the trajectory in $\\mathbf{z}$ space, and not coincidentally, we already stored it in the `historyPGz` list above.\n",
"\n",
"Let's directly take a look what PG did in $\\mathbf{z}$ space:"
]
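To make the chain of mappings explicit, here's a schematic sketch with hypothetical toy functions (not the notebook's definitions): a PG step first moves in $\mathbf{z}$ space, records the position, and then applies the inverse simulator to return to $\mathbf{x}$:

```python
import jax.numpy as jnp

# Hypothetical toy forward map and its analytic inverse (placeholders for
# the notebook's simulator):
def P(x):
    return x ** 3

def P_inv(z):
    return jnp.sign(z) * jnp.abs(z) ** (1.0 / 3.0)

z_target, x, eta = 0.0, jnp.array(1.5), 0.3
historyPGz = []  # the z-space trajectory, as stored in the notebook

for _ in range(10):
    z = P(x)
    z = z + eta * (z_target - z)  # step in z space, derived from L
    historyPGz.append(float(z))
    x = P_inv(z)                  # inverse simulator maps back to x
```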
@ -884,7 +884,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Nice! It works, just like PG. Not much point plotting this, it's basiclly the PG version, but let's measure the difference. Below, we compute the MAE, which for this simple example turns out to be on the order of our floating point accuracy."
"Nice! It works, just like PG. Not much point plotting this, it's basically the PG version, but let's measure the difference. Below, we compute the MAE, which for this simple example turns out to be on the order of our floating point accuracy."
]
},
{

View File

@ -1,14 +1,14 @@
Discussion
=======================
In a way, the learning via physical gradients provides the tightest possible coupling
of physics and NNs: the full non-linear process of the PDE model directly steers
the optimization of the NN.
Naturally, this comes at a cost - invertible simulators are more difficult to build
(and less common) than the first-order gradients which are relatively commonly used
for learning processes and adjoint optimizations. Nonetheless, if they're available,
they can speed up convergence, and yield models that have an inherently better performance.
(and less common) than the first-order gradients from
deep learning and adjoint optimizations. Nonetheless, if they're available,
invertible simulators can speed up convergence, and yield models that have an inherently better performance.
Thus, once trained, these models can give a performance that we simply can't obtain
by, e.g., training longer with a simpler approach. So, if we plan to evaluate these
models often (e.g., ship them in an application), this increased one-time cost
@ -25,5 +25,4 @@ can pay off in the long run.
❌ Con:
- Requires inverse simulators (at least local ones).
- less wide-spread availability than, e.g., differentiable physics simulators.
- Less widespread availability than, e.g., differentiable physics simulators.

View File

@ -30,7 +30,7 @@ This equation has turned the step w.r.t. $L$ into a step in $z$ space: $\Delta z
However, it does not prescribe a unique way to compute $\Delta z$ since the derivative $\frac{\partial z}{\partial L}$ as the right-inverse of the row-vector $\frac{\partial L}{\partial z}$ puts almost no restrictions on $\Delta z$.
Instead, we use a Newton step (equation {eq}`quasi-newton-update`) to determine $\Delta z$ where $\eta$ controls the step size of the optimization steps.
Here an obvious questions is: Doesn't this leave us with the distadvantage of having to compute the inverse Hessian, as dicussed before?
Here an obvious question is: Doesn't this leave us with the disadvantage of having to compute the inverse Hessian, as discussed before?
Luckily, unlike with regular Newton or quasi-Newton methods, where the Hessian of the full system is required, here the Hessian is needed only for $L(z)$. Even better, for many typical $L$ its computation can be forgone entirely.
E.g., consider the case $L(z) = \frac 1 2 || z^\textrm{predicted} - z^\textrm{target}||_2^2$, which is the most common supervised objective function.
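Its gradient is simply $z^\textrm{predicted} - z^\textrm{target}$ and its Hessian is the identity, so the Newton step requires no inversion at all. A quick toy check (an illustrative addition, not part of the original text):

```python
import jax
import jax.numpy as jnp

z_target = jnp.array([1.0, 2.0])

def L(z):
    return 0.5 * jnp.sum((z - z_target) ** 2)

z = jnp.array([3.0, -1.0])
print(jax.hessian(L)(z))  # the identity matrix -> nothing to invert
print(jax.grad(L)(z))     # simply z - z_target
```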
@ -127,5 +127,6 @@ name: pg-toolbox
TODO, visual overview of toolbox, combinations
```
Details of PGs and additional examples can be found in the corresponding paper {cite}`holl2021pg`.
In the next sections we'll show examples of training physics-based NNs
with invertible simulations. (These will follow soon, stay tuned.)

View File

@ -1,13 +1,17 @@
Physical Gradients
=======================
**Note, this chapter is very preliminary - probably not for the first version of the book**
**Note, this chapter is very preliminary - probably not for the first version of the book. Move after RL, before BNNs?**
The next chapter will dive deeper into state-of-the-art-research, and aim for an even tighter
integration of physics and learning.
The approaches explained previously all integrate physical models into deep learning algorithms,
either as part of the loss function or via operators embedded into the network.
In the former case, the simulator is only required at training time, while in the latter it also employed at inference time. When using {doc}`diffphys`, it actually enables an end-to-end training of NNs.
The next chapter will question some fundamental aspects of the formulations so far -- namely the gradients -- and aim for an even tighter integration of physics and learning.
The approaches explained previously all integrate physical models into deep learning algorithms,
either as a physics-informed (PI) loss function or via differentiable physics (DP) operators embedded into the network.
In the PI case, the simulator is only required at training time, while for DP approaches it is also employed at inference time, which enables an end-to-end training of NNs and numerical solvers. Both employ first-order derivatives to drive optimizations and learning processes, and so far we haven't questioned whether this is the best choice.
A central insight of the following chapter is that regular gradients are often a _sub-optimal choice_ for learning problems involving physical quantities.
Treating network and simulator as separate systems instead of a single black box, we'll derive a different update step that replaces the gradient of the simulator.
As this update is closely related to a regular gradient, but computed via physical model equations,
we refer to it (proposed by Holl et al. {cite}`holl2021pg`) as the _physical gradient_ (PG).
```{admonition} Looking ahead
:class: tip
@ -23,20 +27,18 @@ Below, we'll proceed in the following steps:
## Overview
All NNs of the previous chapters were trained with gradient descent (GD) via backpropagation. GD, and hence backpropagation, was also employed for the PDE solver (_simulator_) $\mathcal P$, computing the composite gradient
$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \mathcal P(x)} \frac{\partial \mathcal P(x)}{\partial x}$ for the loss function $L$.
$\partial L / \partial x$ for the loss function $L$:
$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \mathcal P(x)} \frac{\partial \mathcal P(x)}{\partial x}
$$
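In code, this composite gradient is exactly what reverse-mode autodiff produces when the solver is differentiable. A minimal sketch (JAX, with a hypothetical one-line stand-in for $\mathcal P$):

```python
import jax
import jax.numpy as jnp

def P(x):               # hypothetical one-line stand-in for the simulator
    return jnp.sin(x) * x

def L(y):               # loss evaluated on the simulator output
    return jnp.sum(y ** 2)

x = jnp.array([0.7, 1.3])
# jax.grad applies the chain rule dL/dP(x) * dP(x)/dx automatically:
composite_grad = jax.grad(lambda x: L(P(x)))(x)
```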
In the field of classical optimization, techniques such as Newton's method or BFGS variants are commonly used to optimize numerical processes since they can offer better convergence speed and stability.
These methods likewise employ gradient information, but substantially differ from GD in the way they
compute the update step, typically via higher-order derivatives.
% cite{nocedal2006numerical}
A central insight the following chapter is that regular gradients are often a _sub-optimal choice_ for learning problems involving physical quantities.
Treating network and simulator as separate systems instead of a single black box, we'll derive a different update step that replaces the gradient of the simulator.
As this gradient is closely related to a regular gradient, but computed via physical model equations,
we refer to this update as the {\em physical gradient} (PG).
The PG can take into account nonlinearities to produce better optimization updates when an (full or approximate) inverse simulator is available.
The PG, which we'll derive below, can take nonlinearities into account to produce better optimization updates when a (full or approximate) inverse simulator is available.
In contrast to classic optimization techniques, we show how a differentiable or invertible physics
simulator can be leveraged to compute the PG without requiring higher-order derivatives of the simulator.
@ -78,7 +80,7 @@ Surprisingly, this very widely used construction has a number of undesirable pro
**Units** 📏
A first indicator that something is amiss with GD is that it inherently misrespresents dimensions.
A first indicator that something is amiss with GD is that it inherently misrepresents dimensions.
Assume two parameters $x_1$ and $x_2$ have different physical units.
Then the GD parameter updates scale with the inverse of these units because the parameters appear in the denominator of the GD update above.
The learning rate $\eta$ could compensate for this discrepancy but since $x_1$ and $x_2$ have different units, there exists no single $\eta$ to produce the correct units for both parameters.
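Spelled out as a dimensional check (a small clarifying addition, using the plain GD update):

$$
\Delta x = -\eta \frac{\partial L}{\partial x} \quad \Rightarrow \quad [\Delta x] = [\eta] \, \frac{[L]}{[x]}
$$

For a dimensionless $\eta$, the update thus carries units of $[L]/[x]$ rather than $[x]$.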
@ -124,8 +126,7 @@ This construction solves some of the problems of gradient descent from above, bu
**Units** 📏
Quasi-Newton methods definitely provide a much better handling of physical units than GD.
%Equation~\ref{eq:quasi-newton-update}
The quasi-Newton update
The quasi-Newton update from equation {eq}`quasi-newton-update`
produces the correct units for all parameters to be optimized, and $\eta$ can stay dimensionless.
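Written out as the same kind of dimensional check (assuming the standard Newton form of the update with Hessian $H$):

$$
\Delta x = -\eta \, H^{-1} \frac{\partial L}{\partial x} \quad \Rightarrow \quad [\Delta x] = \frac{[x]^2}{[L]} \cdot \frac{[L]}{[x]} = [x]
$$

since $H$ carries units of $[L]/[x]^2$, and hence $H^{-1}$ carries $[x]^2/[L]$.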
**Convergence near optimum** 💎
@ -223,7 +224,7 @@ The change in $x$ is $\Delta x = \Delta L \cdot \frac{\partial x}{\partial z} \f
The change in intermediate spaces is independent of their respective dependencies, at least up to first order.
Consequently, the change to these spaces can be estimated during backpropagation, before all gradients have been computed.
Note that even Newton's method with its inverse Hessian didn't fully get this right. The key here is that if the Jacobian is invertible, we'll direclty get the correctly scaled direction at a given layer, without "helpers" such as the inverse Hessian.
Note that even Newton's method with its inverse Hessian didn't fully get this right. The key here is that if the Jacobian is invertible, we'll directly get the correctly scaled direction at a given layer, without "helpers" such as the inverse Hessian.
**Limitations**

View File

@ -85,6 +85,14 @@
url={https://ge.in.tum.de/publications/2020-iclr-holl/},
}
@inproceedings{holl2021pg,
title={Physical Gradients},
author={Holl, Philipp and Koltun, Vladlen and Thuerey, Nils},
booktitle={arXiv},
year={2021},
url={https://ge.in.tum.de/publications/},
}
@article{prantl2019tranquil,
title={Tranquil clouds: Neural networks for learning temporally coherent features in point clouds},
author={Prantl, Lukas and Chentanez, Nuttapong and Jeschke, Stefan and Thuerey, Nils},