update supervised chapter

This commit is contained in:
NT
2021-05-16 11:29:51 +08:00
parent 05d2783759
commit cd3de70540
4 changed files with 885 additions and 998 deletions


@@ -11,6 +11,9 @@ fte = re.compile(r"👋")
# TODO , replace phi symbol w text in phiflow
# TODO , filter tensorflow warnings?
# also torch "UserWarning:"
path = "tmp2.txt" # simple
path = "tmp.txt" # utf8
#path = "book.tex-in.bak" # full utf8

File diff suppressed because one or more lines are too long


@@ -2,7 +2,7 @@ Discussion of Supervised Approaches
=======================
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
@@ -23,8 +23,19 @@ on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.
```{admonition} Best practices 👑
:class: tip
To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
- Always start with a 1-sample overfitting test.
- Check how many trainable parameters your network has.
- Slowly increase the amount of training data (and potentially network parameters and depth).
- Adjust hyperparameters (especially the learning rate).
- Then introduce other components such as differentiable solvers or adversarial training.
```
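As a minimal sketch of the first two rules, here's what a 1-sample overfitting test can look like in plain NumPy (all shapes, the learning rate, and the iteration count are made-up illustration values, not the book's setup): train a tiny two-layer network on a single input-output pair and check that the loss actually approaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)         # one input sample
y_star = rng.standard_normal(4)    # its reference output

W1 = rng.standard_normal((16, 8)) * 0.1
W2 = rng.standard_normal((4, 16)) * 0.1
n_params = W1.size + W2.size       # rule 2: know your parameter count (192 here)

lr = 0.05
for _ in range(2000):
    h = np.tanh(W1 @ x)            # forward pass through the hidden layer
    err = W2 @ h - y_star          # residual of the L2 loss
    g2 = np.outer(err, h)          # gradient w.r.t. W2
    dh = (W2.T @ err) * (1 - h**2) # backprop through tanh
    W2 -= lr * g2
    W1 -= lr * np.outer(dh, x)

loss = float(np.sum((W2 @ np.tanh(W1 @ x) - y_star) ** 2))
```

If `loss` does not end up near zero on a single sample, something is fundamentally wrong with the code, and there's no point in moving on to a larger dataset.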
### Stability
@@ -65,27 +76,33 @@ because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs. at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space, Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants.
To give a more specific example: if you always train
your networks for inputs in the range $[0\dots1]$, don't expect it to work
with inputs of $[27\dots39]$. In certain cases it's valid to normalize
inputs and outputs by subtracting the mean and dividing by the standard
deviation or a suitable quantile (make sure this doesn't destroy important
correlations in your data).
As a rule of thumb: make sure you actually train the NN on
inputs that are as similar as possible to those you want to use at inference time.
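A minimal sketch of this kind of mean/std normalization (with hypothetical data in the $[27\dots39]$ range from the example above): the key point is that the statistics come from the training data only, and exactly the same transform is re-applied to any inference-time input.

```python
import numpy as np

# hypothetical training inputs in the "wrong" range [27...39]
rng = np.random.default_rng(1)
train_x = rng.uniform(27.0, 39.0, size=(1000, 4))

# statistics computed once, from the training set only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

def normalize(x):
    """Map inputs to roughly zero mean and unit variance."""
    return (x - mean) / std

# the same transform is reused at inference time
x_n = normalize(train_x)
```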
This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process, otherwise
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., it will suffer from _distribution shift_.
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for other types of convolutions, e.g., a less regular mesh
in conjunction with graph-convolutions, or particle-based data
with continuous convolutions (cf. {doc}`others-lagrangian`). You will typically see reduced learning
performance in exchange for improved sampling flexibility when switching to these.
Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions,
@@ -108,8 +125,9 @@ To summarize, supervised training has the following properties.
❌ Con:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.
- Interactions with external "processes" (such as embedding into a solver) are difficult.
The next chapters will explain how to alleviate these shortcomings of supervised training.
First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.


@@ -2,41 +2,43 @@ Supervised Training
=======================
_Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of
deep learning (DL), of course, so it's still fairly new.
Also, "old fashioned" doesn't always mean bad - it's just that later on we'll discuss ways to train networks that clearly outperform approaches using supervised training.

Nonetheless, "supervised training" is a starting point for all projects one would encounter in the context of DL, and
hence it is worth studying. Also, while it typically yields inferior results to approaches that more tightly
couple with physics, it can be the only choice in certain application scenarios where no good
model equations exist.
## Problem setting
For supervised training, we're faced with an
unknown function $f^*(x)=y^*$, collect lots of pairs of data $[x_0,y^*_0], ...[x_n,y^*_n]$ (the training data set)
and directly train a NN to represent an approximation of $f^*$ denoted as $f$.
The $f$ we can obtain in this way is typically not exact,
but instead we obtain it via a minimization problem:
by adjusting the weights $\theta$ of our NN representation of $f$ such that

$$
\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
$$ (supervised-training)

This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
our choice of $f$ and the hyperparameters for training. Note that above we've assumed
the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
of a suitable metric is a topic we will get back to later on.
Irrespective of our choice of metric, this formulation
gives the actual "learning" process for a supervised approach.
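To make the notation concrete, here's a minimal sketch of this minimization for a toy linear "network" $f(x;\theta) = \theta \cdot x$ with the $L^2$ metric as the error $e$ (the data, learning rate, and iteration count are made up for illustration; a real setup would use an NN and an optimizer library):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0, 0.5])   # stands in for the unknown f*
X = rng.standard_normal((100, 3))         # inputs x_i
Y = X @ theta_true                        # reference outputs y*_i

def e(y, y_star):
    """Error metric: here the simple squared L2 distance."""
    return np.sum((y - y_star) ** 2)

def objective(theta):
    """sum_i e(f(x_i; theta), y*_i) for the linear model f(x) = theta . x"""
    return sum(e(X[i] @ theta, Y[i]) for i in range(len(X)))

# "learning": plain gradient descent on the weights theta
theta = np.zeros(3)
for _ in range(500):
    grad = 2.0 * X.T @ (X @ theta - Y)
    theta -= 0.005 * grad

final_loss = objective(theta)
```

Swapping `e` for a different metric changes what "as accurately as possible" means, but the structure of the minimization stays the same.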
The training data typically needs to be of substantial size, and hence it is attractive
to use numerical simulations solving a physical model $\mathcal{P}$
to produce a large number of reliable input-output pairs for training.
This means that the training process uses a set of model equations, and approximates
them numerically, in order to train the NN representation $\tilde{f}$. This
has quite a few advantages, e.g., we don't have measurement noise of real-world devices
and we don't need manual labour to annotate a large number of samples to get training data.
On the other hand, this approach inherits the common challenges of replacing experiments
@@ -44,7 +46,7 @@ with simulations: first, we need to ensure the chosen model has enough power to
behavior of real-world phenomena that we're interested in.
In addition, the numerical approximations have numerical errors
which need to be kept small enough for a chosen application. As these topics are studied in depth
for classical simulations, the existing knowledge can likewise be leveraged to
set up DL training tasks.
```{figure} resources/supervised-training.jpg
@@ -56,8 +58,24 @@ A visual overview of supervised training. Quite simple overall, but it's good to
in mind in comparison to the more complex variants we'll encounter later on.
```
## Surrogate models
One of the central advantages of the supervised approach above is that
we obtain a _surrogate_ for the model $\mathcal{P}$. The numerical approximations
of PDE models for real world phenomena are often very expensive to compute. A trained
NN on the other hand incurs a constant cost per evaluation, and is typically trivial
to evaluate on specialized hardware such as GPUs or NN units.
Despite this, it's important to be careful:
NNs can quickly generate huge numbers of intermediate results. Consider a CNN layer with
$128$ features. If we apply it to an input of $128^2$, i.e., ca. 16k cells, we get $128^3$ intermediate values.
That's more than 2 million.
All these values at least need to be momentarily stored in memory, and processed by the next layer.
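The arithmetic behind these numbers, as a quick sanity check (the fp32 byte count is an addition of mine, not from the text):

```python
features = 128             # feature maps in the CNN layer
cells = 128 * 128          # input resolution, ca. 16k cells
values = features * cells  # intermediate values produced: 128**3, > 2 million

# at 4 bytes per float32 value, a single layer's output already occupies 8 MiB
mib = values * 4 / 2**20
```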
Nonetheless, replacing complex and expensive solvers with fast, learned approximations
is a very attractive and interesting direction.
## Show me some code!
Let's directly look at an example for this: we'll replace a full solver for
_turbulent flows around airfoils_ with a surrogate model (from {cite}`thuerey2020dfp`).