update supervised chapter

2021-05-16 11:29:51 +08:00
parent 05d2783759
commit cd3de70540
4 changed files with 885 additions and 998 deletions
--- a/supervised.md
+++ b/supervised.md
@@ -2,41 +2,43 @@ Supervised Training
 =======================

 _Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of 
-deep learning (DL), of course, so it's still fairly new. Also, "old fashioned" doesn't 
-always mean bad - it's just that later on we'll be able to do better than with a simple supervised training.
+deep learning (DL), of course, so it's still fairly new. 
+Also, "old fashioned" doesn't always mean bad - it's just that later on we'll discuss ways to train networks that clearly outperform approaches using supervised training.

-In a way, the viewpoint of "supervised training" is a starting point for all projects one would encounter in the context of DL, and
-hence is worth studying. While it typically yields inferior results to approaches that more tightly 
-couple with physics, it nonetheless can be the only choice in certain application scenarios where no good
+Nonetheless, "supervised training" is a starting point for all projects one would encounter in the context of DL, and
+hence it is worth studying. Also, while it typically yields inferior results to approaches that more tightly 
+couple with physics, it can be the only choice in certain application scenarios where no good
 model equations exist.

 ## Problem setting

 For supervised training, we're faced with an 
 unknown function $f^*(x)=y^*$, collect lots of pairs of data $[x_0,y^*_0], \dots, [x_n,y^*_n]$ (the training data set)
-and directly train a NN to represent an approximation of $f^*$ denoted as $f$, such
-that $f(x)=y \approx y^*$.
+and directly train a NN to represent an approximation of $f^*$ denoted as $f$.

-The $f$ we can obtain is typically not exact, 
+The $f$ we can obtain in this way is typically not exact, 
 but instead we obtain it via a minimization problem:
-by adjusting weights $\theta$ of our representation with $f$ such that
+by adjusting the weights $\theta$ of our NN representation of $f$ such that

-$\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2$.
+$$
+\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
+$$ (supervised-training)

-This will give us $\theta$ such that $f(x;\theta) \approx y$ as accurately as possible given
+This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
 our choice of $f$ and the hyperparameters for training. Note that above we've assumed 
 the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
 to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
-of a suitable metric is topic we will get back to later on.
+of a suitable metric is a topic we will get back to later on.

 Irrespective of our choice of metric, this formulation
 gives the actual "learning" process for a supervised approach.
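
To make the minimization above concrete, here is a minimal sketch in plain Python. It is not the chapter's implementation: the target `f_star`, the tiny polynomial model standing in for a NN, and all hyperparameters are assumptions chosen only for illustration. It collects pairs $[x_i, y^*_i]$, then adjusts $\theta$ by gradient descent to reduce the $L^2$ loss.

```python
import math
import random


def f_star(x):
    # Hypothetical "unknown" function, used here only to generate training data.
    return math.sin(x)


def features(x):
    # Fixed polynomial features; a deliberately tiny stand-in for a NN.
    return [1.0, x, x ** 2, x ** 3]


def f(x, theta):
    # Model f(x; theta): a weighted sum of the features.
    return sum(t * p for t, p in zip(theta, features(x)))


def l2_loss(theta, data):
    # sum_i (f(x_i; theta) - y*_i)^2, as in the formulation above.
    return sum((f(x, theta) - y_star) ** 2 for x, y_star in data)


def train(data, steps=2000, lr=0.1):
    # Plain gradient descent. The gradient is averaged over the data set,
    # which rescales the loss but leaves its minimizer unchanged.
    theta = [0.0] * 4
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * 4
        for x, y_star in data:
            residual = f(x, theta) - y_star
            for k, p in enumerate(features(x)):
                grad[k] += 2.0 * residual * p / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


random.seed(0)
# Training data set: pairs [x_i, y*_i] sampled from the target function.
data = [(x, f_star(x)) for x in (random.uniform(-1.0, 1.0) for _ in range(64))]

theta0 = [0.0] * 4
theta = train(data)
print("loss before:", l2_loss(theta0, data))
print("loss after: ", l2_loss(theta, data))
```

Swapping `l2_loss` for a different error metric $e(x,y)$ would only change the residual term inside `train`; the overall loop stays the same.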

 The training data typically needs to be of substantial size, and hence it is attractive 
-to use numerical simulations to produce a large number of training input-output pairs.
+to use numerical simulations solving a physical model $\mathcal{P}$ 
+to produce a large number of reliable input-output pairs for training.
 This means that the training process uses a set of model equations, and approximates
 them numerically, in order to train the NN representation $f$. This
-has a bunch of advantages, e.g., we don't have measurement noise of real-world devices
+has quite a few advantages, e.g., we don't have measurement noise of real-world devices
 and we don't need manual labour to annotate a large number of samples to get training data.

 On the other hand, this approach inherits the common challenges of replacing experiments
@@ -44,7 +46,7 @@ with simulations: first, we need to ensure the chosen model has enough power to
 behavior of real-world phenomena that we're interested in.
 In addition, the numerical approximations have numerical errors
 which need to be kept small enough for a chosen application. As these topics are studied in depth
-for classical simulations, the existing knowledge can likewise be leveraged to
+for classical simulations, this existing knowledge can likewise be leveraged to
 set up DL training tasks.

 ```{figure} resources/supervised-training.jpg
@@ -56,8 +58,24 @@ A visual overview of supervised training. Quite simple overall, but it's good to
 in mind in comparison to the more complex variants we'll encounter later on.
 ```

+## Surrogate models
+
+One of the central advantages of the supervised approach above is that
+we obtain a _surrogate_ for the model $\mathcal{P}$. The numerical approximations
+of PDE models for real-world phenomena are often very expensive to compute. A trained
+NN, on the other hand, incurs a constant cost per evaluation, and is typically trivial
+to evaluate on specialized hardware such as GPUs or NN units.
+
+Despite this, it's important to be careful:
+NNs can quickly generate huge numbers of intermediate results. Consider a CNN layer with
+$128$ features. If we apply it to an input of $128^2$, i.e. ca. 16k cells, we get $128^3$ intermediate values.
+That's more than 2 million.
+All these values at least need to be momentarily stored in memory, and processed by the next layer.
+
+Nonetheless, replacing complex and expensive solvers with fast, learned approximations
+is a very attractive and interesting direction.
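
The count of intermediate values mentioned above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the layer keeps the $128 \times 128$ resolution (stride 1 with padding) and stores each value as a 4-byte float, neither of which is stated in the text.

```python
def activation_count(height, width, channels):
    # Number of values a conv layer outputs when it preserves the
    # spatial resolution (assumed: stride 1 with suitable padding).
    return height * width * channels


def activation_bytes(height, width, channels, bytes_per_value=4):
    # Memory for one set of activations; 4 bytes assumes float32 storage.
    return activation_count(height, width, channels) * bytes_per_value


cells = 128 * 128                          # input cells
values = activation_count(128, 128, 128)   # intermediate values
mib = activation_bytes(128, 128, 128) / 2 ** 20

print(cells, values, mib)
```

Under these assumptions a single layer already holds about 2.1 million values, i.e. 8 MiB per sample, before counting batches or the activations retained for backpropagation.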
+
 ## Show me some code!

-Let's directly look at an implementation within a more complicated context:
-_turbulent flows around airfoils_ from {cite}`thuerey2020deepFlowPred`.
-
+Let's directly look at an example for this: we'll replace a full solver for
+_turbulent flows around airfoils_ with a surrogate model (from {cite}`thuerey2020dfp`).