update supervised chapter

This commit is contained in:
NT
2021-05-16 11:29:51 +08:00
parent 05d2783759
commit cd3de70540
4 changed files with 885 additions and 998 deletions


@@ -11,6 +11,9 @@ fte = re.compile(r"👋")
# TODO , replace phi symbol w text in phiflow
# TODO , filter tensorflow warnings?
# also torch "UserWarning:"
path = "tmp2.txt" # simple
path = "tmp.txt" # utf8
#path = "book.tex-in.bak" # full utf8

File diff suppressed because one or more lines are too long


@@ -2,7 +2,7 @@ Discussion of Supervised Approaches
=======================
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
@@ -23,8 +23,19 @@ on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.
```{admonition} Best practices 👑
:class: tip
To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
- Always start with a 1-sample overfitting test.
- Check how many trainable parameters your network has.
- Slowly increase the amount of training data (and potentially network parameters and depth).
- Adjust hyperparameters (especially the learning rate).
- Then introduce other components such as differentiable solvers or adversarial training.
```
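As a minimal sketch of the first two rules, here's what a 1-sample overfitting test can look like in plain NumPy (all shapes, the learning rate, and the iteration count are made-up illustration values, not the book's setup): train a tiny two-layer network on a single input-output pair and check that the loss actually approaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)         # one input sample
y_star = rng.standard_normal(4)    # its reference output

W1 = rng.standard_normal((16, 8)) * 0.1
W2 = rng.standard_normal((4, 16)) * 0.1
n_params = W1.size + W2.size       # rule 2: know your parameter count (192 here)

lr = 0.05
for _ in range(2000):
    h = np.tanh(W1 @ x)            # forward pass through the hidden layer
    err = W2 @ h - y_star          # residual of the L2 loss
    g2 = np.outer(err, h)          # gradient w.r.t. W2
    dh = (W2.T @ err) * (1 - h**2) # backprop through tanh
    W2 -= lr * g2
    W1 -= lr * np.outer(dh, x)

loss = float(np.sum((W2 @ np.tanh(W1 @ x) - y_star) ** 2))
```

If `loss` does not end up near zero on a single sample, something is fundamentally wrong with the code, and there's no point in moving on to a larger dataset.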
### Stability
@@ -65,27 +76,33 @@ because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs. at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space, Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants.
To give a more specific example: if you always train
your networks for inputs in the range $[0\dots1]$, don't expect it to work
with inputs of $[27\dots39]$. In certain cases it's valid to normalize
inputs and outputs by subtracting the mean and dividing by the standard
deviation or a suitable quantile (make sure this doesn't destroy important
correlations in your data).
As a rule of thumb: make sure you actually train the NN on
inputs that are as similar as possible to those you want to use at inference time.
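A minimal sketch of this kind of mean/std normalization (with hypothetical data in the $[27\dots39]$ range from the example above): the key point is that the statistics come from the training data only, and exactly the same transform is re-applied to any inference-time input.

```python
import numpy as np

# hypothetical training inputs in the "wrong" range [27...39]
rng = np.random.default_rng(1)
train_x = rng.uniform(27.0, 39.0, size=(1000, 4))

# statistics computed once, from the training set only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

def normalize(x):
    """Map inputs to roughly zero mean and unit variance."""
    return (x - mean) / std

# the same transform is reused at inference time
x_n = normalize(train_x)
```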
This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process, otherwise
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., it will suffer from _distribution shift_.
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for other types of convolutions, e.g., a less regular mesh
in conjunction with graph-convolutions, or particle-based data
with continuous convolutions (cf. {doc}`others-lagrangian`). You will typically see reduced learning
performance in exchange for improved sampling flexibility when switching to these.
Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions,
@@ -108,8 +125,9 @@ To summarize, supervised training has the following properties.
❌ Con:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.
- Interactions with external "processes" (such as embedding into a solver) are difficult.
The next chapters will explain how to alleviate these shortcomings of supervised training.
First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.


@@ -2,41 +2,43 @@ Supervised Training
=======================
_Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of
deep learning (DL), of course, so it's still fairly new.
Also, "old fashioned" doesn't always mean bad - it's just that later on we'll discuss ways to train networks that clearly outperform approaches using supervised training.

Nonetheless, "supervised training" is a starting point for all projects one would encounter in the context of DL, and
hence it is worth studying. Also, while it typically yields inferior results to approaches that more tightly
couple with physics, it can be the only choice in certain application scenarios where no good
model equations exist.
## Problem setting
For supervised training, we're faced with an
unknown function $f^*(x)=y^*$, collect lots of pairs of data $[x_0,y^*_0], ...[x_n,y^*_n]$ (the training data set)
and directly train a NN to represent an approximation of $f^*$ denoted as $f$.
The $f$ we can obtain in this way is typically not exact,
but instead we obtain it via a minimization problem:
by adjusting the weights $\theta$ of our NN representation of $f$ such that

$$
\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
$$ (supervised-training)

This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
our choice of $f$ and the hyperparameters for training. Note that above we've assumed
the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
of a suitable metric is a topic we will get back to later on.
Irrespective of our choice of metric, this formulation
gives the actual "learning" process for a supervised approach.
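To make the notation concrete, here's a minimal sketch of this minimization for a toy linear "network" $f(x;\theta) = \theta \cdot x$ with the $L^2$ metric as the error $e$ (the data, learning rate, and iteration count are made up for illustration; a real setup would use an NN and an optimizer library):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0, 0.5])   # stands in for the unknown f*
X = rng.standard_normal((100, 3))         # inputs x_i
Y = X @ theta_true                        # reference outputs y*_i

def e(y, y_star):
    """Error metric: here the simple squared L2 distance."""
    return np.sum((y - y_star) ** 2)

def objective(theta):
    """sum_i e(f(x_i; theta), y*_i) for the linear model f(x) = theta . x"""
    return sum(e(X[i] @ theta, Y[i]) for i in range(len(X)))

# "learning": plain gradient descent on the weights theta
theta = np.zeros(3)
for _ in range(500):
    grad = 2.0 * X.T @ (X @ theta - Y)
    theta -= 0.005 * grad

final_loss = objective(theta)
```

Swapping `e` for a different metric changes what "as accurately as possible" means, but the structure of the minimization stays the same.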
The training data typically needs to be of substantial size, and hence it is attractive
to use numerical simulations solving a physical model $\mathcal{P}$
to produce a large number of reliable input-output pairs for training.
This means that the training process uses a set of model equations, and approximates
them numerically, in order to train the NN representation $\tilde{f}$. This
has quite a few advantages, e.g., we don't have measurement noise of real-world devices
and we don't need manual labour to annotate a large number of samples to get training data.
On the other hand, this approach inherits the common challenges of replacing experiments
@@ -44,7 +46,7 @@ with simulations: first, we need to ensure the chosen model has enough power to
behavior of real-world phenomena that we're interested in.
In addition, the numerical approximations have numerical errors
which need to be kept small enough for a chosen application. As these topics are studied in depth
for classical simulations, the existing knowledge can likewise be leveraged to
set up DL training tasks.
```{figure} resources/supervised-training.jpg
@@ -56,8 +58,24 @@ A visual overview of supervised training. Quite simple overall, but it's good to
in mind in comparison to the more complex variants we'll encounter later on.
```
## Surrogate models
One of the central advantages of the supervised approach above is that
we obtain a _surrogate_ for the model $\mathcal{P}$. The numerical approximations
of PDE models for real world phenomena are often very expensive to compute. A trained
NN on the other hand incurs a constant cost per evaluation, and is typically trivial
to evaluate on specialized hardware such as GPUs or NN units.
Despite this, it's important to be careful:
NNs can quickly generate huge numbers of intermediate results. Consider a CNN layer with
$128$ features. If we apply it to an input of $128^2$, i.e., ca. 16k cells, we get $128^3$ intermediate values.
That's more than 2 million.
All these values at least need to be momentarily stored in memory, and processed by the next layer.
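The arithmetic behind these numbers, as a quick sanity check (the fp32 byte count is an addition of mine, not from the text):

```python
features = 128             # feature maps in the CNN layer
cells = 128 * 128          # input resolution, ca. 16k cells
values = features * cells  # intermediate values produced: 128**3, > 2 million

# at 4 bytes per float32 value, a single layer's output already occupies 8 MiB
mib = values * 4 / 2**20
```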
Nonetheless, replacing complex and expensive solvers with fast, learned approximations
is a very attractive and interesting direction.
## Show me some code!
Let's directly look at an example for this: we'll replace a full solver for
_turbulent flows around airfoils_ with a surrogate model (from {cite}`thuerey2020dfp`).