update supervised chapter

NT 2021-05-16 11:29:51 +08:00
parent 05d2783759
commit cd3de70540
4 changed files with 885 additions and 998 deletions


@@ -11,6 +11,9 @@ fte = re.compile(r"👋")
# TODO: replace phi symbol with text in phiflow
# TODO: filter tensorflow warnings?
# also torch "UserWarning:"
path = "tmp2.txt" # simple
path = "tmp.txt" # utf8
#path = "book.tex-in.bak" # full utf8

File diff suppressed because one or more lines are too long


@@ -2,7 +2,7 @@ Discussion of Supervised Approaches
=======================
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
@@ -23,8 +23,19 @@ on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.
Hence: **always** start with a 1-sample overfitting test,
and then increase the complexity of the setup.
```{admonition} Best practices 👑
:class: tip
To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
- Always start with a 1-sample overfitting test (see the sketch below).
- Check how many trainable parameters your network has.
- Slowly increase the amount of training data (and potentially network parameters and depth).
- Adjust hyperparameters (especially the learning rate).
- Then introduce other components such as differentiable solvers or adversarial training.
```
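To make the first two rules concrete, here's a minimal sketch of a 1-sample overfitting test. PyTorch, the tiny convolutional network and the synthetic sample are assumptions for illustration only; in practice, use one input/target pair from your actual dataset and your actual architecture.
```python
import torch

# Hypothetical stand-in for a single training sample (e.g. a small flow field).
x = torch.randn(1, 3, 32, 32)            # one input: 3 channels on a 32x32 grid
y = x.mean(dim=1, keepdim=True)          # placeholder target the tiny net can actually fit

# Tiny placeholder network; substitute your real architecture here.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1),
)

# Golden rule 2: check how many trainable parameters the network has.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {num_params}")

# Golden rule 1: overfit to the single sample; the loss should drop to (almost) zero.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print(f"final 1-sample loss: {loss.item():.2e}")
```
If this loss does not shrink towards zero, the problem lies in the setup itself, not in a lack of data.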
### Stability
@@ -65,27 +76,33 @@ because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants.
To give a more specific example: if you always train
your networks for inputs in the range $[0\dots1]$, don't expect it to work
with inputs of $[27\dots39]$. In certain cases it's valid to normalize
inputs and outputs by subtracting the mean and scaling by the standard
deviation or a suitable quantile (make sure this doesn't destroy important
correlations in your data).
As a rule of thumb: make sure you actually train the NN on
inputs that are as similar as possible to those you want to use at inference time.
This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process, otherwise
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e. it will suffer from _distribution shift_.
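To make the normalization advice above concrete, here's a minimal NumPy sketch; the array shapes and value ranges are made up for illustration. The key point is that the statistics are computed once on the training data and re-used unchanged at inference time.
```python
import numpy as np

# Hypothetical training inputs, roughly in the range [27, 39] as in the example above.
x_train = 27.0 + 12.0 * np.random.rand(1000, 64)

# Compute the statistics on the training set only ...
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8          # small epsilon avoids division by zero

def normalize(x):
    return (x - mean) / std               # network sees roughly zero-mean, unit-variance data

def denormalize(x):
    return x * std + mean                 # map network outputs back to physical units

# ... and apply exactly the same transformation to any inference-time inputs.
x_new = 27.0 + 12.0 * np.random.rand(5, 64)
x_new_normalized = normalize(x_new)
```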
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for other types of convolutions, e.g., a less regular mesh
in conjunction with graph-convolutions, or particle-based data
with continuous convolutions (cf. {doc}`others-lagrangian`). You will typically see reduced learning
performance in exchange for improved sampling flexibility when switching to these.
Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions,
@@ -108,8 +125,9 @@ To summarize, supervised training has the following properties.
❌ Con:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.
Outlook: any interactions with external "processes" (such as embedding into a solver) are tricky with supervised training.
The next chapters will explain how to alleviate these shortcomings of supervised training.
First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.


@@ -2,41 +2,43 @@ Supervised Training
=======================
_Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of
deep learning (DL), of course, so it's still fairly new.
Also, "old fashioned" doesn't always mean bad - it's just that later on we'll discuss ways to train networks that clearly outperform approaches using supervised training.
Nonetheless, "supervised training" is a starting point for all projects one would encounter in the context of DL, and
hence it is worth studying. Also, while it typically yields inferior results to approaches that more tightly
couple with physics, it can be the only choice in certain application scenarios where no good
model equations exist.
## Problem setting
For supervised training, we're faced with an
unknown function $f^*(x)=y^*$. We collect lots of pairs of data $[x_0,y^*_0], \dots, [x_n,y^*_n]$ (the training data set)
and directly train an NN to represent an approximation of $f^*$ denoted as $f$.
The $f$ we can obtain in this way is typically not exact,
but instead we obtain it via a minimization problem:
by adjusting the weights $\theta$ of our NN representation of $f$ such that
$$
\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
$$ (supervised-training)
This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
our choice of $f$ and the hyperparameters for training. Note that above we've assumed
the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
of a suitable metric is a topic we will get back to later on.
Irrespective of our choice of metric, this formulation
gives the actual "learning" process for a supervised approach.
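In code, the minimization above simply becomes a gradient-descent loop over the training pairs. Below is a minimal sketch; PyTorch, the small MLP standing in for $f$, and the random data are all assumptions for illustration.
```python
import torch

# Stand-ins for the quantities above: f is an NN with weights theta,
# (x_i, y*_i) are the training pairs, and e(.,.) is the error metric.
f = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x = torch.randn(256, 10)        # inputs x_i (random placeholders)
y_star = torch.randn(256, 1)    # targets y*_i

def e(y, y_target):             # error metric; here the L2 loss from the text
    return ((y - y_target) ** 2).sum()

opt = torch.optim.SGD(f.parameters(), lr=1e-4)
for _ in range(100):
    opt.zero_grad()
    loss = e(f(x), y_star)      # sum_i e(f(x_i; theta), y*_i)
    loss.backward()             # gradients w.r.t. the weights theta
    opt.step()                  # one step towards the arg min
```
Any differentiable metric $e$ can be substituted here without changing the structure of the loop.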
The training data typically needs to be of substantial size, and hence it is attractive
to use numerical simulations solving a physical model $\mathcal{P}$
to produce a large number of reliable input-output pairs for training.
This means that the training process uses a set of model equations, and approximates
them numerically, in order to train the NN representation $f$. This
has quite a few advantages, e.g., we don't have measurement noise of real-world devices
and we don't need manual labour to annotate a large number of samples to get training data.
On the other hand, this approach inherits the common challenges of replacing experiments
@@ -44,7 +46,7 @@ with simulations: first, we need to ensure the chosen model has enough power to capture the
behavior of real-world phenomena that we're interested in.
In addition, the numerical approximations have numerical errors
which need to be kept small enough for a chosen application. As these topics are studied in depth
for classical simulations, the existing knowledge can likewise be leveraged to
set up DL training tasks.
```{figure} resources/supervised-training.jpg
@@ -56,8 +58,24 @@ A visual overview of supervised training. Quite simple overall, but it's good to keep this
in mind in comparison to the more complex variants we'll encounter later on.
```
## Surrogate models
One of the central advantages of the supervised approach above is that
we obtain a _surrogate_ for the model $\mathcal{P}$. The numerical approximations
of PDE models for real world phenomena are often very expensive to compute. A trained
NN on the other hand incurs a constant cost per evaluation, and is typically trivial
to evaluate on specialized hardware such as GPUs or NN units.
Despite this, it's important to be careful:
NNs can quickly generate huge numbers of in-between results. Consider a CNN layer with
$128$ features. If we apply it to an input of $128^2$, i.e. ca. 16k cells, we get $128^3$ intermediate values.
That's more than 2 million.
All these values at least need to be momentarily stored in memory, and processed by the next layer.
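As a quick sanity check of these numbers, here's a short sketch; PyTorch is an assumption, and the layer sizes are simply taken from the example above.
```python
import torch

# A single convolutional layer with 128 output features, applied to a 128x128 input.
layer = torch.nn.Conv2d(in_channels=1, out_channels=128, kernel_size=3, padding=1)
x = torch.zeros(1, 1, 128, 128)                    # ca. 16k input cells

out = layer(x)
print(out.numel())                                 # 128*128*128 = 2,097,152 intermediate values
print(out.numel() * 4 / 2**20, "MiB at float32")   # ~8 MiB that the next layer has to process
```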
Nonetheless, replacing complex and expensive solvers with fast, learned approximations
is a very attractive and interesting direction.
## Show me some code!
Let's directly look at an example for this: we'll replace a full solver for
_turbulent flows around airfoils_ with a surrogate model (from {cite}`thuerey2020dfp`).