update supervised chapter
parent 05d2783759, commit cd3de70540
@@ -11,6 +11,9 @@ fte = re.compile(r"👋")
# TODO , replace phi symbol w text in phiflow

# TODO , filter tensorflow warnings?
# also torch "UserWarning:"

path = "tmp2.txt" # simple
path = "tmp.txt" # utf8
#path = "book.tex-in.bak" # full utf8
File diff suppressed because one or more lines are too long
@@ -2,7 +2,7 @@ Discussion of Supervised Approaches
 =======================
 
 The previous example illustrates that we can quite easily use
-supervised training to solve quite complex tasks. The main workload is
+supervised training to solve complex tasks. The main workload is
 collecting a large enough dataset of examples. Once that exists, we can
 train a network to approximate the solution manifold sampled
 by these solutions, and the trained network can give us predictions
@@ -23,8 +23,19 @@ on a single example, then there's something fundamentally wrong
 with your code or data. Thus, there's no reason to move on to more complex
 setups that will make finding these fundamental problems more difficult.
 
 Hence: **always** start with a 1-sample overfitting test,
 and then increase the complexity of the setup.
+```{admonition} Best practices 👑
+:class: tip
+
+To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
+
+- Always start with a 1-sample overfitting test.
+- Check how many trainable parameters your network has.
+- Slowly increase the amount of training data (and potentially network parameters and depth).
+- Adjust hyperparameters (especially the learning rate).
+- Then introduce other components such as differentiable solvers or adversarial training.
+```
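A 1-sample overfitting test like the one recommended above can be sketched in a few lines. This is only an illustrative stand-in, not the book's actual setup: a hypothetical four-parameter tanh "network" with hand-coded gradients, trained on a single made-up sample.

```python
import math

# One training sample -- a tiny "network" must be able to drive the loss to ~0.
x_s, y_s = 0.5, 1.3

# Minimal model with a single tanh unit: f(x) = w2 * tanh(w1*x + b1) + b2.
w1, b1, w2, b2 = 0.7, 0.0, -0.4, 0.0
num_params = 4  # golden rule: know how many trainable parameters you have

lr = 0.1
for step in range(2000):
    z = w1 * x_s + b1
    h = math.tanh(z)
    f = w2 * h + b2
    e = f - y_s                  # residual of the L2 loss (f - y*)^2
    g = 1.0 - h * h              # tanh'(z)
    # backprop by hand: d(loss)/d(param) = 2*e * d(f)/d(param)
    dw2, db2 = 2 * e * h, 2 * e
    dw1, db1 = 2 * e * w2 * g * x_s, 2 * e * w2 * g
    w1 -= lr * dw1; b1 -= lr * db1
    w2 -= lr * dw2; b2 -= lr * db2

loss = (w2 * math.tanh(w1 * x_s + b1) + b2 - y_s) ** 2
print("final 1-sample loss:", loss)
```

If the final loss does not approach zero in a setup like this, something is fundamentally wrong with the code or data, and there is no point in scaling up.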
 
 ### Stability
||||
@@ -65,27 +76,33 @@ because they can accurately adapt to the signals they receive at training time,
 but in contrast to other learned representations, they're actually not very good
 at extrapolation. So we can't expect an NN to magically work with new inputs.
 Rather, we need to make sure that we can properly shape the input space,
-e.g., by normalization and by focusing on invariants. In short, if you always train
+e.g., by normalization and by focusing on invariants.
+
+To give a more specific example: if you always train
 your networks for inputs in the range $[0\dots1]$, don't expect it to work
-with inputs of $[27\dots39]$. You might be able to subtract an offset of $10$ beforehand,
-and re-apply it after evaluating the network.
-As a rule of thumb: always make sure you
-actually train the NN on the kinds of input you want to use at inference time.
+with inputs of $[27\dots39]$. In certain cases it's valid to normalize
+inputs and outputs by subtracting the mean, and normalize via the standard
+deviation or a suitable quantile (make sure this doesn't destroy important
+correlations in your data).
+
+As a rule of thumb: make sure you actually train the NN on the
+inputs that are as similar as possible to those you want to use at inference time.
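The mean/standard-deviation normalization mentioned above might look as follows. A plain-Python sketch with made-up inputs in the $[27\dots39]$ range; only the stdlib `statistics` module is used.

```python
import statistics

# Hypothetical raw inputs living far from the training range [0...1].
x_raw = [27.0, 29.5, 31.0, 33.5, 36.0, 39.0]

mu = statistics.mean(x_raw)
sigma = statistics.pstdev(x_raw)   # population standard deviation

# Normalize before training; store (mu, sigma) to apply identically at inference.
x_norm = [(x - mu) / sigma for x in x_raw]

# Un-normalizing recovers the original values, so no information is lost.
x_back = [x * sigma + mu for x in x_norm]
```

The key point is that the same `(mu, sigma)` statistics, computed on the training data, are re-used at inference time, so train and test inputs land in the same range.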
 
 This is important to keep in mind during the next chapters: e.g., if we
 want an NN to work in conjunction with another solver or simulation environment,
 it's important to actually bring the solver into the training process, otherwise
 the network might specialize on pre-computed data that differs from what is produced
-when combining the NN with the solver, i.e _distribution shift_.
+when combining the NN with the solver, i.e. it will suffer from _distribution shift_.
 
 ### Meshes and grids
 
-The previous airfoil example use Cartesian grids with standard
+The previous airfoil example used Cartesian grids with standard
 convolutions. These typically give the most _bang-for-the-buck_, in terms
 of performance and stability. Nonetheless, the whole discussion here of course
-also holds for less regular convolutions, e.g., a less regular mesh
-in conjunction with graph-convolutions. You will typically see reduced learning
-performance in exchange for improved stability when switching to these.
+also holds for other types of convolutions, e.g., a less regular mesh
+in conjunction with graph-convolutions, or particle-based data
+with continuous convolutions (cf. {doc}`others-lagrangian`). You will typically see reduced learning
+performance in exchange for improved sampling flexibility when switching to these.
 
 Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
 to avoid these as much as possible. For any structured data, like spatial functions,
|
||||
@@ -108,8 +125,9 @@ To summarize, supervised training has the following properties.
 ❌ Con:
 - Lots of data needed.
 - Sub-optimal performance, accuracy and generalization.
+- Interactions with external "processes" (such as embedding into a solver) are difficult.
 
-Outlook: any interactions with external "processes" (such as embedding into a solver) are tricky with supervised training.
 The next chapters will explain how to alleviate these shortcomings of supervised training.
 First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
 we'll revisit the challenges of bringing together numerical simulations and learned approaches.
@@ -2,41 +2,43 @@ Supervised Training
 =======================
 
 _Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of
-deep learning (DL), of course, so it's still fairly new. Also, "old fashioned" doesn't
-always mean bad - it's just that later on we'll be able to do better than with a simple supervised training.
+deep learning (DL), of course, so it's still fairly new.
+Also, "old fashioned" doesn't always mean bad - it's just that later on we'll discuss ways to train networks that clearly outperform approaches using supervised training.
 
-In a way, the viewpoint of "supervised training" is a starting point for all projects one would encounter in the context of DL, and
-hence is worth studying. While it typically yields inferior results to approaches that more tightly
-couple with physics, it nonetheless can be the only choice in certain application scenarios where no good
+Nonetheless, "supervised training" is a starting point for all projects one would encounter in the context of DL, and
+hence it is worth studying. Also, while it typically yields inferior results to approaches that more tightly
+couple with physics, it can be the only choice in certain application scenarios where no good
 model equations exist.
 
 ## Problem setting
 
 For supervised training, we're faced with an
 unknown function $f^*(x)=y^*$, collect lots of pairs of data $[x_0,y^*_0], ...[x_n,y^*_n]$ (the training data set)
-and directly train a NN to represent an approximation of $f^*$ denoted as $f$, such
-that $f(x)=y \approx y^*$.
+and directly train a NN to represent an approximation of $f^*$ denoted as $f$.
 
-The $f$ we can obtain is typically not exact,
+The $f$ we can obtain in this way is typically not exact,
 but instead we obtain it via a minimization problem:
-by adjusting weights $\theta$ of our representation with $f$ such that
+by adjusting the weights $\theta$ of our NN representation of $f$ such that
 
-$\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2$.
+$$
+\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y^*_i)^2 .
+$$ (supervised-training)
 
-This will give us $\theta$ such that $f(x;\theta) \approx y$ as accurately as possible given
+This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
 our choice of $f$ and the hyperparameters for training. Note that above we've assumed
 the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
 to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
-of a suitable metric is topic we will get back to later on.
+of a suitable metric is a topic we will get back to later on.
 
 Irrespective of our choice of metric, this formulation
 gives the actual "learning" process for a supervised approach.
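Written out as code, this minimization might look like the following sketch: a linear model $f(x;\theta)=\theta_1 x+\theta_0$ fitted by plain gradient descent on the $L^2$ loss. The data and the linear model are illustrative stand-ins for an actual dataset and NN.

```python
# Training pairs [x_i, y*_i], here sampled from the "unknown" f*(x) = 2x + 1.
data = [(0.0, 1.0), (0.25, 1.5), (0.5, 2.0), (0.75, 2.5), (1.0, 3.0)]

theta0, theta1 = 0.0, 0.0   # the weights "theta" to be adjusted
lr = 0.1

for step in range(2000):
    # Gradient of sum_i (f(x_i; theta) - y*_i)^2 with respect to theta.
    g0 = sum(2 * (theta1 * x + theta0 - y) for x, y in data)
    g1 = sum(2 * (theta1 * x + theta0 - y) * x for x, y in data)
    theta0 -= lr * g0
    theta1 -= lr * g1

loss = sum((theta1 * x + theta0 - y) ** 2 for x, y in data)
print("loss:", loss, "theta:", theta0, theta1)
```

For an NN, the closed-form gradients above are replaced by backpropagation, but the objective being minimized is exactly the sum of squared errors from the equation above.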
 
 The training data typically needs to be of substantial size, and hence it is attractive
-to use numerical simulations to produce a large number of training input-output pairs.
+to use numerical simulations solving a physical model $\mathcal{P}$
+to produce a large number of reliable input-output pairs for training.
 This means that the training process uses a set of model equations, and approximates
 them numerically, in order to train the NN representation $\tilde{f}$. This
-has a bunch of advantages, e.g., we don't have measurement noise of real-world devices
+has quite a few advantages, e.g., we don't have measurement noise of real-world devices
 and we don't need manual labour to annotate a large number of samples to get training data.
 
 On the other hand, this approach inherits the common challenges of replacing experiments
||||
@@ -44,7 +46,7 @@ with simulations: first, we need to ensure the chosen model has enough power to
 behavior of real-world phenomena that we're interested in.
 In addition, the numerical approximations have numerical errors
 which need to be kept small enough for a chosen application. As these topics are studied in depth
 for classical simulations, the existing knowledge can likewise be leveraged to
 set up DL training tasks.
 
 ```{figure} resources/supervised-training.jpg
||||
@@ -56,8 +58,24 @@ A visual overview of supervised training. Quite simple overall, but it's good to
 in mind in comparison to the more complex variants we'll encounter later on.
 ```
 
+## Surrogate models
+
+One of the central advantages of the supervised approach above is that
+we obtain a _surrogate_ for the model $\mathcal{P}$. The numerical approximations
+of PDE models for real world phenomena are often very expensive to compute. A trained
+NN on the other hand incurs a constant cost per evaluation, and is typically trivial
+to evaluate on specialized hardware such as GPUs or NN units.
+
+Despite this, it's important to be careful:
+NNs can quickly generate huge numbers of in-between results. Consider a CNN layer with
+$128$ features. If we apply it to an input of $128^2$, i.e. ca. 16k cells, we get $128^3$ intermediate values.
+That's more than 2 million.
+All these values at least need to be momentarily stored in memory, and processed by the next layer.
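The back-of-the-envelope arithmetic behind these numbers, assuming 4-byte float32 activations (the layer sizes are the ones from the paragraph above):

```python
features = 128                 # channels of the CNN layer
cells = 128 ** 2               # ca. 16k cells in the input
values = features * cells      # intermediate values produced by one layer

bytes_per_float = 4            # assuming float32 activations
megabytes = values * bytes_per_float / 1024 ** 2

print(values)     # 128**3 = 2097152, i.e. more than 2 million
print(megabytes)  # 8.0 MB for a single layer's activations
```

And this is for a single layer and a single input sample; with deep networks and mini-batches, activation memory multiplies accordingly.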
+
+Nonetheless, replacing complex and expensive solvers with fast, learned approximations
+is a very attractive and interesting direction.
+
 ## Show me some code!
 
-Let's directly look at an implementation within a more complicated context:
-_turbulent flows around airfoils_ from {cite}`thuerey2020deepFlowPred`.
+Let's directly look at an example for this: we'll replace a full solver for
+_turbulent flows around airfoils_ with a surrogate model (from {cite}`thuerey2020dfp`).