update supervised chapter

This commit is contained in:
NT
2021-05-16 11:29:51 +08:00
parent 05d2783759
commit cd3de70540
4 changed files with 885 additions and 998 deletions

@@ -2,7 +2,7 @@ Discussion of Supervised Approaches
=======================
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
@@ -23,8 +23,19 @@ on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.
Hence: **always** start with a 1-sample overfitting test,
and then increase the complexity of the setup.
```{admonition} Best practices 👑
:class: tip
To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
- Always start with a 1-sample overfitting test.
- Check how many trainable parameters your network has.
- Slowly increase the amount of training data (and potentially network parameters and depth).
- Adjust hyperparameters (especially the learning rate).
- Then introduce other components such as differentiable solvers or adversarial training.
```
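To make the first two rules concrete, here's a minimal sketch of a 1-sample overfitting test with a tiny two-layer network in plain NumPy (the sizes, learning rate, and step count are illustrative, not tied to the airfoil setup); it also counts the trainable parameters as a sanity check:

```python
import numpy as np

# Illustrative 1-sample overfitting test: a tiny two-layer net
# should drive the loss on a single example to (near) zero.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # one input sample
y_true = rng.standard_normal(2)     # its target

W1, b1 = rng.standard_normal((8, 4)) * 0.5, np.zeros(8)
W2, b2 = rng.standard_normal((2, 8)) * 0.5, np.zeros(2)
n_params = sum(p.size for p in (W1, b1, W2, b2))   # trainable-parameter count

lr = 0.05
for step in range(2000):
    h = np.tanh(W1 @ x + b1)        # forward pass
    y = W2 @ h + b2
    err = y - y_true                # dL/dy for L = 0.5*|y - y_true|^2
    # backward pass, written out by hand for this tiny net
    gW2, gb2 = np.outer(err, h), err
    gz = (W2.T @ err) * (1.0 - h**2)
    gW1, gb1 = np.outer(gz, x), gz
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

loss = 0.5 * np.sum((W2 @ np.tanh(W1 @ x + b1) + b2 - y_true) ** 2)
# if this loss does not become very small, something is wrong
# with the setup before any real data enters the picture
```

If the loss stays large here, debugging the network, optimizer, or data pipeline at this stage is far easier than after scaling up.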
### Stability
@@ -65,27 +76,33 @@ because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants.
To give a more specific example: if you always train
your networks for inputs in the range $[0\dots1]$, don't expect it to work
with inputs of $[27\dots39]$. In certain cases it's valid to normalize
inputs and outputs by subtracting the mean and dividing by the standard
deviation or a suitable quantile (make sure this doesn't destroy important
correlations in your data).
As a rule of thumb: make sure you actually train the NN on inputs
that are as similar as possible to those you want to use at inference time.
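A minimal sketch of such a mean/std normalization (the data here is made up for illustration, standing in for inputs in a range like $[27\dots39]$; a quantile-based scale would follow the same pattern):

```python
import numpy as np

# Illustrative mean/std normalization of training inputs;
# the same transform must be applied at inference time.
rng = np.random.default_rng(1)
train_x = rng.normal(loc=33.0, scale=3.0, size=(256, 4))  # made-up raw inputs

mu, sigma = train_x.mean(axis=0), train_x.std(axis=0)     # per-feature stats

def normalize(x):
    return (x - mu) / sigma          # zero mean, unit variance per feature

def denormalize(x_n):
    return x_n * sigma + mu          # invert after evaluating the network

x_n = normalize(train_x)             # this is what the NN should see
```

The key point is that `mu` and `sigma` are computed once from the training set and reused verbatim for every later input, so training and inference distributions match.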
This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process, otherwise
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., it will suffer from _distribution shift_.
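To make the extrapolation problem tangible, here's a tiny stand-in experiment where a least-squares polynomial plays the role of the trained network (all functions and ranges are made up for illustration): it matches the data well inside the sampled range, but degrades sharply on shifted inputs.

```python
import numpy as np

# Stand-in for distribution shift: a polynomial "network" fitted on
# inputs in [0,1], then queried outside the training distribution.
rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 1.0, 200)
y_train = np.sin(2.0 * np.pi * x_train)          # the "solution" to learn

coeffs = np.polyfit(x_train, y_train, deg=5)     # least-squares fit

x_in = np.linspace(0.2, 0.8, 50)                 # in-distribution queries
x_out = np.linspace(1.5, 2.0, 50)                # shifted queries

err_in = np.abs(np.polyval(coeffs, x_in) - np.sin(2 * np.pi * x_in)).mean()
err_out = np.abs(np.polyval(coeffs, x_out) - np.sin(2 * np.pi * x_out)).mean()
# err_out is far larger than err_in: the fit is only reliable
# on the region that was actually sampled at training time
```

An NN behaves analogously: bringing the solver into the training loop keeps the inputs it sees at training time consistent with those produced at inference time.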
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for other types of convolutions, e.g., a less regular mesh
in conjunction with graph-convolutions, or particle-based data
with continuous convolutions (cf. {doc}`others-lagrangian`). You will typically see reduced learning
performance in exchange for improved sampling flexibility when switching to these.
Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions,
@@ -108,8 +125,9 @@ To summarize, supervised training has the following properties.
❌ Con:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.
Outlook: any interactions with external "processes" (such as embedding into a solver) are tricky with supervised training.
The next chapters will explain how to alleviate these shortcomings of supervised training.
First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.