Discussion of Supervised Approaches
=======================
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
very quickly. There are a few important points to keep in mind when
using supervised training.
![Divider](resources/divider1.jpg)
## Some things to keep in mind...
### Natural starting point
_Supervised training_ is the natural starting point for **any** DL project. It always,
and we really mean **always** here, makes sense to start with a fully supervised
test using as little data as possible. This will be a pure overfitting test,
but if your network can't quickly converge and give a very good performance
on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.
```{admonition} Best practices 👑
:class: tip
To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
- Always start with a 1-sample overfitting test.
- Check how many trainable parameters your network has.
- Slowly increase the amount of training data (and potentially network parameters and depth).
- Adjust hyperparameters (especially the learning rate).
- Then introduce other components such as differentiable solvers or adversarial training.
```
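As a rough sketch of what such a 1-sample overfitting test can look like, assuming PyTorch and a made-up toy network (all shapes, sizes and step counts below are placeholder assumptions, not part of any particular project):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8)   # a single input sample
y = torch.randn(1, 2)   # its target

# Hypothetical small network; your actual architecture goes here.
net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
print("trainable parameters:", sum(p.numel() for p in net.parameters()))

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

# The loss should quickly drop to (almost) zero; if it doesn't, something
# is fundamentally wrong with the code or the data pipeline.
print(f"final loss: {loss.item():.2e}")
```

If this test passes, you can start growing the data set and the network from a known-good baseline.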
### Stability
A nice property of supervised training is also that it's very stable.
Things won't get any better in terms of stability when we include more complex physical
models, or look at more complicated NN architectures.
Thus, again, make sure you can see a nice exponential falloff in your training
loss when starting with the simple overfitting tests. This is a good
setup to figure out an upper bound and reasonable range for the learning rate
as the most central hyperparameter.
You'll probably need to reduce it later on, but you should at least get a
rough estimate of suitable values for $\eta$.
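One way to get such a rough estimate is a toy sweep over a few learning rates on a single-sample fit, comparing the final losses (the network, rates and step counts below are placeholder assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(1, 8), torch.randn(1, 2)

results = {}
for lr in [1.0, 1e-2, 1e-4]:
    torch.manual_seed(1)  # identical initialization for a fair comparison
    net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    results[lr] = loss.item()
    print(f"lr={lr:.0e}  final loss={results[lr]:.2e}")

# Rates that diverge (growing loss or NaN) are above the upper bound; the
# largest rate with a clean exponential falloff is a good starting point.
```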
### Know your data
All data-driven methods obey the _garbage-in-garbage-out_ principle. Because of this, it's important
to get to know the data you are dealing with. While there's no one-size-fits-all
approach for how best to achieve this, we strongly recommend tracking
a broad range of statistics of your data set. A good starting point are
the per-quantity mean, standard deviation, and min/max values.
If some of these contain unusual values, this is a first indicator of bad
samples in the dataset.
These values can
also be easily visualized in terms of histograms, to track down
unwanted outliers. A small number of such outliers
can easily skew a data set in undesirable ways.
Finally, checking the relationships between different quantities
is often a good idea to get some intuition for what's contained in the
data set. The next figure gives an example for this step.
```{figure} resources/supervised-example-plot.jpg
---
height: 300px
name: supervised-example-plot
---
An example from the airfoil case of the previous section: a visualization of a training data
set in terms of mean u and v velocity of 2D flow fields. It nicely shows that there are no extreme outliers,
but there are a few entries with relatively low mean u velocity on the left side.
A second, smaller data set is shown on top in red, showing that its samples cover the range of mean motions quite well.
```
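A minimal sketch of such a check, assuming the data set is simply a NumPy array of samples (all names, shapes and distributions here are made up for illustration):

```python
import numpy as np

# Hypothetical data set: 1000 samples of a flattened velocity component.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.5, scale=2.0, size=(1000, 64))

# Per-quantity summary statistics as a first sanity check.
print("mean:", data.mean(), " std:", data.std())
print("min: ", data.min(), " max:", data.max())

# Histogram of per-sample means to track down unwanted outliers;
# isolated counts far away from the bulk deserve a closer look.
counts, edges = np.histogram(data.mean(axis=1), bins=20)
print(counts)
```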
### Where's the magic? 🦄
A comment that you'll often hear when talking about DL approaches, and especially
when using relatively simple training methodologies is: "Isn't it just interpolating the data?"
Well, **yes** it is! And that's exactly what the NN should do. In a way - there isn't
anything else to do. This is what _all_ DL approaches are about. They give us smooth
representations of the data seen at training time. Even if we'll use fancy physical
models at training time later on, the NNs just adjust their weights to represent the signals
they receive, and reproduce them.
Due to the hype and numerous success stories, people not familiar with DL often have
the impression that DL works like a human mind, and is able to detect fundamental
and general principles in data sets (["messages from god"](https://dilbert.com/strip/2000-01-03) anyone?).
That's not what happens with the current state of the art. Nonetheless, it's
the most powerful tool we have to approximate complex, non-linear functions.
It is a great tool, but it's important to keep in mind that once we set up the training
correctly, all we'll get out of it is an approximation of the function the NN
was trained for - no magic involved.
An implication of this is that you shouldn't expect the network
to work on data it has never seen. In a way, the NNs are so good exactly
because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants.
To give a more specific example: if you always train
your networks for inputs in the range $[0\dots1]$, don't expect them to work
with inputs of $[27\dots39]$. In certain cases it's valid to normalize
inputs and outputs by subtracting the mean and dividing by the standard
deviation or a suitable quantile (make sure this doesn't destroy important
correlations in your data).
As a rule of thumb: make sure you actually train the NN on the
inputs that are as similar as possible to those you want to use at inference time.
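As a small illustration of this rule of thumb, here's a sketch of mean/standard-deviation normalization where the training statistics are reused at inference time (the shapes and the $[27\dots39]$ range are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw inputs far from [0...1], e.g. in the range [27...39].
train_x = rng.uniform(27.0, 39.0, size=(1000, 4))

# Compute normalization statistics on the training set only...
mu, sigma = train_x.mean(axis=0), train_x.std(axis=0)
train_n = (train_x - mu) / sigma

# ...and reuse exactly the same mu and sigma for new inputs at inference
# time, so they land in the distribution the NN was trained on.
test_x = rng.uniform(27.0, 39.0, size=(10, 4))
test_n = (test_x - mu) / sigma

print("normalized train mean:", train_n.mean(axis=0))
print("normalized train std: ", train_n.std(axis=0))
```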
This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process, otherwise
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., it will suffer from _distribution shift_.
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for other types of convolutions, e.g., a less regular mesh
in conjunction with graph-convolutions, or particle-based data
with continuous convolutions (cf. {doc}`others-lagrangian`). You will typically see reduced learning
performance in exchange for improved sampling flexibility when switching to these.
Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions,
or _field data_ in general, convolutions are preferable, and less likely to overfit.
E.g., you'll notice that CNNs typically don't need dropout, as they're nicely
regularized by construction. For MLPs, you typically need quite a bit of dropout to
avoid overfitting.
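To make the contrast concrete, here's a small sketch comparing trainable parameter counts for a 64×64 field (a hypothetical size chosen just for illustration):

```python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# A 5x5 convolution covers the whole 64x64 field with shared weights...
conv = nn.Conv2d(1, 1, kernel_size=5, padding=2)
# ...while a dense layer mapping the flattened field to itself does not.
dense = nn.Linear(64 * 64, 64 * 64)

print("conv params: ", count_params(conv))    # 5*5 weights + 1 bias = 26
print("dense params:", count_params(dense))   # 4096*4096 + 4096 = 16781312
```

The weight sharing of the convolution is exactly the built-in regularization mentioned above: vastly fewer free parameters for the same spatial coverage.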
![Divider](resources/divider2.jpg)
## Supervised training in a nutshell
To summarize, supervised training has the following properties.
✅ Pros:
- Very fast training.
- Stable and simple.
- Great starting point.
❌ Cons:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.
- Interactions with external "processes" (such as embedding into a solver) are difficult.
The next chapters will explain how to alleviate these shortcomings of supervised training.
First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.