Discussion of Supervised Approaches
=======================
The previous example illustrates that we can quite easily use
supervised training to solve complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these solutions, and the trained network can give us predictions
very quickly. There are a few important points to keep in mind when
using supervised training.
![Divider](resources/divider1.jpg)
## Some things to keep in mind...
### Natural starting point
_Supervised training_ is the natural starting point for **any** DL project. It always,
and we really mean **always** here, makes sense to start with a fully supervised
test using as little data as possible. This will be a pure overfitting test,
but if your network can't quickly converge and give very good performance
on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.
Hence: **always** start with a 1-sample overfitting test,
and then increase the complexity of the setup.
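
As a small illustration of what such a test can look like, here's a minimal sketch assuming PyTorch; the tiny network, the randomly generated single sample, and all hyperparameters below are placeholders rather than the book's airfoil setup. The single-sample loss should fall quickly and keep decreasing towards zero:

```python
import torch
import torch.nn as nn

# A single training sample (hypothetical 3-channel 32x32 fields);
# the target is a smoothed version of the input, which the net can easily fit.
x = torch.rand(1, 3, 32, 32)
y = nn.functional.avg_pool2d(x, 3, stride=1, padding=1)

# Small placeholder network; swap in whatever architecture you actually plan to use.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for it in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()
    if it % 100 == 0:
        print(it, loss.item())

# If this loss doesn't quickly become very small, something is wrong with
# the code or the data pipeline - fix that before scaling up the setup.
```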
### Stability
A nice property of supervised training is also that it's very stable. Things won't
get any more stable when we include more complex physical models, or look at more
complicated NN architectures.
Thus, again, make sure you can see a nice exponential falloff in your training
loss when starting with the simple overfitting tests. This is a good
setup to figure out an upper bound and a reasonable range for the learning rate,
the most central hyperparameter.
You'll probably need to reduce it later on, but you should at least get a
rough estimate of suitable values for $\eta$.
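
One simple way to get this estimate is to repeat the 1-sample overfitting test for a few candidate learning rates and compare the resulting losses. The sketch below reuses the placeholder setup from above and is, again, only an illustration rather than code from the book:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 3, 32, 32)
y = nn.functional.avg_pool2d(x, 3, stride=1, padding=1)  # smooth, learnable target

def overfit_loss(lr, iters=300):
    """Train a fresh copy of the small test network on the single sample."""
    torch.manual_seed(0)  # identical initialization for every candidate lr
    net = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Candidate learning rates spanning several orders of magnitude.
for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"eta={lr:g}  final loss={overfit_loss(lr):.2e}")
# The largest eta that still reaches a clearly small final loss gives a rough
# upper bound; the working range for later training typically lies below it.
```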
### Where's the magic? 🦄
A comment that you'll often hear when talking about DL approaches, and especially
when using relatively simple training methodologies, is: "Isn't it just interpolating the data?"
Well, **yes** it is! And that's exactly what the NN should do. In a way, there isn't
anything else to do. This is what _all_ DL approaches are about. They give us smooth
representations of the data seen at training time. Even if we'll use fancy physical
models at training time later on, the NNs just adjust their weights to represent the signals
they receive, and reproduce them.

Due to the hype and numerous success stories, people not familiar with DL often have
the impression that DL works like a human mind, and is able to detect fundamental
and general principles in data sets (["messages from god"](https://dilbert.com/strip/2000-01-03) anyone?).
That's not what happens with the current state of the art. Nonetheless, it's
the most powerful tool we have to approximate complex, non-linear functions.
It is a great tool, but it's important to keep in mind that once we set up the training
correctly, all we'll get out of it is an approximation of the function the NN
was trained for - no magic involved.
An implication of this is that you shouldn't expect the network
to work on data it has never seen. In a way, the NNs are so good exactly
because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants. In short, if you always train
your networks for inputs in the range $[0\dots1]$, don't expect them to work
with inputs in the range $[27\dots39]$. You might be able to shift and rescale such
inputs into the training range beforehand, and undo the transformation after
evaluating the network.

As a rule of thumb: always make sure you
actually train the NN on the kinds of input you want to use at inference time.
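
As a minimal sketch of this idea, assuming we simply store per-component min/max statistics of the training data (all names and shapes here are illustrative and not taken from the book's code):

```python
import numpy as np

# Hypothetical raw training inputs, living roughly in [27, 39] per component.
x_train = 27.0 + 12.0 * np.random.rand(1000, 4)

# Store these statistics alongside the trained network.
x_min, x_max = x_train.min(axis=0), x_train.max(axis=0)

def normalize(x):
    """Map raw inputs into the [0, 1] range seen at training time."""
    return (x - x_min) / (x_max - x_min)

def denormalize(y_norm, y_min, y_max):
    """Undo the analogous transformation for the network outputs."""
    return y_norm * (y_max - y_min) + y_min

# At inference time: y_norm = net(normalize(x_new)), followed by
# denormalize(y_norm, y_min, y_max) using the training-set output statistics.
```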
This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process. Otherwise,
the network might specialize to pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., we end up with a _distribution shift_.
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for less regular discretizations, e.g., an irregular mesh
in conjunction with graph convolutions. You will typically see reduced learning
performance in exchange for improved stability when switching to these.

Finally, a word on fully-connected layers, or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions,
or _field data_ in general, convolutions are preferable, and less likely to overfit.
E.g., you'll notice that CNNs typically don't need dropout, as they're nicely
regularized by construction. For MLPs, you typically need quite a bit of dropout to
avoid overfitting.
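
To make the contrast concrete, here's a small sketch in PyTorch for a hypothetical single-channel $64\times64$ field (all layer sizes are arbitrary placeholders): the fully-convolutional variant gets by with roughly ten thousand weights and no dropout, while the flattened MLP variant has millions of weights and relies on dropout to curb overfitting.

```python
import torch.nn as nn

# Fully-convolutional option for 1-channel 64x64 fields: weight sharing acts
# as built-in regularization, so dropout is usually unnecessary.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

# MLP option on the flattened field: the parameter count explodes with the
# resolution, and dropout is typically needed to curb overfitting.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 64 * 64),
    nn.Unflatten(1, (1, 64, 64)),
)

print(sum(p.numel() for p in cnn.parameters()))  # roughly 1e4 weights
print(sum(p.numel() for p in mlp.parameters()))  # roughly 4.5e6 weights
```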
![Divider](resources/divider2.jpg)
## Supervised training in a nutshell
To summarize, supervised training has the following properties.
✅ Pros:
- Very fast training.
- Stable and simple.
- Great starting point.
❌ Cons:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.
Outlook: any interactions with external "processes" (such as embedding into a solver) are tricky with supervised training.
First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.