
# Discussion of Supervised Approaches

The previous example illustrates that we can quite easily use supervised training to solve complex tasks. The main workload is collecting a large enough dataset of examples. Once that exists, we can train a network to approximate the solution manifold sampled by these solutions, and the trained network can give us predictions very quickly. There are a few important points to keep in mind when using supervised training.

## Some things to keep in mind...

### Natural starting point

Supervised training is the natural starting point for any DL project. It always, and we really mean always here, makes sense to start with a fully supervised test using as little data as possible. This will be a pure overfitting test, but if your network can't quickly converge and give a very good performance on a single example, then there's something fundamentally wrong with your code or data. Thus, there's no reason to move on to more complex setups that will make finding these fundamental problems more difficult.

Hence: always start with a 1-sample overfitting test, and then increase the complexity of the setup.
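As a concrete illustration, here is a minimal sketch of such a 1-sample overfitting test in PyTorch. The tensor shapes and the small MLP are placeholders, not the airfoil setup from the previous section:

```python
import torch

# one hypothetical input/target pair; shapes are placeholders
x = torch.rand(1, 16)
y = torch.rand(1, 4)

# a deliberately small network, just enough capacity to memorize one sample
model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for i in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

# the loss should drop by several orders of magnitude; if it stalls early,
# something in the data handling or model setup is broken
print(f"final single-sample loss: {loss.item():.2e}")
```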

### Stability

A nice property of supervised training is also that it's very stable. Things won't get any better when we include more complex physical models, or look at more complicated NN architectures.

Thus, again, make sure you can see a nice exponential falloff in your training loss when starting with the simple overfitting tests. This is a good setup to figure out an upper bound and reasonable range for the learning rate as the most central hyperparameter. You'll probably need to reduce it later on, but you should at least get a rough estimate of suitable values for $\eta$.
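Building on the 1-sample test sketched above, a quick sweep over learning rates gives a rough estimate of a suitable range for $\eta$. The helper below is purely illustrative and reuses the `x`, `y` pair and architecture from the previous sketch:

```python
def overfit_loss(lr, steps=1000):
    # re-create the small model from the previous sketch for each run
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 64), torch.nn.ReLU(),
        torch.nn.Linear(64, 4))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"lr={lr:.0e}  final loss={overfit_loss(lr):.2e}")

# the largest lr that still shows a clean, steady decrease is a good upper
# bound; start the full training run somewhat below it
```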

### Where's the magic? 🦄

A comment that you'll often hear when talking about DL approaches, and especially when using relatively simple training methodologies is: "Isn't it just interpolating the data?"

Well, yes it is! And that's exactly what the NN should do. In a way, there isn't anything else to do. This is what all DL approaches are about. They give us smooth representations of the data seen at training time. Even if we'll use fancy physical models at training time later on, the NNs just adjust their weights to represent the signals they receive, and reproduce them.

Due to the hype and numerous success stories, people not familiar with DL often have the impression that DL works like a human mind, and is able to detect fundamental and general principles in data sets ("messages from god" anyone?). That's not what happens with the current state of the art. Nonetheless, it's the most powerful tool we have to approximate complex, non-linear functions. It is a great tool, but it's important to keep in mind that once we set up the training correctly, all we'll get out of it is an approximation of the function the NN was trained for. No magic involved.

An implication of this is that you shouldn't expect the network to work on data it has never seen. In a way, the NNs are so good exactly because they can accurately adapt to the signals they receive at training time, but in contrast to other learned representations, they're actually not very good at extrapolation. So we can't expect an NN to magically work with new inputs. Rather, we need to make sure that we can properly shape the input space, e.g., by normalization and by focusing on invariants. In short, if you always train your networks for inputs in the range $[0 \dots 1]$, don't expect it to work with inputs of $[10 \dots 11]$. You might be able to subtract an offset of 10 beforehand, and re-apply it after evaluating the network. As a rule of thumb: always make sure you actually train the NN on the kinds of input you want to use at inference time.
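To illustrate the offset example, here is a hedged sketch of a small wrapper. The `ShiftedModel` class and the offset value are purely illustrative; it shifts inputs back into the trained range before evaluation and re-applies the offset afterwards:

```python
import torch

class ShiftedModel(torch.nn.Module):
    """Map inputs from [offset, offset+1] back to the trained range [0, 1]."""
    def __init__(self, net, offset):
        super().__init__()
        self.net = net
        self.offset = offset

    def forward(self, x):
        # subtract the offset before evaluating, re-apply it afterwards
        return self.net(x - self.offset) + self.offset

net = torch.nn.Linear(4, 4)              # stand-in for a network trained on [0,1] inputs
model = ShiftedModel(net, offset=10.0)
out = model(torch.full((1, 4), 10.5))    # inputs from [10,11] are handled via the shift
```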

This is important to keep in mind during the next chapters: e.g., if we want an NN to work in conjunction with another solver or simulation environment, it's important to actually bring the solver into the training process. Otherwise the network might specialize on pre-computed data that differs from what is produced when combining the NN with the solver, i.e., it would suffer from distribution shift.

### Meshes and grids

The previous airfoil example used Cartesian grids with standard convolutions. These typically give the most bang-for-the-buck, in terms of performance and stability. Nonetheless, the whole discussion here of course also holds for less regular convolutions, e.g., a less regular mesh in conjunction with graph convolutions. You will typically see reduced learning performance in exchange for improved stability when switching to these.

Finally, a word on fully-connected layers or MLPs in general: we'd recommend avoiding these as much as possible. For any structured data, like spatial functions, or field data in general, convolutions are preferable, and less likely to overfit. E.g., you'll notice that CNNs typically don't need dropout, as they're nicely regularized by construction. For MLPs, you typically need quite a bit of it to avoid overfitting.
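As a rough illustration of why convolutions are less prone to overfitting, compare the parameter counts of a small CNN and an MLP for a hypothetical 32x32 scalar field. The layer sizes below are chosen purely for illustration:

```python
import torch

# convolutions share weights across space, so the parameter count stays small
cnn = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, kernel_size=3, padding=1))

# a fully-connected counterpart needs far more weights and explicit dropout
mlp = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(32 * 32, 256), torch.nn.ReLU(),
    torch.nn.Dropout(0.5),
    torch.nn.Linear(256, 32 * 32))

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"CNN parameters: {count(cnn)}, MLP parameters: {count(mlp)}")  # ~300 vs. ~500k
```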


## Supervised Training in a nutshell

To summarize, supervised training has the following properties.

Pros:
- Very fast training.
- Stable and simple.
- Great starting point.

Cons:
- Lots of data needed.
- Sub-optimal performance, accuracy and generalization.

Outlook: any interactions with external "processes" (such as embedding into a solver) are tricky with supervised training. First, we'll look at bringing model equations into the picture via soft-constraints, and afterwards we'll revisit the challenges of bringing together numerical simulations and learned approaches.