Discussion of Supervised Approaches
=======================

The previous example illustrates that we can quite easily use
supervised training to solve rather complex tasks. The main workload is
collecting a large enough dataset of examples. Once that exists, we can
train a network to approximate the solution manifold sampled
by these examples, and the trained network can give us predictions
very quickly. There are a few important points to keep in mind when
using supervised training.

## Some things to keep in mind...

### Natural starting point

_Supervised training_ is the natural starting point for **any** DL project. It always,
and we really mean **always** here, makes sense to start with a fully supervised
test using as little data as possible. This will be a pure overfitting test,
but if your network can't quickly converge and give very good performance
on a single example, then there's something fundamentally wrong
with your code or data. Thus, there's no reason to move on to more complex
setups that will make finding these fundamental problems more difficult.

Hence: **always** start with a 1-sample overfitting test,
and then increase the complexity of the setup.

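To make this concrete, here's a minimal sketch of such a 1-sample overfitting test. It assumes PyTorch; the toy `SmallNet` architecture, the tensor shapes and the learning rate are illustrative placeholders rather than part of the airfoil setup. The point is simply that the loss on the single sample should drop by several orders of magnitude within a few hundred iterations.

```python
import torch
import torch.nn as nn

# A deliberately small stand-in network; swap in your actual architecture.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.layers(x)

# A single input/target pair, standing in for one pre-computed solution.
x = torch.rand(1, 3, 32, 32)
y = torch.rand(1, 3, 32, 32)

net = SmallNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Overfit to the single sample; the loss should fall by several orders of
# magnitude. If it doesn't, something more fundamental in the model, the
# data, or the training loop is broken.
for it in range(500):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
    if it % 100 == 0:
        print(f"iteration {it}, loss {loss.item():.2e}")
```

Only once this trivially small setup converges cleanly is it worth scaling up to the full dataset and more elaborate architectures.
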
### Stability

A nice property of supervised training is also that it's very stable.
Things won't get any better in this respect when we include more complex physical
models, or look at more complicated NN architectures.

Thus, again, make sure you can see a nice exponential falloff in your training
loss when starting with the simple overfitting tests. This is a good
setup to figure out an upper bound and a reasonable range for the learning rate,
the most central hyperparameter.
You'll probably need to reduce it later on, but you should at least get a
rough estimate of suitable values for $\eta$.

### Where's the magic? 🦄

A comment that you'll often hear when talking about DL approaches, and especially
when using relatively simple training methodologies, is: "Isn't it just interpolating the data?"

Well, **yes** it is! And that's exactly what the NN should do. In a way - there isn't
anything else to do. This is what _all_ DL approaches are about. They give us smooth
representations of the data seen at training time. Even if we'll use fancy physical
models at training time later on, the NNs just adjust their weights to represent the signals
they receive, and reproduce them.

Due to the hype and numerous success stories, people not familiar with DL often have
the impression that DL works like a human mind, and is able to detect fundamental
and general principles in data sets (["messages from god"](https://dilbert.com/strip/2000-01-03) anyone?).
That's not what happens with the current state of the art. Nonetheless, it's
the most powerful tool we have to approximate complex, non-linear functions.
It is a great tool, but it's important to keep in mind that once we set up the training
correctly, all we'll get out of it is an approximation of the function the NN
was trained for - no magic involved.

An implication of this is that you shouldn't expect the network
to work on data it has never seen. In a way, NNs are so good exactly
because they can accurately adapt to the signals they receive at training time,
but in contrast to other learned representations, they're actually not very good
at extrapolation. So we can't expect an NN to magically work with new inputs.
Rather, we need to make sure that we can properly shape the input space,
e.g., by normalization and by focusing on invariants. In short, if you always train
your networks for inputs in the range $[0\dots1]$, don't expect them to work
with inputs of $[10\dots11]$. You might be able to subtract an offset of $10$ beforehand,
and re-apply it after evaluating the network.
As a rule of thumb: always make sure you
actually train the NN on the kinds of input you want to use at inference time.

This is important to keep in mind during the next chapters: e.g., if we
want an NN to work in conjunction with another solver or simulation environment,
it's important to actually bring the solver into the training process. Otherwise
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., a _distribution shift_.

### Meshes and grids

The previous airfoil example uses Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for less regular discretizations, e.g., a less regular mesh
in conjunction with graph convolutions. You will typically see reduced learning
performance in exchange for the increased flexibility of the discretization when switching to these.

Finally, a word on fully-connected layers, or _MLPs_ in general: we'd recommend
avoiding these as much as possible. For any structured data, like spatial functions
or _field data_ in general, convolutions are preferable and less likely to overfit.
E.g., you'll notice that CNNs typically don't need dropout, as they're nicely
regularized by construction. For MLPs, you typically need quite a bit of it to
avoid overfitting.

---

## Supervised Training in a nutshell

To summarize:

✅ Pros:
- very fast training
- stable and simple
- great starting point

❌ Cons:
- lots of data needed
- sub-optimal performance, accuracy and generalization

Outlook: interactions with external "processes" (such as embedding into a solver) are tricky with supervised training.
First, we'll look at bringing model equations into the picture via soft constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.