diff --git a/_toc.yml b/_toc.yml index aef17f2..0c35cc2 100644 --- a/_toc.yml +++ b/_toc.yml @@ -1,9 +1,9 @@ -# Table of content -# Learn more at https://jupyterbook.org/customize/toc.html +# PBDL Table of content (cf https://jupyterbook.org/customize/toc.html) # - file: intro - file: overview.md sections: + - file: overview-equations.md - file: overview-burgers-forw.ipynb - file: overview-ns-forw.ipynb - file: supervised @@ -12,6 +12,7 @@ - file: physicalloss sections: - file: physicalloss-code.ipynb + - file: physicalloss-discuss.md - file: diffphys sections: - file: diffphys-code-gradient.ipynb @@ -23,3 +24,4 @@ - file: markdown - file: notebooks - file: references +- file: notation diff --git a/diffphys-discuss.md b/diffphys-discuss.md index d795c78..2ee58db 100644 --- a/diffphys-discuss.md +++ b/diffphys-discuss.md @@ -29,7 +29,7 @@ For the PINN representation with fully-connected networks on the other hand, we The following table summarizes these findings: -| Method | Pro | Con | +| Method | ✅ Pro | ❌ Con | |----------|-------------|------------| | **PINN** | - Analytic derivatives via back-propagation | - Expensive evaluation of NN, as well as derivative calculations | | | - Simple to implement | - Incompatible with existing numerical methods | diff --git a/intro.md b/intro.md index fcf54be..4566b5f 100644 --- a/intro.md +++ b/intro.md @@ -73,9 +73,15 @@ The contents of the following files would not have been possible without the hel - Ms. y - ... -% tests... + + + +% some markdown tests follow ... + +--- + a b c ```{admonition} My title2 @@ -86,6 +92,7 @@ See also... Test link: {doc}`supervised` ✅ Do this , ❌ Don't do this % ---------------- + --- @@ -152,6 +159,6 @@ time series, sequence prediction?] {cite}`wiewel2019lss,bkim2019deep,wiewel2020l _Misc jupyter book TODOs_ - Fix latex PDF output -- How to include links in references? +- How to include links to papers in the bibtex references? diff --git a/jupyter-book-reference.md b/jupyter-book-reference.md index 44b45b2..a793d0d 100644 --- a/jupyter-book-reference.md +++ b/jupyter-book-reference.md @@ -1,5 +1,8 @@ -Jupyter Book Reference Stuff +Old Jupyter Book Reference Stuff ======================= There are many ways to write content in Jupyter Book. This short section covers a few tips for how to do so. + +TODO remove sometime... 
+ diff --git a/notation.md b/notation.md new file mode 100644 index 0000000..305c893 --- /dev/null +++ b/notation.md @@ -0,0 +1,38 @@ + +# Notation and Abbreviations + +## Math notation: + +| Symbol | Meaning | +| --- | --- | +| $A$ | matrix | +| $\eta$ | learning rate or step size | +| $\Gamma$ | boundary of computational domain $\Omega$ | +| $f()$ | approximated version of $f^{*}$ | +| $f^{*}()$ | generic function to be approximated, typically unknown | +| $\Omega$ | computational domain | +| $\mathcal P$ | physical model, PDE | +| $\theta$ | neural network params | +| $t$ | time dimension | +| $\mathbf{u}$ | vector-valued velocity | +| $x$ | neural network input or spatial coordinate | +| $y$ | neural network output | + +## Summary of the most important abbreviations: + +| Abbreviation | Meaning | +| --- | --- | +| CNN | Convolutional neural network | +| DL | Deep learning | +| NN | Neural network | +| PBDL | Physics-based deep learning | + + + +% test table formatting in markdown +% | | Sentence # | Word | POS | Tag | +% |---:|:-------------|:-----------|:------|:------| +% | 1 | Sentence: 1 | They | PRP | O | +% | 2 | Sentence: 1 | marched | VBD | O | + + diff --git a/overview-equations.md b/overview-equations.md new file mode 100644 index 0000000..631d4c4 --- /dev/null +++ b/overview-equations.md @@ -0,0 +1,138 @@ +Model Equations +============================ + +overview of PDE models to be used later on ... + +domain $\Omega$, boundary $\Gamma$ + +continuous functions, but few assumptions about continuity for now... + +```{admonition} Notation and abbreviations +:class: seealso +If unsure, please check the summary of our mathematical notation +and the abbreviations used in {doc}`notation`, at the bottom of the left panel. +``` + +% \newcommand{\pde}{\mathcal{P}} % PDE ops +% \newcommand{\pdec}{\pde_{s}} +% \newcommand{\manifsrc}{\mathscr{S}} % coarse / "source" +% \newcommand{\pder}{\pde_{R}} +% \newcommand{\manifref}{\mathscr{R}} + +% vc - coarse solutions +% \renewcommand{\vc}[1]{\vs_{#1}} % plain coarse state at time t +% \newcommand{\vcN}{\vs} % plain coarse state without time +% vc - coarse solutions, modified by correction +% \newcommand{\vct}[1]{\tilde{\vs}_{#1}} % modified / over time at time t +% \newcommand{\vctN}{\tilde{\vs}} % modified / over time without time +% vr - fine/reference solutions +% \renewcommand{\vr}[1]{\mathbf{r}_{#1}} % fine / reference state at time t , never modified +% \newcommand{\vrN}{\mathbf{r}} % plain coarse state without time + +% \newcommand{\project}{\mathcal{T}} % transfer operator fine <> coarse +% \newcommand{\loss}{\mathcal{L}} % generic loss function +% \newcommand{\nn}{f_{\theta}} +% \newcommand{\dt}{\Delta t} % timestep +% \newcommand{\corrPre}{\mathcal{C}_{\text{pre}}} % analytic correction , "pre computed" +% \newcommand{\corr}{\mathcal{C}} % just C for now... +% \newcommand{\nnfunc}{F} % {\text{NN}} + + +Some notation from SoL, move with parts from overview into "appendix"? + + + +We typically solve a discretized PDE $\mathcal{P}$ by performing discrete time steps of size $\Delta t$. +Each subsequent step can depend on any number of previous steps, +$\mathbf{u}(\mathbf{x},t+\Delta t) = \mathcal{P}(\mathbf{u}(\mathbf{x},t), \mathbf{u}(\mathbf{x},t-\Delta t),...)$, +where +$\mathbf{x} \in \Omega \subseteq \mathbb{R}^d$ for the domain $\Omega$ in $d$ +dimensions, and $t \in \mathbb{R}^{+}$. + +Numerical methods yield approximations of a smooth function such as $\mathbf{u}$ in a discrete +setting and invariably introduce errors.
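As a quick illustration of this generic time stepping (a minimal sketch with made-up constants, assuming a simple 1D diffusion equation rather than one of the specific PDEs listed below), each call to the discretized operator advances the state by $\Delta t$ and introduces a small truncation error:

```python
import numpy as np

# Minimal sketch, assuming du/dt = nu * d2u/dx2 in 1D with periodic boundaries,
# discretized by explicit Euler in time and central differences in space.
# It only illustrates the generic update u(t + dt) = P(u(t)); the constants
# are placeholder values, not a setup used later in this book.
nu, dx = 0.1, 0.01
dt = 0.25 * dx**2 / nu              # small enough for explicit stability
x = np.arange(0.0, 1.0, dx)
u = np.sin(2.0 * np.pi * x)         # initial state u(x, t=0)

def step(u):
    """One discrete time step of the discretized operator P."""
    lap = (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2   # d2u/dx2
    return u + dt * nu * lap        # advance by dt; truncation errors accumulate

for _ in range(100):                # march forward in time
    u = step(u)
```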
These errors can be measured in terms +of the deviation from the exact analytical solution. +For discrete simulations of +PDEs, these errors are typically expressed as a function of the truncation, $O(\Delta t^k)$ +for a given step size $\Delta t$ and an exponent $k$ that is discretization dependent. + +The following PDEs typically work with a continuous +velocity field $\mathbf{u}$ with $d$ dimensions and components, i.e., +$\mathbf{u}(\mathbf{x},t): \mathbb{R}^d \rightarrow \mathbb{R}^d $. +For discretized versions below, $d_{i,j}$ will denote the dimensionality +of a field such as the velocity, +with domain size $d_{x},d_{y},d_{z}$ for source and reference in 3D. + +% with $i \in \{s,r\}$ denoting source/inference manifold and reference manifold, respectively. +%This yields $\vc{} \in \mathbb{R}^{d \times d_{s,x} \times d_{s,y} \times d_{s,z} }$ and $\vr{} \in \mathbb{R}^{d \times d_{r,x} \times d_{r,y} \times d_{r,z} }$ +%Typically, $d_{r,i} > d_{s,i}$ and $d_{z}=1$ for $d=2$. + +For all PDEs, we use non-dimensional parametrizations as outlined below, +and the components of the velocity vector are typically denoted by $x,y,z$ subscripts, i.e., +$\mathbf{u} = (u_x,u_y,u_z)^T$ for $d=3$. + +Burgers' equation in 2D. It represents a well-studied advection-diffusion PDE: + +$\frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = + \nu \nabla\cdot \nabla u_x + g_x(t), + \\ + \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = + \nu \nabla\cdot \nabla u_y + g_y(t) +$, + +where $\nu$ and $\mathbf{g}$ denote diffusion constant and external forces, respectively. + +Burgers' equation in 1D without forces with $u_x = u$: +%\begin{eqnarray} +$\frac{\partial u}{\partial{t}} + u \nabla u = \nu \nabla \cdot \nabla u $ . + +--- + +Later on, additional equations... + + + +Navier-Stokes, in 2D: + +$ + \frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = + - \frac{1}{\rho}\nabla{p} + \nu \nabla\cdot \nabla u_x + \\ + \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = + - \frac{1}{\rho}\nabla{p} + \nu \nabla\cdot \nabla u_y + \\ + \text{subject to} \quad \nabla \cdot \mathbf{u} = 0 +$ + + + +Navier-Stokes, in 2D with Boussinesq: + +%$\frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x$ +%$ -\frac{1}{\rho} \nabla p $ + +$ + \frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = - \frac{1}{\rho} \nabla p + \\ + \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = - \frac{1}{\rho} \nabla p + \eta d + \\ + \text{subject to} \quad \nabla \cdot \mathbf{u} = 0, + \\ + \frac{\partial d}{\partial{t}} + \mathbf{u} \cdot \nabla d = 0 +$ + + + +Navier-Stokes, in 3D: + +$ + \frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = - \frac{1}{\rho} \nabla p + \nu \nabla\cdot \nabla u_x + \\ + \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = - \frac{1}{\rho} \nabla p + \nu \nabla\cdot \nabla u_y + \\ + \frac{\partial u_z}{\partial{t}} + \mathbf{u} \cdot \nabla u_z = - \frac{1}{\rho} \nabla p + \nu \nabla\cdot \nabla u_z + \\ + \text{subject to} \quad \nabla \cdot \mathbf{u} = 0. +$ + + diff --git a/overview.md b/overview.md index ed9c17e..7fec37b 100644 --- a/overview.md +++ b/overview.md @@ -1,12 +1,14 @@ Overview ============================ -The following "book" of targets _"Physics-Based Deep Learning"_ techniques, -i.e., methods that combine physical modeling and numerical simulations with -deep learning (DL). Here, DL will typically refer to methods based -on artificial neural networks. 
The general direction of -Physics-Based Deep Learning represents a very -active, quickly growing and exciting field of research. +The following collection of digital documents, i.e. "book", +targets _Physics-Based Deep Learning_ techniques. +By that we mean combining physical modeling and numerical simulations with +methods based on artificial neural networks. +The general direction of Physics-Based Deep Learning represents a very +active, quickly growing and exciting field of research -- we want to provide +a starting point for new researchers as well as a hands-on introduction into +state-of-the-art research topics. ## Motivation @@ -50,8 +52,8 @@ whether key phenomena are visible in the solutions or not. :class: tip Thus, a key aspect that we want to address in the following in the following is: - explain how to use DL, -- and how to combine it with existing knowledge of physics and simulations, -- **without throwing away** all existing numerical knowledeg and techniques! +- how to combine it with existing knowledge of physics and simulations, +- **without throwing away** all existing numerical knowledge and techniques! ``` Rather, we want to build on all the neat techniques that we have @@ -112,7 +114,7 @@ starting points with code examples, and illustrate pros and cons of the different approaches. In particular, it's important to know in which scenarios each of the different techniques is particularly useful. -```{admonition} Skip ahead if... +```{admonition} You can skip ahead if... :class: tip - you're very familiar with numerical methods and PDE solvers, and want to get started with DL topics right away. The _Supervised Learning_ chapter is a good starting point then. @@ -138,37 +140,13 @@ PINNs ... and more ... ## Deep Learning and Neural Networks -Very brief intro, basic equations... approximate $f(x)=y$ with NN ... +Very brief intro, basic equations... approximate $f^*(x)=y$ with NN $f(x;\theta)$ ... -Details in [Deep Learning book](https://www.deeplearningbook.org) +learn via GD, $\partial f / \partial \theta$ +Read chapters 6 to 9 of the [Deep Learning book](https://www.deeplearningbook.org), +especially about [MLPs](https://www.deeplearningbook.org/contents/mlp.html) and +"Conv-Nets", i.e. [CNNs](https://www.deeplearningbook.org/contents/convnets.html). -## Notation and Abbreviations - -Unify notation... TODO ... - -Math notation: - -| Symbol | Meaning | -| --- | --- | -| $x$ | NN input | -| $y$ | NN output | -| $\theta$ | NN params | - -Quick summary of the most important abbreviations: - -| ABbreviation | Meaning | -| --- | --- | -| CNN | Convolutional neural network | -| DL | Deep learning | -| NN | Neural network | -| PBDL | Physics-based deep learning | - - - -test table formatting in markdown - -| | Sentence # | Word | POS | Tag | -|---:|:-------------|:-----------|:------|:------| -| 1 | Sentence: 1 | They | PRP | O | -| 2 | Sentence: 1 | marched | VBD | O | +**Note:** The classic distinction between _classification_ and _regression_ problems is not so important here; we only deal with _regression_ problems in the following. diff --git a/physicalloss-discuss.md b/physicalloss-discuss.md new file mode 100644 index 0000000..30bc44e --- /dev/null +++ b/physicalloss-discuss.md @@ -0,0 +1,37 @@ +Discussion of Physical Soft-Constraints +======================= + +The good news so far is: we have a DL method that can include +physical laws in the form of soft constraints by minimizing residuals.
+ +However, as the very simple previous example illustrates, this is just a conceptual +starting point. + +On the positive side, we can leverage DL frameworks with backpropagation to compute +the derivatives of the model. At the same time, this puts us at the mercy of the learned +representation regarding the reliability of these derivatives. Also, each derivative +requires backpropagation through the full network, which can be very slow. Especially so +for higher-order derivatives. + +And while the setup is relatively simple, it is generally difficult to control. The NN +has the flexibility to refine the solution by itself, but at the same time, tricks are necessary +when it doesn't pick the right regions of the solution. + +In general, a fundamental drawback of this approach is that it does not combine well with traditional +numerical techniques. E.g., the learned representation is not well suited to being refined with +a classical iterative solver such as the conjugate gradient method. This means many +powerful techniques that were developed in the past decades cannot be used in this context. +Bringing these numerical methods back into the picture will be one of the central +goals of the next sections. + +✅ Pro: +- uses physical model +- derivatives via backpropagation + +❌ Con: +- slow ... +- only soft constraints +- largely incompatible with _classical_ numerical methods +- derivatives rely on learned representation + +Next, let's look at how we can leverage numerical methods to improve the DL accuracy and efficiency +by making use of differentiable solvers. diff --git a/physicalloss.md b/physicalloss.md index 0d25a98..4fb9086 100644 --- a/physicalloss.md +++ b/physicalloss.md @@ -1,134 +1,98 @@ Physical Loss Terms ======================= +The supervised setting of the previous sections can quickly +yield approximate solutions with a fairly simple training process, but what's +quite sad to see here is that we only use physical models and numerics +as an "external" tool to produce a big pile of data 😢. -Using the equations now, but no numerical methods! +## Using Physical Models -Still interesting, leverages analytic derivatives of NNs, but lots of problems +We can improve this setting by trying to bring the model equations (or parts thereof) +into the training process. E.g., given a PDE for $\mathbf{u}(x,t)$ with a time evolution, +we can typically express it in terms of a function $\mathcal F$ of the derivatives +of $\mathbf{u}$ via +$ + \mathbf{u}_t = \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{x..x}) +$, +where the $_{x}$ subscripts denote spatial derivatives of higher order. + +In this context we can employ DL by approximating the unknown $\mathbf{u}$ itself +with a NN, denoted by $\tilde{\mathbf{u}}$. If the approximation is accurate, the PDE +naturally should be satisfied, i.e., the residual $R$ should be equal to zero: +$ + R = \mathbf{u}_t - \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{x..x}) = 0 +$ + +This nicely integrates with the objective for training a neural network: similar to before +we can collect sample solutions +$[x_0,y_0], ...[x_n,y_n]$ for $\mathbf{u}$ with $\mathbf{u}(x)=y$. +This is typically important, as most practical PDEs we encounter do not have unique solutions +unless initial and boundary conditions are specified. Hence, if we only consider $R$ we might +get solutions with random offset or other undesirable components. Therefore, the supervised sample points +help to _pin down_ the solution in certain places.
+ +Now our training objective becomes + +$\text{arg min}_{\theta} \ \sum_i \big( \alpha_0 (f(x_i ; \theta)-y_i)^2 + \alpha_1 R(x_i) \big) $, + +where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and +the residual term, respectively. We could of course add additional residual terms with suitable scaling factors here. + +Note that, similar to the data samples used for supervised training, we have no guarantees that the +residual terms $R$ will actually reach zero during training. The non-linear optimization of the training process +will minimize the supervised and residual terms as much as possible, but worst case, large non-zero residual +contributions can remain. We'll look at this in more detail in the upcoming code example; for now it's important +to remember that physical constraints included in this way only represent _soft constraints_, without guarantees +of actually minimizing these constraints. + +## Neural network derivatives + +In order to compute the residuals at training time, it would be possible to store +the unknowns of $\mathbf{u}$ on a computational mesh, e.g., a grid, and discretize the equations of +$R$ there. This has a fairly long "tradition" in DL, and was proposed by Tompson et al. {cite}`tompson2017` early on. + +Instead, a more widely used variant of employing physical soft-constraints {cite}`raissi2018hiddenphys` +uses fully connected NNs to represent $\mathbf{u}$. This has some interesting pros and cons that we'll outline in the following. +Due to the popularity of this version, we'll also focus on it in the following code examples and comparisons. + +The central idea here is that the aforementioned general function $f$ that we're after in our learning problems +can be seen as a representation of the physical field we're interested in. Thus, $\mathbf{u}(x)$ will +be turned into $\mathbf{u}(x, \theta)$ where we choose $\theta$ such that the solution $\mathbf{u}$ is +represented as precisely as possible. + +One nice side effect of this viewpoint is that NN representations inherently support the calculation of derivatives. +The derivative $\partial f / \partial \theta$ was a key building block for learning via gradient descent, as explained +in {doc}`overview`. Here, we can use the same tools to compute spatial derivatives such as $\partial \mathbf{u} / \partial x$. +Note that above for $R$ we've written this derivative in the shortened notation $\mathbf{u}_{x}$. +For functions over time this of course also works for $\partial \mathbf{u} / \partial t$, i.e. $\mathbf{u}_{t}$ in the notation above. + +Thus, for some generic $R$, made up of $\mathbf{u}_t$ and $\mathbf{u}_{x}$ terms, we can rely on the back-propagation algorithm +of DL frameworks to compute these derivatives once we have a NN that represents $\mathbf{u}$. Essentially, this gives us a +function (the NN) that receives space and time coordinates to produce a solution for $\mathbf{u}$. Hence, the input is typically +quite low-dimensional, e.g., 3+1 values for a 3D case over time, and often produces a scalar value or a spatial vector. +Due to the lack of explicit spatial sampling points, an MLP, i.e., a fully-connected NN, is the architecture of choice here. + +To pick a simple example, for Burgers equation in 1D, +$\frac{\partial u}{\partial{t}} + u \nabla u = \nu \nabla \cdot \nabla u $ , we can directly +formulate a loss term $R = \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2}$ that should be minimized as much as possible at training time.
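To sketch how such a residual can be evaluated in practice (the network size and the names used here are placeholder assumptions, not the setup of the upcoming code example), we can represent $u(x,t)$ with a small fully-connected network and obtain the required derivatives from the framework's automatic differentiation, e.g., via TensorFlow's `GradientTape`:

```python
import tensorflow as tf

# Hypothetical fully-connected net representing u(x, t); layer sizes are
# arbitrary placeholders, not the architecture of the later code example.
u_net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(1),
])
nu = 0.01  # diffusion constant, placeholder value

def burgers_residual(x, t):
    """R = u_t + u u_x - nu u_xx at the sample points (x, t), via backprop."""
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape(persistent=True) as inner:
            inner.watch([x, t])
            u = u_net(tf.concat([x, t], axis=-1))
        u_x = inner.gradient(u, x)       # first spatial derivative
        u_t = inner.gradient(u, t)       # time derivative
    u_xx = outer.gradient(u_x, x)        # second derivative: differentiate again
    return u_t + u * u_x - nu * u_xx

# Example evaluation at a few random sample points; during training, something
# like tf.reduce_mean(R**2) would act as the alpha_1-weighted residual term.
x = tf.random.uniform((16, 1))
t = tf.random.uniform((16, 1))
R = burgers_residual(x, t)
```

The nested tapes mirror the order of differentiation: the inner tape provides $u_x$ and $u_t$, and differentiating $u_x$ once more with the outer tape yields $u_{xx}$.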
For each of the terms, e.g. $\frac{\partial u}{\partial x}$, +we can simply query the DL framework that realizes $u$ to obtain the corresponding derivative. +For higher order derivatives, such as $\frac{\partial^2 u}{\partial x^2}$, we can typically simply query the derivative function of the framework twice. In the following section, we'll give a specific example of how that works in tensorflow. +## Summary so far +This gives us a method to include physical equations into DL learning as a soft-constraint. +Typically, this setup is suitable for _inverse_ problems, where we have certain measurements or observations +that we wish to find a solution of a model PDE for. Because of the high expense of the reconstruction (to be +demonstrated in the following), the solution manifold typically shouldn't be overly complex. E.g., it is difficult +to capture a wide range of solutions, such as the previous supervised airfoil example, in this way. + +```{figure} resources/placeholder.png --- - - -% \newcommand{\pde}{\mathcal{P}} % PDE ops -% \newcommand{\pdec}{\pde_{s}} -% \newcommand{\manifsrc}{\mathscr{S}} % coarse / "source" -% \newcommand{\pder}{\pde_{R}} -% \newcommand{\manifref}{\mathscr{R}} - -% vc - coarse solutions -% \renewcommand{\vc}[1]{\vs_{#1}} % plain coarse state at time t -% \newcommand{\vcN}{\vs} % plain coarse state without time -% vc - coarse solutions, modified by correction -% \newcommand{\vct}[1]{\tilde{\vs}_{#1}} % modified / over time at time t -% \newcommand{\vctN}{\tilde{\vs}} % modified / over time without time -% vr - fine/reference solutions -% \renewcommand{\vr}[1]{\mathbf{r}_{#1}} % fine / reference state at time t , never modified -% \newcommand{\vrN}{\mathbf{r}} % plain coarse state without time - -% \newcommand{\project}{\mathcal{T}} % transfer operator fine <> coarse -% \newcommand{\loss}{\mathcal{L}} % generic loss function -% \newcommand{\nn}{f_{\theta}} -% \newcommand{\dt}{\Delta t} % timestep -% \newcommand{\corrPre}{\mathcal{C}_{\text{pre}}} % analytic correction , "pre computed" -% \newcommand{\corr}{\mathcal{C}} % just C for now... -% \newcommand{\nnfunc}{F} % {\text{NN}} - - -Some notation from SoL, move with parts from overview into "appendix"? - - - -We typically solve a discretized PDE $\mathcal{P}$ by performing discrete time steps of size $\Delta t$. -Each subsequent step can depend on any number of previous steps, -$\mathbf{u}(\mathbf{x},t+\Delta t) = \mathcal{P}(\mathbf{u}(\mathbf{x},t), \mathbf{u}(\mathbf{x},t-\Delta t),...)$, -where -$\mathbf{x} \in \Omega \subseteq \mathbb{R}^d$ for the domain $\Omega$ in $d$ -dimensions, and $t \in \mathbb{R}^{+}$. - -Numerical methods yield approximations of a smooth function such as $\mathbf{u}$ in a discrete -setting and invariably introduce errors. These errors can be measured in terms -of the deviation from the exact analytical solution. -For discrete simulations of -PDEs, these errors are typically expressed as a function of the truncation, $O(\Delta t^k)$ -for a given step size $\Delta t$ and an exponent $k$ that is discretization dependent. - -The following PDEs typically work with a continuous -velocity field $\mathbf{u}$ with $d$ dimensions and components, i.e., -$\mathbf{u}(\mathbf{x},t): \mathbb{R}^d \rightarrow \mathbb{R}^d $. -For discretized versions below, $d_{i,j}$ will denote the dimensionality -of a field such as the velocity, -with domain size $d_{x},d_{y},d_{z}$ for source and reference in 3D. - -% with $i \in \{s,r\}$ denoting source/inference manifold and reference manifold, respectively. 
-%This yields $\vc{} \in \mathbb{R}^{d \times d_{s,x} \times d_{s,y} \times d_{s,z} }$ and $\vr{} \in \mathbb{R}^{d \times d_{r,x} \times d_{r,y} \times d_{r,z} }$ -%Typically, $d_{r,i} > d_{s,i}$ and $d_{z}=1$ for $d=2$. - -For all PDEs, we use non-dimensional parametrizations as outlined below, -and the components of the velocity vector are typically denoted by $x,y,z$ subscripts, i.e., -$\mathbf{u} = (u_x,u_y,u_z)^T$ for $d=3$. - -Burgers' equation in 2D. It represents a well-studied advection-diffusion PDE: - -$\frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = - \nu \nabla\cdot \nabla u_x + g_x(t), - \\ - \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = - \nu \nabla\cdot \nabla u_y + g_y(t) -$, - -where $\nu$ and $\mathbf{g}$ denote diffusion constant and external forces, respectively. - -Burgers' equation in 1D without forces with $u_x = u$: -%\begin{eqnarray} -$\frac{\partial u}{\partial{t}} + u \nabla u = \nu \nabla \cdot \nabla u $ . - +height: 220px +name: pinn-training --- - -Later on, additional equations... +TODO, visual overview of PINN training +``` - -Navier-Stokes, in 2D: - -$ - \frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = - - \frac{1}{\rho}\nabla{p} + \nu \nabla\cdot \nabla u_x - \\ - \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = - - \frac{1}{\rho}\nabla{p} + \nu \nabla\cdot \nabla u_y - \\ - \text{subject to} \quad \nabla \cdot \mathbf{u} = 0 -$ - - - -Navier-Stokes, in 2D with Boussinesq: - -%$\frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x$ -%$ -\frac{1}{\rho} \nabla p $ - -$ - \frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = - \frac{1}{\rho} \nabla p - \\ - \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = - \frac{1}{\rho} \nabla p + \eta d - \\ - \text{subject to} \quad \nabla \cdot \mathbf{u} = 0, - \\ - \frac{\partial d}{\partial{t}} + \mathbf{u} \cdot \nabla d = 0 -$ - - - -Navier-Stokes, in 3D: - -$ - \frac{\partial u_x}{\partial{t}} + \mathbf{u} \cdot \nabla u_x = - \frac{1}{\rho} \nabla p + \nu \nabla\cdot \nabla u_x - \\ - \frac{\partial u_y}{\partial{t}} + \mathbf{u} \cdot \nabla u_y = - \frac{1}{\rho} \nabla p + \nu \nabla\cdot \nabla u_y - \\ - \frac{\partial u_z}{\partial{t}} + \mathbf{u} \cdot \nabla u_z = - \frac{1}{\rho} \nabla p + \nu \nabla\cdot \nabla u_z - \\ - \text{subject to} \quad \nabla \cdot \mathbf{u} = 0. -$ diff --git a/references.bib b/references.bib index c48bb3e..3e14d58 100644 --- a/references.bib +++ b/references.bib @@ -762,3 +762,34 @@ PUBLISHER = {Dept. 
of Computer Science 10, University of Erlangen-Nuremberg} } + + + + +% ----------------- external -------------------- + + +@inproceedings{tompson2017, + title = {Accelerating Eulerian Fluid Simulation With Convolutional Networks}, + booktitle = {Proceedings of Machine Learning Research}, + author = {Tompson, Jonathan and Schlachter, Kristofer and Sprechmann, Pablo and Perlin, Ken}, + year = 2017, + pages = {3424--3433} +} + +@article{raissi2018hiddenphys, + title={Hidden physics models: Machine learning of nonlinear partial differential equations}, + author={Raissi, Maziar and Karniadakis, George Em}, + journal={Journal of Computational Physics}, + volume={357}, + pages={125--141}, + year={2018}, + publisher={Elsevier} +} + + + + + + + diff --git a/resources/placeholder.png b/resources/placeholder.png new file mode 100644 index 0000000..993d88e Binary files /dev/null and b/resources/placeholder.png differ diff --git a/supervised.md b/supervised.md index f49b776..978f250 100644 --- a/supervised.md +++ b/supervised.md @@ -1,5 +1,85 @@ Supervised Learning ======================= -Doing things the old fashioned way... +_Supervised_ here essentially means: "doing things the old fashioned way". Old fashioned in the context of +deep learning (DL), of course, so it's still fairly new, and old fashioned of course also doesn't always mean bad. +In a way this viewpoint is a starting point for all projects one would encounter in the context of DL, and +hence is worth studying. And although it typically yields inferior results to approaches that more tightly +couple with physics, it nonetheless can be the only choice in certain application scenarios where no good +model equations exist. + +## Problem Setting + +For supervised learning, we're faced with an +unknown function $f^*(x)=y$: we collect lots of pairs of data $[x_0,y_0], ...[x_n,y_n]$ (the training data set) +and directly train a NN to represent an approximation of $f^*$ denoted as $f$, such +that $f(x)=y$. + +The $f$ we can obtain is typically not exact; +instead, we obtain it via a minimization problem: +by adjusting the weights $\theta$ of our representation $f$ such that + +$\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y_i)^2$. + +This will give us $\theta$ such that $f(x;\theta) \approx y$ as accurately as possible given +our choice of $f$ and the hyperparameters for training. Note that above we've assumed +the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$ +to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y_i)$. The choice +of a suitable metric is a topic we will get back to later on. + +Irrespective of our choice of metric, this formulation +gives the actual "learning" process for a supervised approach. + +The training data typically needs to be of substantial size, and hence it is attractive +to use numerical simulations to produce a large number of training input-output pairs. +This means that the training process uses a set of model equations, and approximates +them numerically, in order to train the NN representation $f$. This +has a bunch of advantages, e.g., we don't have the measurement noise of real-world devices +and we don't need manual labour to annotate a large number of samples to get training data. + +On the other hand, this approach inherits the common challenges of replacing experiments +with simulations: first, we need to ensure the chosen model has enough power to predict the +behavior of real-world phenomena that we're interested in.
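Before discussing further challenges, here is a minimal end-to-end sketch of this supervised pipeline (the "simulation" is just a made-up placeholder function, and all names and sizes are illustrative assumptions, not the airfoil setup discussed below):

```python
import numpy as np
import tensorflow as tf

# Step 1: generate training pairs [x_i, y_i]. Here a made-up analytic function
# stands in for a numerical solver; in practice y would be simulation output.
x = np.random.uniform(-1.0, 1.0, size=(1024, 1)).astype(np.float32)
y = (np.sin(4.0 * x) * x).astype(np.float32)

# Step 2: a small NN f(x; theta) as the approximation of f*.
f = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Step 3: "learning" = arg min_theta sum_i (f(x_i; theta) - y_i)^2, i.e. an MSE loss.
f.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
f.fit(x, y, epochs=10, batch_size=64, verbose=0)
```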
+ +In addition, the numerical approximations introduce errors +which need to be kept small enough for a chosen application. As these topics are studied in depth +for classical simulations, the existing knowledge can likewise be leveraged to +set up DL training tasks. + +```{figure} resources/placeholder.png +--- +height: 220px +name: supervised-training +--- +TODO, visual overview of supervised training +``` + +## Applications + +Let's directly look at an example with a fairly complicated context: +we have a turbulent airflow around wing profiles, and we'd like to know the average motion +and pressure distribution around this airfoil for different Reynolds numbers and angles of attack. +Thus, given an airfoil shape, Reynolds number, and angle of attack, we'd like to obtain +a velocity field $\mathbf{u}$ and a pressure field $p$ in a computational domain $\Omega$ +around the airfoil in the center of $\Omega$. + +This is classically approximated with _Reynolds-Averaged Navier-Stokes_ (RANS) models, and this +setting is still one of the most widely used applications of Navier-Stokes solvers in industry. +However, instead of relying on traditional numerical methods to solve the RANS equations, +we now aim to train a neural network that completely bypasses the numerical solver, +and produces the solution in terms of $\mathbf{u}$ and $p$. + +## Discussion + +TODO , add as separate section after code? +TODO , discuss pros / cons of supervised learning +TODO , CNNs powerful, graphs & co likewise possible + +Pro: +- very fast output and training + +Con: +- lots of data needed +- undesirable averaging / inaccuracies due to direct loss + +Outlook: interactions with external "processes" (such as embedding into a solver) very problematic, see DP later on...
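To make the airfoil example above a bit more tangible, here is a rough sketch of what such a network could look like, assuming the freestream conditions and airfoil geometry are encoded as channels of a Cartesian grid (the resolution, channel layout and layer sizes are purely illustrative assumptions, not necessarily those of the accompanying code example):

```python
import tensorflow as tf

# Hypothetical setup: inputs on a 128x128 grid with 3 channels
# (e.g. freestream u_x, freestream u_y, airfoil mask), outputs with 3 channels
# (p, u_x, u_y). Purely illustrative; all sizes and channels are assumptions.
inputs = tf.keras.Input(shape=(128, 128, 3))
h = tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu")(inputs)
h = tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu")(h)
outputs = tf.keras.layers.Conv2D(3, 5, padding="same")(h)   # [p, u_x, u_y]
rans_net = tf.keras.Model(inputs, outputs)

# Trained exactly like the generic supervised sketch above: an L2/MSE loss
# between the network output and the RANS reference solution on the same grid.
rans_net.compile(optimizer="adam", loss="mse")
```

Such a fully convolutional network would simply be fit to precomputed RANS reference solutions, which is what makes the supervised approach fast at inference time but entirely dependent on the quality and coverage of the training data.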