update supervised chapter

This commit is contained in:
N_T 2025-02-06 14:05:50 +08:00
parent 4dd1611430
commit 0981a281fe
5 changed files with 94 additions and 110 deletions


@ -32,7 +32,7 @@ Most importantly, this version has a large new chapter on generative modeling, o
As a _sneak preview_, the next chapters will show:
- How to train neural networks to [predict the fluid flow around airfoils with diffusion modeling](probmodels-ddpm-fm). This gives a probabilistic _surrogate model_ that replaces and outperforms traditional simulators.
- How to use model equations as residuals to train networks that [represent solutions](diffphys-dpvspinn), and how to improve upon these residual constraints by using [differentiable simulations](diffphys-code-sol).


@ -11,16 +11,15 @@
"## Overview\n",
"\n",
"For this example of supervised training\n",
"we have a turbulent airflow around wing profiles, and we'd like to know the average motion\n",
"and pressure distribution around this airfoil for different Reynolds numbers and angles of attack.\n",
"Thus, given an airfoil shape, Reynolds numbers, and angle of attack, we'd like to obtain\n",
"a velocity field and a pressure field around the airfoil.\n",
"we target turbulent airflows around wing profiles: the learned operator should provide the average motion\n",
"and pressure distribution around a given airfoil geometry for different Reynolds numbers and angles of attack.\n",
"Thus, inputs to the neural network are airfoil shape, Reynolds numbers, and angle of attack, and it should compute \n",
"a time averaged velocity field with 2 components, and the pressure field around the airfoil.\n",
"\n",
"This is classically approximated with _Reynolds-Averaged Navier Stokes_ (RANS) models, and this\n",
"setting is still one of the most widely used applications of Navier-Stokes solver in industry.\n",
"setting is still one of the most widely used applications of Navier-Stokes solvers in industry.\n",
"However, instead of relying on traditional numerical methods to solve the RANS equations,\n",
"we now aim for training a surrogate model via a neural network that completely bypasses the numerical solver,\n",
"and produces the solution in terms of velocity and pressure.\n",
"we now aim for training a surrogate model via a neural network that completely bypasses the numerical solver.\n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/supervised-airfoils.ipynb)\n"
]
},
@ -52,7 +51,7 @@
"$[f_x,f_y,\\text{mask}]$, while the outputs store the channels $[p,u_x,u_y]$.\n",
"This is exactly what we'll specify as input and output dimensions for the NN below.\n",
"\n",
"A point to keep in mind here is that our quantities of interest in $y^*$ contain three different physical fields. While the two velocity components are quite similar in spirit, the pressure field typically has a different behavior with an approximately squared scaling with respect to the velocity (cf. [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli%27s_principle)). This implies that we need to be careful with simple summations (as in the minimization problem above), and that we should take care to normalize the data.\n"
"A point to keep in mind here is that our quantities of interest in $y^*$ contain three different physical fields. While the two velocity components are quite similar in spirit, the pressure field typically has a different behavior with an approximately squared scaling with respect to the velocity (cf. [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli%27s_principle)). This implies that we need to be careful with simple summations (as in the minimization problem above), and that we should take care to normalize the data. If we don't take care, one of the components can dominate and the aggregation in terms of mean will lead the NN to spend more resources to learn the large component rather than the other ones causing smaller errors.\n"
]
},
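To make this concrete, here's a minimal sketch of such a normalization; the arrays and the freestream magnitude `v_mag` are placeholders, and the actual dataset preparation may differ in its details:

```python
import numpy as np

# Placeholder fields standing in for one training sample's targets.
p  = np.random.randn(128, 128)      # pressure
ux = np.random.randn(128, 128)      # x-velocity
uy = np.random.randn(128, 128)      # y-velocity
v_mag = 10.0                        # hypothetical freestream speed

# Velocities scale linearly with the freestream, pressure roughly
# quadratically (cf. Bernoulli), so divide by v_mag and v_mag**2.
ux_n, uy_n = ux / v_mag, uy / v_mag
p_n = (p - p.mean()) / v_mag**2     # also remove the constant pressure offset
```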
{
@ -138,9 +137,9 @@
"source": [
"## RANS training data\n",
"\n",
"Now we have the training and validation data. In general it's very important to understand the data we're working with as much as possible (for any ML task the _garbage-in-gargabe-out_ principle definitely holds). We should at least understand the data in terms of dimensions and rough statistics, but ideally also in terms of content. Otherwise we'll have a very hard time interpreting the results of a training run. And despite all the DL magic: if you can't make out any patterns in your data, NNs surely won't find any useful ones.\n",
"Now we have the training and validation data. In general it's very important to understand the data we're working with as much as possible (for any ML task the _garbage-in-gargabe-out_ principle definitely holds). We should at least understand the data in terms of dimensions and rough statistics, but ideally also in terms of content. Otherwise we'll have a very hard time interpreting the results of a training run. And despite all the _AI magic_: if you can't make out any patterns in your data, NNs most likely also won't find any useful ones.\n",
"\n",
"Hence, let's look at one of the training samples... The following is just some helper code to show images side by side."
"Hence, let's look at one of the training samples. The following is just some helper code to show images side by side."
]
},
{
@ -201,7 +200,7 @@
"source": [
"## Network setup\n",
"\n",
"Now we can set up the architecture of our neural network, we'll use a fully convolutional U-net. This is a widely used architecture that uses a stack of convolutions across different spatial resolutions. The main deviation from a regular conv-net is to introduce _skip connection_ from the encoder to the decoder part. This ensures that no information is lost during feature extraction. (Note that this only works if the network is to be used as a whole. It doesn't work in situations where we'd, e.g., want to use the decoder as a standalone component.)\n",
"Now we can set up the architecture of our neural network, we'll use a fully convolutional U-net. This is a widely used architecture that uses a stack of convolutions across different spatial resolutions. The main deviation from a regular conv-net is the hierarchy (for a global receptive field), and to introduce _skip connections_ from the encoder to the decoder part. This ensures that no information is lost during feature extraction. (Note that this only works if the network is to be used as a whole. It doesn't work in situations where we'd, e.g., want to use the decoder as a standalone component.)\n",
"\n",
"Here's a overview of the architecure:\n",
"\n",
@ -469,7 +468,7 @@
"id": "4KuUpJsSL3Jv"
},
"source": [
"The NN is finally trained! The losses should have nicely gone down in terms of absolute values: With the standard settings from an initial value of around 0.2 for the validation loss, to ca. 0.02 after training.\n",
"The NN is trained, the losses should have gone down in terms of absolute values: With the standard settings from an initial value of around 0.2 for the validation loss, to ca. 0.02 after training.\n",
"\n",
"Let's look at the graphs to get some intuition for how the training progressed over time. This is typically important to identify longer-term trends in the training. In practice it's tricky to spot whether the overall trend of 100 or so noisy numbers in a command line log is going slightly up or down - this is much easier to spot in a visualization."
]
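The plotting cell itself is not shown here; a minimal sketch of such a visualization, with dummy histories standing in for the per-epoch losses collected during training, could look like this:

```python
import matplotlib.pyplot as plt

# Dummy per-epoch histories; in the notebook these would be collected
# inside the training loop instead.
train_hist = [0.20 * 0.96**i for i in range(100)]
val_hist   = [0.22 * 0.965**i for i in range(100)]

plt.plot(train_hist, label="training loss")
plt.plot(val_hist, label="validation loss")
plt.yscale("log")   # a log scale makes slow long-term trends easier to spot
plt.xlabel("epoch"); plt.ylabel("L1 loss"); plt.legend(); plt.show()
```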
@ -512,15 +511,9 @@
"source": [
"You should see a curve that goes down for ca. 40 epochs, and then starts to flatten out. In the last part, it's still slowly decreasing, and most importantly, the validation loss is not increasing. This would be a certain sign of overfitting, and something that we should avoid. (Try decreasing the amount of training data artificially, then you should be able to intentionally cause overfitting.)\n",
"\n",
"## Training progress and validation\n",
"Note that the validation loss is generally higher above, as the dataset here is relatively small. At some point, the network will not be able to get new information from it that transfers to the validation samples.\n",
"\n",
"If you look closely at this graph, you should spot something peculiar:\n",
"_Why is the validation loss lower than the training loss in the initial phase_?\n",
"The data is similar to the training data of course, but in a way it's slightly \"tougher\", because the network certainly never received any validation samples during training. It is natural that the validation loss slightly deviates from the training loss, but how can the L1 loss be _lower_ for these inputs?\n",
"\n",
"This is caused by the way the the training loop above is implemented in pytorch: the training loss for each point in the graph is measured with the evolving state of the network in an epoch. The network is updated, and afterwards runs through the validation samples. Thus all validation samples are using a state that is slightly different (and hopefully a bit better) than the initial states of the epoch. Due to both reasons, the validation loss can deviate, and in this example it's slightly lower initially, while later on the network starts to slow down in terms of improving the new conditions of the validation samples. This indicates a slight overfitting for the training data, but nonetheless the validation loss still increases.\n",
"\n",
"A general word of caution here: never evaluate your network with training data! That won't tell you much because overfitting is a very common problem. At least use data the network hasn't seen before, i.e. validation data, and if that looks good, try some more different (at least slightly out-of-distribution) inputs, i.e., _test data_. The next cell runs the trained network on a batch of samples from the validation data, and displays one with the `plot` function.\n",
"A general word of caution here: never evaluate your network with training data. That won't tell you much because overfitting is a very common problem. At least use data the network hasn't seen before, i.e. validation data, and if that looks good, try some more different (at least slightly out-of-distribution) inputs, i.e., _test data_. The next cell runs the trained network on a batch of samples from the validation data, and displays one with the `plot` function.\n",
"\n"
]
},
@ -579,7 +572,7 @@
"source": [
"## Test evaluation\n",
"\n",
"Now let's look at actual test samples: In this case we'll use new airfoil shapes as out-of-distribution (OOD) data. These are shapes that the network never saw in any training samples, and hence it tells us a bit about how well the NN generalizes to unseen inputs (the validation data wouldn't suffice to draw conclusions about generalization).\n",
"Now let's look at actual test samples: In this case we'll use new airfoil shapes as out-of-distribution (OOD) data. These are shapes that the network has not seen in any training samples, and hence it tells us a bit about how well the NN generalizes to unseen inputs (the validation data wouldn't suffice to draw conclusions about generalization).\n",
"\n",
"We'll use the same visualization as before, and as indicated by the Bernoulli equation, especially the _pressure_ in the first column is a challenging quantity for the network. Due to it's cubic scaling w.r.t. the input freestream velocity and localized peaks, it is the toughest quantity to infer for the network.\n",
"\n",
@ -676,7 +669,7 @@
"source": [
"The average test error with the default settings should be close to 0.025. As the inputs are normalized, this means the average relative error across all three fields is around 2.5% w.r.t. the maxima of each quantity. This is not too bad for new shapes, but clearly leaves room for improvement.\n",
"\n",
"Looking at the visualizations, you'll notice that especially high-pressure peaks and pockets of larger y-velocities are missing in the outputs. This is primarily caused by the small network, which does not have enough resources to reconstruct details.\n",
"Looking at the visualizations, you'll notice that especially high-pressure peaks and pockets of larger y-velocities are missing in the outputs. This is primarily caused by the small network, which does not have enough resources to reconstruct details. The $L^2$ also has an averaging behavior, and favours larger structures (the surroundings) over localized peaks.\n",
"\n",
"Nonetheless, we have successfully replaced a fairly sophisticated RANS solver with a small and fast neural network architecture. It has GPU support \"out-of-the-box\" (via pytorch), is differentiable, and introduces an error of only a few per-cent. With additional changes and more data, this setup can be made highly accurate {cite}`chen2021highacc`.\n",
"\n",


@ -1,7 +1,7 @@
Neural Network Architectures
=======================
The connectivity of the individual "neurons" in a neural network has a substantial influence on its capabilities. Typical networks consist of a large number of these connected "neuron" units. Over the course of many years, several key architectures have emerged as particularly useful choices, and in the following we'll go over the main considerations for choosing an architecture. Our focus is to introduce ways of incorporating PDE-based models (the "physics"), rather than the subtleties of NN architectures.
```{figure} resources/arch01.jpg
---
@ -13,30 +13,33 @@ We'll discuss a range of architecture, from regular convolutions over graph- and
## Spatial Arrangement
A first, fundamental aspect to consider for choosing an architecture in the context of physics simulations
(and for ruling out unsuitable options) is the spatial arrangement of the data samples.
We can distinguish four main cases:
1. No spatial arrangement,
2. A regular spacing on a grid (_structured_),
3. An irregular arrangement (_unstructured_), and
4. Irregular positions without connectivity (_particle_ / _point_-based).
For certain problems, there is no spatial information or arrangement (case 1). E.g., predicting the temporal evolution of temperature and pressure of a single measurement probe over time does not involve any spatial dimension. The opposite case are probes placed on a nicely aligned grid (case 2), or at arbitrary locations (case 3). The last variant (case 4) is a slightly special case of the third one, where no clear or persistent links between sample points are known. This can, e.g., be the case for particle-based representations of turbulent fluids, where neighborhoods change quickly over time.
## No spatially arranged inputs
The first case is a somewhat special one: without any information about spatial arrangements, only _dense_ ("fully connected" / MLP) neural networks are applicable.
If you decide to use a **neural fields** approach where the network receives the position as input, this has the same effect: the NN will not have any direct means of querying neighbors via architectural tricks ("inductive biases"). In this case, the building blocks below won't be applicable, and it's worth considering whether you can introduce more structure via a discretization.
Note that _physics-informed neural networks_ (PINNs) also fall into this category. We'll go into more detail here later on ({doc}`diffphys`), but generally it's advisable to consider switching to an approach that employs prior knowledge in the form of a discretization. This usually substantially improves inference accuracy and improves convergence. That PINNs can't solve real-world problems despite many years of research points to the fundamental problems of this approach.
Focusing on dense layers still leaves a few choices concerning the number of layers, their size, and activations. The other three cases have the same choices, and these hyperparameters of the architectures are typically determined over the course of training runs. General recommendations are that _ReLU_ and smoother variants like _GELU_ are good choices, and that the number of layers should scale together with their size.
Next, we'll focus on the three remaining cases with spatial information, as differences can have a profound impact here. So, below we target cases where we have a "computational domain" specifying the region of interest in which the samples are located.
## Local vs Global
Probably the most important aspect of different architectures then is the question of their _receptive field_: for any single sample in our domain, which neighborhood of other sample points can influence the solution at this point? This is similar to classic considerations for PDE solving, where denoting a PDE as _hyperbolic_ indicates its local, wave-like behavior, in contrast to an _elliptic_ one with global behavior. Certain NN architectures, such as classic convolutional neural networks (CNNs), support only local influences and receptive fields, while hierarchies with pooling expand these receptive fields to effectively global ones. An interesting variant here are spectral architectures like FNOs, which provide global receptive fields at the expense of other aspects. In addition, Transformers (with attention mechanisms) provide a more complicated but scalable alternative.
Thus, a fundamental distinction can be made in terms of local vs global architectures, and for the latter, how they realize the global receptive field. The following table provides a first overview, and below we'll discuss the pros and cons of each variant.
| | Grid | Unstructured | Points | Non-spatial |
|-------------|-----------------|-------------------|-------------------|-------------|
@ -47,12 +50,11 @@ The most important aspect of different architectures then is the question of the
| - Sequence | Transformer | Graph Transformer | Point Trafo. | - |
```{note}
Knowledge about the dependencies in your data, i.e., whether the dependencies are local or global, is important knowledge that should be leveraged.
If your data has primarily **local** influences, choosing an architecture with support for global receptive fields will most likely have negative effects on accuracy: the network will "waste" resources trying to capture global effects, or worst case approximate local effects with smoothed out global modes.
Vice versa, trying to approximate a **global** influence with a limited receptive field will be an unsolvable task, and most likely introduce substantial errors.
```
@ -72,9 +74,8 @@ A 3x3 convolution (orange) shown for differently deformed regular multi-block gr
```
For unstructured data, graph-based neural networks (GNNs) are a good choice. While they're often discussed in terms of _message-passing_ operations,
they share a lot of similarities with structured grids: the basic operation of a message-passing step on a GNN is equivalent to a convolution on a grid {cite}`sanchez2020learning`.
Hierarchies can likewise be constructed by graph coarsening {cite}`lino2024dgn`. Hence, while we'll primarily discuss grids below, keep in mind that the approaches carry over to GNNs. As dealing with graph structures makes the implementation more complicated, we won't go into details until later on in {doc}`graphs`.
```{figure} resources/arch02.jpg
---
@ -91,14 +92,13 @@ to Lagrangian data: continuous convolution kernels are a suitable tool, and neig
## Hierarchies
A powerful and natural tool to work with **local** dependencies are convolutional layers on regular grids. The corresponding neural networks (CNNs) are
a classic building block of deep learning, and very well researched and supported throughout. They are comparatively easy to
train, and usually very efficiently implemented in APIs. They also provide a natural connection to classical numerics: discretizations
of differential operators such as gradients and Laplacians are often thought of in terms of "stencils", which are an equivalent of
a convolutional layer with a set of specific weights. E.g., consider the classic stencil for a normalized Laplacian $\nabla^2$ in 1D: $[1, -2, 1]$.
It can directly be mapped to a 1D convolution with kernel size 3 and a single input and output channel. The non-trainable weights of the kernel
can be set to the coefficients of the Laplacian stencil above.
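As a small sketch of this correspondence in pytorch (the $1/\Delta x^2$ normalization is omitted, matching the unnormalized stencil above):

```python
import torch

# A 1D convolution acting as the (unnormalized) Laplacian stencil [1, -2, 1].
lap = torch.nn.Conv1d(1, 1, kernel_size=3, bias=False)
with torch.no_grad():
    lap.weight[:] = torch.tensor([[[1., -2., 1.]]])
lap.weight.requires_grad_(False)        # fixed, non-trainable weights

x = torch.linspace(0., 1., 8).pow(2).reshape(1, 1, -1)  # f(x) = x^2
print(lap(x))  # interior outputs give f''(x) * dx^2 = 2 * (1/7)^2 here
```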
Using convolutional layers is quite straightforward, but the question of how to incorporate **global** dependencies into
CNNs is an interesting one. Over time, two fundamental approaches have been established here in the field:
@ -116,7 +116,7 @@ A 3x3 convolution shown for a pooling-based hierarchy (left), and a dilation-bas
* U-Nets are based on _pooling_ operations. Akin to a geometric multigrid hierarchy, the spatial samples are downsampled to coarser and
coarser grids, and upsampled in the later half of the network. This means that even if we keep the kernel size of a convolution fixed, the
convolution will be able to "look" further in terms of physical space due to the previous downsampling operation.
The number of sample points decreases logarithmically, making convolutions on lower hierarchy levels very efficient.
While the different re-sampling methods (mean, average, point-wise ...) have a minor effect, a crucial ingredient for U-Nets are the _skip connections_. They connect
the earlier layers of the first half directly with the second half via feature concatenation. This turns out to be crucial to avoid
a loss of information. Typically, the deepest "bottle-neck" layer with the coarsest representation has trouble storing all details
@ -128,7 +128,7 @@ typically simply ignored. In contrast to a hierarchy, the number of sample point
While both approaches reach the goal, and can perform very well, there's an interesting tradeoff: U-Nets take a bit more effort to implement, but can be much faster. The reason for the performance boost is the sub-optimal memory access of the dilated convolutions: they skip through memory with a large stride, which gives slower performance. The U-Nets, on the other hand, basically precompute a compressed memory representation in the form of a coarse grid. Convolutions on this coarse grid are then highly efficient to compute. However, this requires slightly more effort to implement in the form of adding appropriate pooling layers (dilated convolutions can be as easy to implement as replacing the call to the regular convolution with a dilated one). The implementation effort of a U-Net can pay off significantly in the long run, when a trained network should be deployed in an application.
As mentioned above hierarchies are likewise important for graph nets. However, the question whether to "dilate or not" is not present for graph nets: here the memory access is always irregular, and dilation is unpopular as the strides would be costly to compute on general graphs. Hence, regular hierarchies in the form of multi-scale GNNs are highly recommended if global dependencies exist in the data.
## Spectral methods
@ -144,7 +144,7 @@ An inherent advantage and consequence of the frequency domain is that all basis
height: 200px
name: arch06-fno
---
Spatial convolutions (left, kernel in orange) and frequency processing in FNOs (right, coverage of dense layer in yellow). Not only do FNOs scale less well in 3D (**6th** instead of 5th power), their scaling constant is also proportional to the domain size, and hence typically larger.
```
Unfortunately, they're not well suited for higher dimensional problems: Moving from two to three dimensions increases the size of the frequencies to be handled to $M^3$. For the dense layer, this means $M^6$ parameters, a cubic increase. For convolutions, there's no huge difference in 2D:
@ -152,7 +152,7 @@ Unfortunately, they're not well suited for higher dimensional problems: Moving f
However, in 3D regular convolutions scale much better: in 3D only the kernel size increases to $K^3$, giving an overall complexity of $O(K^5)$ in 3D.
Thus, the exponent is 5 instead of 6.
To make things worse, the frequency coverage $M$ of FNOs needs to scale with the size of the spatial domain, hence typically $M>K$ and $M^6 \gg K^5$. Thus, FNOs would require intractable amounts of parameters, and are thus not recommendable for 3D (or higher dimensional) problems. Architectures like CNNs require much fewer weights, and in conjunction with hierarchies can still handle global dependencies efficiently.
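Plugging illustrative numbers into these scaling laws makes the gap tangible; $M$ and $K$ below are hypothetical but in a realistic range:

```python
# Weight-count scaling from the text: M^6 for a 3D FNO layer vs. K^5 for a 3D conv.
M = 32   # hypothetical frequency coverage per dimension, typically M > K
K = 5    # hypothetical convolution kernel size
print(f"FNO layer  ~ M^6 = {M**6:.2e} weights")   # ca. 1.1e9
print(f"conv layer ~ K^5 = {K**5:.2e} weights")   # ca. 3.1e3
```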
<br>
@ -160,17 +160,17 @@ To make things worse, the frequency coverage $M$ of FNOs needs to scale with the
## Attention and Transformers
A newer and exciting development in the deep learning field are attention mechanisms. They've been hugely successful in the form of _Transformers_ for processing language and natural images, and bear promise for physics-related problems. However, it's still open, whether they're really generally preferable over more "classic" architectures. The following section will give an overview of the main pros and cons.
Transformers generally work in two steps: the input is encoded into _tokens_ with an encoder-decoder network. This step can take many forms, and usually primarily serves to reduce the number of inputs, e.g., to work with pieces of an image rather than individual pixels. The attention mechanism then computes a weighting for a collection of incoming tokens. This is a floating point number for each token, traditionally interpreted as indicating which parts of the input are important, and which aren't. In modern architectures, the floating point weightings of the attention are directly used to modify an input. In _self-attention_, the weighting is computed from each input towards all other input tokens. This is a mechanism to handle **global dependencies**, and hence directly fits into the discussion above. In practice, the attention is computed via three matrices: the query $Q$, the key matrix $K$, and a value matrix $V$. For $N$ tokens, the outer product $Q K^T$ produces an $N \times N$ matrix, which runs through a Softmax layer, after which it is multiplied with $V$ (containing a linear projection of the input tokens) to produce the attention output vector.
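To make the algorithm concrete, here's a minimal single-head self-attention sketch in pytorch; the $1/\sqrt{d}$ scaling inside the Softmax is common practice and an addition to the description above:

```python
import torch

def self_attention(x, Wq, Wk, Wv):
    # x: [N, d] tokens; Wq, Wk, Wv: [d, d] projection matrices
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = torch.softmax(Q @ K.T / K.shape[-1]**0.5, dim=-1)  # [N, N] attention
    return A @ V                                           # weighted values

N, d = 64, 32
x = torch.randn(N, d)
Wq, Wk, Wv = [torch.randn(d, d) / d**0.5 for _ in range(3)]
print(self_attention(x, Wq, Wk, Wv).shape)  # torch.Size([64, 32])
```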
In a Transformer architecture, the attention output is used as a component of a building block: the attention is calculated and used as a residual (added to the input), stabilized with a layer normalization, and then processed in a two-layer _feed forward_ network. The latter is simply a combination of two dense layers with an activation in between. This Transformer block is applied multiple times before the final output is decoded into the original space.
This Transformer architecture was shown to scale extremely well to networks with huge numbers of parameters, one of the key advantages of Transformers. Note that a large part of the weights typically ends up in the matrices of the attention, and not just in the dense layers. At the same time, attention offers a powerful way for working with global dependencies in inputs. This comes at the cost of a more complicated architecture. An inherent problem of the self-attention mechanism above is that it's quadratic in the number of tokens $N$. This naturally puts a limit on the size and resolution of inputs. Under the hood, it's also surprisingly simple: the attention algorithm computes an $N \times N$ matrix, which is not too far from applying a simple dense layer (this would likewise come with an $N \times N$ weight matrix) to resolve global influences.
This bottleneck can be addressed with _linear attention_: it changes the algorithm above to multiply $Q$ and $(K^T V)$ instead, applying a non-linearity (e.g., an exponential) to both parts beforehand. This avoids the $N \times N$ matrix and scales linearly in $N$. However, this improvement comes at the cost of a more approximate attention vector.
An interesting aspect of Transformer architectures is also that they've been applied to structured as well as unstructured inputs. I.e., they've been used for graphs, points as well as grid-based data. In all cases the differences primarily lie in how inputs are mapped to the tokens. The attention is typically still "dense" in the token space. This is a clear limitation: for problems with a known spatial structure, discarding this information will inevitably need to be compensated for, e.g., with a larger weight count or lower inference accuracy. Nonetheless, Transformers are an extremely active field within DL, and clearly a potential contender for future NN algorithms.
![Divider](resources/divider7.jpg)
@ -179,7 +179,11 @@ An interesting aspect of Transformer architectures is also that they've been app
## Summary of Architectures
The paragraphs above have given an overview of several fundamental considerations when choosing a neural network architecture for a physics-related problem. To re-cap, the
main consideration when choosing an architecture is knowledge about **local** or **global** dependencies in the data. Tailoring an architecture to this difference can have a big impact.
And while the spatial structure of the data seems to dictate choices, it can be worth considering to transfer the data to another data structure. E.g., to project unstructured points onto a (deformed) regular grid to potentially improve accuracy and performance.
Also, it should be mentioned that hybrids of the _canonical_ architectures mentioned above exist: e.g., classic U-Nets with skip connections have been equipped with components of Transformer architectures (like attention and normalization) to yield an improved performance. An implementation of such a "modernized" U-Net can be found in {doc}`probmodels-time`.
## Show me some code!
Let's finally look at a code example that trains a neural network: we'll replace a full solver for _turbulent flows around airfoils_ with a surrogate model from {cite}`thuerey2020dfp` using a U-Net with a global receptive field as operator.


@ -1,11 +1,11 @@
Discussion of Supervised Approaches
=======================
The previous example illustrates that
supervised training serves as a basis that can solve non-trivial tasks.
The main workload is collecting a large enough data set of examples.
Once that exists, we can train a network to approximate the solution manifold
represented by these solutions, and the trained network can give us predictions
very quickly. There are a few important points to keep in mind when
using supervised training.
@ -15,8 +15,8 @@ using supervised training.
### Natural starting point
_Supervised training_ is the natural starting point for **any** DL project. It
really **always** makes sense to start with a fully supervised
test using as little data as possible. This will be a pure overfitting test,
but if your network can't quickly converge and give a very good performance
on a single example, then there's something fundamentally wrong
@ -29,10 +29,11 @@ setups that will make finding these fundamental problems more difficult.
To summarize the scattered comments of the previous sections, here's a set of "golden rules" for setting up a DL project.
- Always start with a 1-sample overfitting test, as sketched below.
- Check how many trainable parameters your network has, and that your data is normalized properly.
- Make sure the NN converges.
- Then slowly increase the amount of training data (and potentially network parameters and depth).
- Adjust hyperparameters (especially the learning rate).
- Finally, introduce other components such as differentiable solvers or diffusion training.
```
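As a sketch of the first rule, a 1-sample overfitting test could look as follows; the tiny network and the stand-in data are hypothetical, and in practice you'd use one real sample pair from your dataset:

```python
import torch

# 1-sample overfitting test: one fixed input-target pair, tiny network.
net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
)
x = torch.randn(1, 3, 32, 32)      # stand-in for one training sample
y = torch.tanh(x)                  # smooth stand-in target
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for i in range(1000):
    opt.zero_grad()
    loss = (net(x) - y).abs().mean()   # L1 loss, as in the airfoil example
    loss.backward()
    opt.step()
print(f"final L1 loss: {loss.item():.2e}")  # should end up very small
```

If this loss does not quickly approach zero, something fundamental is wrong with the setup, and it's not worth moving on to larger datasets yet.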
@ -75,9 +76,10 @@ height: 300px
name: supervised-example-plot
---
An example from the airfoil case of the previous section: a visualization of a training data
set in terms of mean u and v velocity of 2D flow fields.
It nicely shows that there are no extreme outliers,
but there are a few entries with relatively low mean u velocity on the left side.
A second, smaller test data set is overlaid with red triangles, showing that its samples cover the range of mean motions well.
```
### Where's the magic? 🦄
@ -111,9 +113,11 @@ e.g., by normalization and by focusing on invariants.
To give a more specific example: if you always train
your networks for inputs in the range $[0\dots1]$, don't expect it to work
with inputs of $[27\dots39]$. In certain cases it's valid to normalize
inputs and outputs by subtracting the mean, and normalizing via the standard
deviation or a suitable quantile (make sure this doesn't destroy important
correlations in your data). Looking ahead, a fast solver might be sufficient
to handle the large offset of around $27$, so that the NN can focus on a restricted
input range in terms of a normalized residual.
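A tiny sketch of this normalization, with hypothetical raw values matching the $[27\dots39]$ example:

```python
import numpy as np

# Hypothetical raw inputs in the range [27, 39] from the example above.
x = np.random.uniform(27., 39., size=1000)
x_n = (x - x.mean()) / x.std()   # zero-mean, unit-variance inputs for the NN
```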
As a rule of thumb: make sure you actually train the NN on the
inputs that are as similar as possible to those you want to use at inference time.
@ -124,22 +128,6 @@ it's important to actually include the simulator in the training process. Otherw
the network might specialize on pre-computed data that differs from what is produced
when combining the NN with the solver, i.e., it will suffer from _distribution shift_.
### Meshes and grids
The previous airfoil example used Cartesian grids with standard
convolutions. These typically give the most _bang-for-the-buck_, in terms
of performance and stability. Nonetheless, the whole discussion here of course
also holds for other types of convolutions, e.g., a less regular mesh
in conjunction with graph-convolutions, or particle-based data
with continuous convolutions (cf {doc}`others-lagrangian`). You will typically see reduced learning
performance in exchange for improved sampling flexibility when switching to these.
Finally, a word on fully-connected layers or _MLPs_ in general: we'd recommend
to avoid these as much as possible. For any structured data, like spatial functions,
or _field data_ in general, convolutions are preferable, and less likely to overfit.
E.g., you'll notice that CNNs typically don't need dropout, as they're nicely
regularized by construction. For MLPs, you typically need quite a bit of it to
avoid overfitting.
![Divider](resources/divider2.jpg)
@ -153,12 +141,14 @@ To summarize, supervised training has the following properties.
- Great starting point.
❌ Con:
- Lots of data needed (loading can become a bottleneck).
- Potentially sub-optimal performance in terms of accuracy and generalization.
- Interactions with external "processes" (such as embedding into a solver) are difficult.
The next chapters will explain how to alleviate these shortcomings of supervised training.
First, we'll look at bringing model equations into the picture via soft constraints, and afterwards
we'll revisit the challenges of bringing together numerical simulations and learned approaches.
Finally, we'll extend the basic approach for generative modeling with diffusion models
and flow matching.


@ -1,10 +1,10 @@
Supervised Training
=======================
_Supervised training_ is the central starting point for all projects in the context of deep learning.
While it can yield suboptimal results compared to approaches that more tightly
couple with physics, it can be the only choice in certain application scenarios
where no good model equations exist.
In this chapter, we'll also go over the basics of different neural network _architectures_. Next to training
methodology, this is an important choice.
@ -16,34 +16,34 @@ and directly train an NN to represent an approximation of $f^*$ denoted as $f$.
The $f$ we can obtain in this way is typically not exact,
but instead we obtain it via a minimization problem:
by adjusting the weights $\theta$ of our NN representation of $f$ such that we minimize the error over all data points in the training set
$$
\text{arg min}_{\theta} \sum_i \Big(f(x_i ; \theta)-y^*_i \Big)^2 .
$$ (supervised-training)
This will give us $\theta$ such that $f(x;\theta) = y \approx y^*$ as accurately as possible given
our choice of $f$ and the hyperparameters chosen for training. Note that above we've assumed
the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$ in the loss $L$
to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y^*_i )$. The choice
of a suitable metric is a topic we will get back to later on.
The minimization above constitutes the actual "learning" process, and is non-trivial because
$f$ is usually a non-linear function.
The training data typically needs to be of substantial size, and hence it is attractive
to use numerical simulations solving a physical model $\mathcal{P}$
to produce a large number of reliable input-output pairs for training.
This means that the training process uses a set of model equations, and approximates
them numerically, in order to fit the NN representation $f$. This
has quite a few advantages, e.g., we don't have the measurement noise of real-world devices
and we don't need manual labour to annotate a large number of samples to get training data.
On the other hand, this approach inherits the common challenges of replacing experiments
with simulations: first, we need to ensure the chosen model has enough power to predict the
behavior of the simulated phenomena that we're interested in.
In addition, the numerical approximations have _numerical errors_
which need to be kept small enough for a chosen application (otherwise even the best NN has no chance
to provide a useful answer later on). As these topics are studied in depth
for classical simulations, the existing knowledge can likewise be leveraged to
set up DL training tasks.
@ -52,29 +52,26 @@ set up DL training tasks.
height: 220px
name: supervised-training
---
A visual overview of supervised training. It's simple, and a good starting point
in comparison to the more complex variants we'll encounter later on.
```
## Surrogate models
One of the central advantages of the supervised approach above is that
we obtain a _surrogate model_ (or "emulator", or "neural operator"),
i.e., a new function that mimics the behavior of the original $\mathcal{P}$.
The numerical approximations of PDE models for real world phenomena are often very expensive to compute. A trained
NN on the other hand incurs a constant cost per evaluation, and is typically trivial
to evaluate on specialized hardware such as GPUs or NN compute units.
Despite this, it's important to be careful:
NNs can quickly generate huge numbers of in-between results. Consider a CNN layer with
$128$ features. If we apply it to an input of $128^2$, i.e. ca. 16k cells, we get $128^3$ intermediate values.
That's more than 2 million.
All these values at least need to be momentarily stored in memory, and processed by the next layer.
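A quick back-of-the-envelope check of these numbers (assuming float32 storage):

```python
# Intermediate values of a single conv layer with 128 features on a 128^2 input.
features, h, w = 128, 128, 128
values = features * h * w                 # = 128^3 = 2097152, i.e. > 2 million
print(values, "values,", values * 4 / 2**20, "MiB per layer per sample (fp32)")
```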
Nonetheless, replacing complex and expensive solvers with fast, learned approximations
is a very attractive and interesting direction.
## Show me some code!
An important decision to make at this stage is what neural network architecture to choose.
Let's finally look at a code example that trains a neural network:
we'll replace a full solver for _turbulent flows around airfoils_ with a surrogate model from {cite}`thuerey2020dfp`.