Compare commits

..

68 Commits

Author SHA1 Message Date
N_T 45d2b6529e more phiflow 3.4 updates for HEAT SIP 2025-08-12 10:58:56 +02:00
N_T 1396482270 more phiflow 3.4 updates; warning SoL code not yet working 2025-08-12 09:26:34 +02:00
N_T eda7ba974e phiflow 3.4 updates 2025-08-12 09:23:18 +02:00
N_T a3de575c19 fixed several typos 2025-08-06 15:08:15 +02:00
N_T be1dba99e4 fixed typos 2025-06-13 16:19:45 +02:00
N_T cc2a7ef4ce clarified JVPs 2025-06-03 15:35:29 +02:00
N_T 68bd753ceb clarified FNO scaling 2025-04-27 16:07:38 +02:00
N_T 4919e7a429 Fixed 2nd device bug in teaser example 2025-03-31 16:25:31 +02:00
N_T cf13364482 Merge branch 'main' of github.com:tum-pbs/pbdl-book 2025-03-31 16:24:59 +02:00
N_T 8eb2c3c7f7 Fixed device bug in teaser example 2025-03-31 16:23:15 +02:00
N_T 50044397a4 fixed pm graph equations 2025-03-24 20:51:21 +01:00
N_T d95c94ac58 fixed typo in README title 2025-03-22 21:59:31 +01:00
N_T 3503fc77bf Updated links 2025-03-21 12:28:49 +01:00
N_T 971f397e79 Updated readme 2025-03-21 12:26:57 +01:00
N_T f5e25a9d78 missing file 2025-03-20 20:19:42 +01:00
N_T b7667370d2 updated intro teaser 2025-03-20 15:55:14 +01:00
N_T 39fcd963ab added genAI dividers 2025-03-20 13:56:40 +01:00
N_T ac3586cfc1 added transformer block figure 2025-03-19 21:28:54 +01:00
N_T 8afab892b1 smaller tweaks 2025-03-19 17:01:16 +01:00
N_T a70c03517c updated for graphs Mario 2025-03-19 17:01:00 +01:00
N_T 27b3940d06 fixes Mario (arch) and equation cleanup 2025-03-18 20:33:37 +01:00
N_T 4589cf2860 updates and teaks flow matching and continuity across the whole chapter 2025-03-18 16:20:10 +01:00
N_T 6391dbab10 updated NF 2025-03-17 16:54:53 +01:00
N_T fb5229a105 prob models introduction with code from Benjamin 2025-03-17 15:32:01 +01:00
N_T 3b73717017 updated references 2025-02-20 09:49:35 +08:00
N_T 16f0f351ac cleanup of code examples 2025-02-19 09:53:02 +08:00
N_T c59992f349 updated intro and outlook 2025-02-19 09:52:46 +08:00
N_T 3f8c7bc672 large update of differentiable physics chapter 2025-02-17 14:01:59 +08:00
N_T 3907a75d1a fixed jupyterbook error for indented pip install 2025-02-17 13:53:49 +08:00
N_T deaf4c5066 physgrad sin images as jpgs 2025-02-17 13:49:12 +08:00
N_T 7278a04cf1 fixing PDf output, removing citations in figure captions for now as these are causing problem in the tex output 2025-02-17 09:57:48 +08:00
N_T 16e2c13930 added prob models discussion, phi33 notebook fixes 2025-02-14 12:12:53 +08:00
N_T dacb0d1a2d notebook updates 2025-02-10 14:08:54 +08:00
N_T 4f1763f696 update physloss chapter 2025-02-10 11:35:22 +08:00
N_T 0981a281fe update supervised chapter 2025-02-06 14:05:50 +08:00
N_T 4dd1611430 update of overview chapter 2025-02-05 11:34:07 +08:00
N_T b87621c92e updated intro overview picture 2025-02-03 19:54:44 +08:00
N_T 8f8634119d updated intro discussion 2025-02-03 15:28:55 +08:00
N_T dbd5d53e31 updated intro and logo 2025-01-28 09:38:16 +08:00
N_T 38ca428a8a ellipse code updates, added run-in-colab-links 2025-01-27 15:06:59 +08:00
N_T 458934b3c8 first version of DGN example 2025-01-27 14:12:26 +08:00
N_T 60dd9aa3bc first version of ellipse notebook 2025-01-23 13:56:00 +08:00
N_T 8e4b659a4e first version of graph DMs 2025-01-21 12:27:15 +08:00
N_T 54d0dfc203 files and tweaks for SBI-sim notebook 2025-01-17 17:54:09 +08:00
N_T 4c0fdd8dc0 first version of SBI-sim notebook 2025-01-17 17:53:34 +08:00
N_T 3b53adb75d first version of SMDP integration 2025-01-15 11:26:59 +08:00
N_T d317201c66 added probmodel figures 2025-01-09 15:44:40 +08:00
N_T 084b0e6265 added fno arch figures 2025-01-07 13:53:55 +08:00
N_T 0e2736df52 first round of arch figures 2025-01-07 11:38:03 +08:00
N_T 4febc59084 first draft of arch section 2025-01-03 14:57:53 +08:00
N_T df71d662fa probmodels phys section 2024-12-27 20:04:29 +08:00
N_T 24f80a841f added diff-models graph section, intro probmodels-ddpm-fm 2024-12-27 10:04:12 +08:00
N_T 3e694b217c fixed typos Georg 2024-12-18 12:32:59 +08:00
N_T 5cb92b4943 diffusion time prediction updates 2024-12-13 13:06:49 +08:00
N_T c469ed4e14 updated diffusion time prediction and stability sections 2024-12-13 11:37:28 +08:00
N_T 47a51ba60c added uncond. stability chapter 2024-12-09 16:57:18 +08:00
N_T abb6b46d0f SoL typo 2024-12-09 16:39:09 +08:00
N_T 1049044612 updated SoL code to phiflow3.2 2024-12-09 16:37:50 +08:00
N_T 148118cbe3 improved pip install for probmodels-ddpm-fm 2024-12-09 14:24:48 +08:00
N_T fe1393fcd1 added references, minor typos, TOC, todo: move dppm to ddpm notebook 2024-12-09 10:31:53 +08:00
N_T dad3e8fc8d supervised airfoils with images 2024-11-29 18:09:05 +08:00
N_T 2f9c37141f supervised airfoils fixed typo 2024-11-29 18:02:55 +08:00
N_T 960887d527 updated supervised airfoils notebook 2024-11-29 18:01:49 +08:00
N_T 285bff8b95 fixed pip install 2024-11-05 14:33:18 +08:00
N_T 4595ba208d first version of DDPM to FM notebook 2024-11-05 14:14:16 +08:00
N_T dc9580b092 clarified tensor vs grid differences 2024-11-05 13:56:27 +08:00
N_T 2685e69f7d fixed typos 2024-10-25 14:00:47 +08:00
N_T 9bd9f531ea added HH learning code 2024-10-25 13:40:52 +08:00
104 changed files with 11934 additions and 1106 deletions


@@ -1,4 +1,4 @@
# Welcome to the Physics-based Deep Learning book (PBDL) v0.2
# Welcome to the Physics-based Deep Learning book (PBDL) v0.3
This is the source code repository for the Jupyter book "Physics-based Deep Learning". You can find the full, readable version online at:
[https://physicsbaseddeeplearning.org/](https://physicsbaseddeeplearning.org/)
@@ -9,19 +9,26 @@ A single-PDF version is also available on arXiv: https://arxiv.org/pdf/2109.0523
# A Short Synopsis
The PBDL book contains a practical and comprehensive introduction of everything related to deep learning in the context of physical simulations. As much as possible, all topics come with hands-on code examples in the form of Jupyter notebooks to quickly get started. Beyond standard supervised learning from data, we'll look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, as well as reinforcement learning and uncertainty modeling. We live in exciting times: these methods have a huge potential to fundamentally change what we can achieve with simulations.
The PBDL book contains a hands-on, comprehensive guide to deep learning in the realm of physical simulations. Rather than just theory, we emphasize practical application: every concept is paired with interactive Jupyter notebooks to get you up and running quickly. Beyond traditional supervised learning, we dive into physical loss constraints, differentiable simulations, diffusion-based approaches for probabilistic generative AI, as well as reinforcement learning and advanced neural network architectures. These foundations are paving the way for the next generation of scientific foundation models. We are living in an era of rapid transformation. These methods have the potential to redefine what's possible in computational science.
The key aspects that we will address in the following are:
* explain how to use deep learning techniques to solve PDE problems,
* how to combine them with existing knowledge of physics,
* without discarding our knowledge about numerical methods.
* How to train neural networks to predict the fluid flow around airfoils with diffusion modeling. This gives a probabilistic surrogate model that replaces and outperforms traditional simulators.
* How to use model equations as residuals to train networks that represent solutions, and how to improve upon these residual constraints by using differentiable simulations.
* How to more tightly interact with a full simulator for inverse problems. E.g., we'll demonstrate how to circumvent the convergence problems of standard reinforcement learning techniques by leveraging simulators in the training loop.
* We'll also discuss the importance of choosing the right network architecture: whether to consider global or local interactions, continuous or discrete representations, and structured versus unstructured graph meshes.
The focus of this book lies on:
* Field-based simulations (not much on Lagrangian methods)
* Combinations with deep learning (plenty of other interesting ML techniques exist, but won't be discussed here)
* Experiments are left as an outlook (such as replacing synthetic data with real-world observations)
* how to use deep learning techniques to solve PDE problems,
* how to combine them with existing knowledge of physics,
* without discarding numerical methods.
At the same time, it's worth noting what we won't be covering:
* There's no in-depth introduction to deep learning and numerical simulations,
* nor is the aim a broad survey of research articles in this area.
The name of this book, _Physics-based Deep Learning_, denotes combinations of physical modeling and numerical simulations with methods based on artificial neural networks. The general direction of Physics-Based Deep Learning represents a very active, quickly growing and exciting field of research.
@@ -29,24 +36,27 @@ The aim is to build on all the powerful numerical techniques that we have at our
The resulting methods have a huge potential to improve what can be done with numerical methods: in scenarios where a solver targets cases from a certain well-defined problem domain repeatedly, it can for instance make a lot of sense to invest significant resources once to train a neural network that supports the repeated solves. Based on the domain-specific specialization of this network, such a hybrid could vastly outperform traditional, generic solvers.
![Divider](resources/divider-gen2.jpg)
# What's new?
* For readers familiar with v0.1 of this text, the [extended section on differentiable physics training](http://physicsbaseddeeplearning.org/diffphys-examples.html) and the
brand new chapter on [improved learning methods for physics problems](http://physicsbaseddeeplearning.org/diffphys-examples.html) are highly recommended starting points.
What's new in v0.3? This latest edition takes things even further with a major new chapter on generative modeling, covering cutting-edge techniques like denoising, flow-matching, autoregressive learning, physics-integrated constraints, and diffusion-based graph networks. We've also introduced a dedicated section on neural architectures specifically designed for physics simulations. All code examples have been updated to leverage the latest frameworks.
# Teasers
To mention a few highlights: the book contains a notebook to train hybrid fluid flow (Navier-Stokes) solvers via differentiable physics to reduce numerical errors. Try it out:
To mention a few highlights: the book contains a notebook to train hybrid fluid flow (Navier-Stokes) solvers via differentiable physics to reduce numerical errors. Try it out in Colab:
https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/diffphys-code-sol.ipynb
In v0.2 there's a new notebook for an improved learning scheme which jointly computes update directions for neural networks and physics (via half-inverse gradients):
https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/physgrad-hig-code.ipynb
PBDL also has example code to train diffusion denoising and flow matching networks for RANS flow predictions around airfoils that yield uncertainty estimates. You can run the code right away here:
https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/probmodels-ddpm-fm.ipynb
It also has example code to train a Bayesian Neural Network for RANS flow predictions around airfoils that yield uncertainty estimates. You can run the code right away here:
https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/bayesian-code.ipynb
There's a notebook for an improved learning scheme which jointly computes update directions for neural networks and physics (via half-inverse gradients):
https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/physgrad-hig-code.ipynb
And a notebook to compare proximal policy-based reinforcement learning with physics-based learning for controlling PDEs (spoiler: the physics-aware version does better in the end). Give it a try:
https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/reinflearn-code.ipynb
![Divider](resources/divider-gen4.jpg)


@@ -2,9 +2,9 @@
# Learn more at https://jupyterbook.org/customize/config.html
title: Physics-based Deep Learning
author: N. Thuerey, P. Holl, M. Mueller, P. Schnell, F. Trost, K. Um
author: N. Thuerey, B. Holzschuh, P. Holl, G. Kohl, M. Lino, Q. Liu, P. Schnell, F. Trost
logo: resources/logo.jpg
copyright: "2021,2022"
copyright: "2021 - 2025"
only_build_toc_files: true
launch_buttons:
@@ -34,3 +34,12 @@ html:
use_issues_button: true
use_repository_button: true
favicon: "favicon.ico"
# for $$ equations in text
parse:
myst_dmath_double_inline: true
sphinx:
extra_extensions:
- sphinx_proof


@@ -10,13 +10,16 @@ parts:
- file: overview-burgers-forw.ipynb
- file: overview-ns-forw.ipynb
- file: overview-optconv.md
- caption: Neural Surrogates and Operators
chapters:
- file: supervised.md
sections:
- file: supervised-airfoils.ipynb
- file: supervised-discuss.md
- file: supervised-arch.md
- file: supervised-airfoils.ipynb
- file: supervised-discuss.md
- caption: Physical Losses
chapters:
- file: physicalloss.md
- file: physicalloss-div.ipynb
- file: physicalloss-code.ipynb
- file: physicalloss-discuss.md
- caption: Differentiable Physics
@@ -25,12 +28,25 @@ parts:
- file: diffphys-code-burgers.ipynb
- file: diffphys-dpvspinn.md
- file: diffphys-code-ns.ipynb
- caption: Differentiable Physics with NNs
chapters:
- file: diffphys-examples.md
- file: diffphys-code-sol.ipynb
- file: diffphys-code-control.ipynb
- file: diffphys-discuss.md
- caption: Probabilistic Learning
chapters:
- file: probmodels-intro.md
- file: probmodels-normflow.ipynb
- file: probmodels-score.ipynb
- file: probmodels-diffusion.ipynb
- file: probmodels-flowmatching.ipynb
- file: probmodels-ddpm-fm.ipynb
- file: probmodels-phys.md
- file: probmodels-sbisim.ipynb
- file: probmodels-time.ipynb
- file: probmodels-uncond.md
- file: probmodels-graph.md
- file: probmodels-graph-ellipse.ipynb
- file: probmodels-discuss.md
- caption: Reinforcement Learning
chapters:
- file: reinflearn-intro.md
@@ -44,16 +60,12 @@ parts:
- file: physgrad-hig.md
- file: physgrad-hig-code.ipynb
- file: physgrad-discuss.md
- caption: PBDL and Uncertainty
chapters:
- file: bayesian-intro.md
- file: bayesian-code.ipynb
- caption: Fast Forward Topics
chapters:
- file: others-intro.md
- file: others-timeseries.md
- file: others-GANs.md
- file: others-lagrangian.md
- file: others-GANs.md
- caption: End Matter
chapters:
- file: outlook.md


@@ -32,7 +32,7 @@
},
"outputs": [],
"source": [
"!pip install --upgrade --quiet phiflow==3.1\n",
"!pip install --upgrade --quiet phiflow==3.4\n",
"from phi.tf.flow import *\n",
"\n",
"N = 128\n",
@@ -325,7 +325,7 @@
"Optimization step 35, loss: 0.008185\n",
"Optimization step 40, loss: 0.005186\n",
"Optimization step 45, loss: 0.003263\n",
"Runtime 130.33s\n"
"Runtime 132.33s\n"
]
}
],
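The hunk above pins the notebook to phiflow 3.4. A small hedged helper (not part of the book's code) for checking an installed version string against such a major.minor pin before running a notebook; the `phi.__version__` attribute in the comment is an assumption about the package:

```python
def version_ok(installed: str, required: str = "3.4") -> bool:
    # True if an installed version string matches the required
    # major.minor pin, e.g. "3.4.1" matches "3.4" but "3.1" does not.
    return installed.split(".")[: len(required.split("."))] == required.split(".")

# Hypothetical usage inside a notebook (assumes phi exposes __version__):
# import phi; assert version_ok(phi.__version__), "run: pip install phiflow==3.4"
```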

File diff suppressed because one or more lines are too long (3 files)


@@ -1,4 +1,4 @@
Discussion
Discussion of Differentiable Physics
=======================
The previous sections have explained the _differentiable physics_ approach for deep learning, and have given a range of examples: from a very basic gradient calculation, all the way to complex learning setups powered by advanced simulations. This is a good time to take a step back and evaluate: in the end, the differentiable physics components of these approaches are not too complicated. They are largely based on existing numerical methods, with a focus on efficiently using those methods not only to do a forward simulation, but also to compute gradient information.
@@ -11,13 +11,19 @@ What is primarily exciting in this context are the implications that arise from
Most importantly, training via differentiable physics allows us to seamlessly bring the two fields together:
we can obtain _hybrid_ methods, that use the best numerical methods that we have at our disposal for the simulation itself, as well as for the training process. We can then use the trained model to improve forward or backward solves. Thus, in the end, we have a solver that combines a _traditional_ solver and a _learned_ component that in combination can improve the capabilities of numerical methods.
## Interaction
## Reducing data shift via interaction
One key aspect that is important for these hybrids to work well is to let the NN _interact_ with the PDE solver at training time. Differentiable simulations allow a trained model to "explore and experience" the physical environment, and receive directed feedback regarding its interactions throughout the solver iterations. This combination nicely fits into the broader context of machine learning as _differentiable programming_.
One key aspect that is important for these hybrids to work well is to let the NN _interact_ with the PDE solver at training time. Differentiable simulations allow a trained model to "explore and experience" the physical environment, and receive directed feedback regarding its interactions throughout the solver iterations.
This addresses the classic **data shift** problem of machine learning: rather than relying on an _a-priori_ specified distribution for training the network, the training process generates new trajectories via unrolling on the fly, and computes training signals from them. This can be seen as an _a-posteriori_ approach, and makes the trained NN significantly more resilient to unseen inputs. As we'll evaluate in more detail in {doc}`probmodels-uncond`, it's actually hard to beat a good unrolling setup with other approaches.
Note that the topic of _differentiable physics_ nicely fits into the broader context of machine learning as _differentiable programming_.
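As a toy illustration of training via unrolling, consider the following sketch. Everything here is hypothetical and chosen for brevity: a scalar decay "solver", a single-parameter linear stand-in for the NN, and finite-difference gradients standing in for backpropagation through the solver chain:

```python
import numpy as np

def solver_step(x, dt=0.1):
    # Coarse, imperfect PDE step: discretized exponential decay.
    return x * (1.0 - 0.9 * dt)

def true_step(x, dt=0.1):
    # Reference solution the learned correction should recover.
    return x * np.exp(-dt)

def rollout_loss(theta, x0=1.0, steps=20):
    # Unroll solver + correction and accumulate the loss along the
    # trajectory -- training data is generated on the fly, not a-priori.
    x = x_ref = x0
    loss = 0.0
    for _ in range(steps):
        x = solver_step(x) + theta * x   # NN stand-in: linear correction
        x_ref = true_step(x_ref)
        loss += (x - x_ref) ** 2
    return loss

# "Training": gradient descent on the unrolled loss, with finite
# differences replacing autodiff for this self-contained example.
theta, eps, lr = 0.0, 1e-6, 1e-3
for _ in range(200):
    grad = (rollout_loss(theta + eps) - rollout_loss(theta - eps)) / (2 * eps)
    theta -= lr * grad

print(rollout_loss(theta) < rollout_loss(0.0))  # trained correction beats the raw solver
```

In a real setup the correction would be a network and the gradients would flow through a differentiable solver such as phiflow; the structure of the loop is the same.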
## Generalization
The hybrid approach also bears particular promise for simulators: it improves generalizing capabilities of the trained models by letting the PDE-solver handle large-scale _changes to the data distribution_ such that the learned model can focus on localized structures not captured by the discretization. While physical models generalize very well, learned models often specialize in data distributions seen at training time. This was, e.g., shown for the models reducing numerical errors of the previous chapter: the trained models can deal with solution manifolds with significant amounts of varying physical behavior, while simpler training variants quickly deteriorate over the course of recurrent time steps.
The hybrid approach also bears particular promise for simulators: it improves generalizing capabilities of the trained models by letting the PDE-solver handle large-scale _changes to the data distribution_. This allows the learned model to focus on localized structures not captured by the discretization. While physical models generalize very well, learned models often specialize in data distributions seen at training time. Hence, this aspect benefits from the previous reduction of data shift, and effectively allows for even larger differences in terms of input distribution. If the NN is set up correctly, these can be handled by the classical solver in a hybrid approach.
These benefits were, e.g., shown for the models reducing numerical errors of {doc}`diffphys-code-sol`: the trained models can deal with solution manifolds with significant amounts of varying physical behavior, while simpler training variants would deteriorate over the course of recurrent time steps.
![Divider](resources/divider5.jpg)
@@ -28,16 +34,17 @@ To summarize, the pros and cons of training NNs via DP:
- Uses physical model and numerical methods for discretization.
- Efficiency and accuracy of selected methods carries over to training.
- Very tight coupling of physical models and NNs possible.
- Improved generalization via solver interactions.
- Improved resilience and generalization.
❌ Con:
- Not compatible with all simulators (need to provide gradients).
- Requires more heavy machinery (in terms of framework support) than previously discussed methods.
_Outlook_: the last negative point (regarding heavy machinery) is bound to strongly improve given the current pace of software and API developments in the DL area. However, for now it's important to keep in mind that not every simulator is suitable for DP training out of the box. Hence, in this book we'll focus on examples using phiflow, which was designed for interfacing with deep learning frameworks.
_Outlook_: the last negative point (regarding heavy machinery) is strongly improving at the moment. Many existing simulators, e.g. the popular open source framework _OpenFOAM_, as well as many commercial simulators are working on tight integrations with NNs. However, there's still plenty of room for improvement, and in this book we're focusing on examples using phiflow, which was designed for interfacing with deep learning frameworks from the ground up.
The training via differentiable physics (DP) allows us to integrate full numerical simulations into the training of deep neural networks.
It is also a very generic approach that is applicable to a wide range of combinations of PDE-based models and deep learning.
The training via differentiable physics (DP) allows us to integrate full numerical simulations into the training of deep neural networks.
This effectively provides **hard constraints**, as the coupled solver can project and enforce constraints just like classical solvers would.
It is a very generic approach that is applicable to a wide range of combinations of PDE-based models and deep learning.
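A minimal sketch of what such a hard constraint looks like in practice; the additive "mass" projection here is a hypothetical example for illustration, not the book's solver:

```python
import numpy as np

def project_mass(x, total):
    # Shift the state so its sum (the discrete "mass") is exactly `total`,
    # re-enforcing the conservation constraint after an NN correction,
    # just as a classical solver's projection step would.
    return x + (total - x.sum()) / x.size

state = np.array([0.2, 0.5, 0.1, 0.4])        # discrete mass 1.2
state = state + 0.01 * np.random.randn(4)     # NN update may violate it
state = project_mass(state, total=1.2)        # hard constraint restored
```

More realistic instances include divergence-free projections via a pressure solve; the principle of projecting back onto the constraint manifold after each learned update is the same.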
In the next chapters, we will first compare DP training to model-free alternatives for control problems, and afterwards target the underlying learning process to obtain even better NN states.
In the next chapters, we will first expand the scope of the learning tasks to incorporate uncertainties, i.e. to work with full distributions rather than single deterministic states and trajectories. Afterwards, we'll also compare DP training to reinforcement learning, and target the underlying learning process to obtain even better NN states.


@@ -16,13 +16,13 @@ The DP version on the other hand inherently relies on a numerical solver that is
The reliance on a suitable discretization requires some understanding and knowledge of the problem under consideration. A sub-optimal discretization can impede the learning process or, worst case, lead to diverging training runs. However, given the large body of theory and practical realizations of stable solvers for a wide variety of physical problems, this is typically not an insurmountable obstacle.
The PINN approaches on the other hand do not require an a-priori choice of a discretization, and as such seems to be "discretization-less". This, however, is only an advantage on first sight. As they yield solutions in a computer, they naturally _have_ to discretize the problem. They construct this discretization over the course of the training process, in a way that lies at the mercy of the underlying nonlinear optimization, and is not easily controllable from the outside. Thus, the resulting accuracy is determined by how well the training manages to estimate the complexity of the problem for realistic use cases, and how well the training data approximates the unknown regions of the solution.
The PINN approaches on the other hand do not require an a-priori choice of a discretization, and as such seem to be "discretization-less". This, however, is only an advantage at first sight. By now, researchers are trying to "re-integrate" discretizations into PINN training. Generally, PINNs inevitably yield solutions in a computer and thus _have_ to discretize the problem. They construct this discretization over the course of the training process, in a way that lies at the mercy of the underlying nonlinear optimization, and is not easily controllable from the outside. Thus, the resulting accuracy is determined by how well the training manages to estimate the complexity of the problem for realistic use cases, and how well the training data approximates the unknown regions of the solution.
E.g., as demonstrated with the Burgers example, the PINN solutions typically have significant difficulties propagating information _backward_ in time. This is closely coupled to the efficiency of the method.
## Efficiency
The PINN approaches typically perform a localized sampling and correction of the solutions, which means the corrections in the form of weight updates are likewise typically local. The fulfillment of boundary conditions in space and time can be correspondingly slow, leading to long training runs in practice.
The PINN approach also results in fundamentally more difficult training tasks that cause convergence problems. PINNs typically perform a localized sampling and correction of the solutions, which means the corrections in the form of weight updates are likewise typically local. The fulfillment of boundary conditions in space and time can be correspondingly slow, leading to long training runs in practice.
A well-chosen discretization of a DP approach can remedy this behavior, and provide an improved flow of gradient information. At the same time, the reliance on a computational grid means that solutions can be obtained very quickly. Given an interpolation scheme or a set of basis functions, the solution can be sampled at any point in space or time given a very local neighborhood of the computational grid. Worst case, this can lead to slight memory overheads, e.g., by repeatedly storing mostly constant values of a solution.
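For instance, with a 1D grid and linear basis functions, sampling the solution at an arbitrary point only touches the two neighboring grid values. A generic sketch, not tied to any particular solver:

```python
import numpy as np

def sample(u, x):
    # Sample a grid solution u, stored on equidistant nodes in [0, 1],
    # at an arbitrary location x via linear interpolation; only the two
    # neighboring grid values contribute.
    nodes = np.linspace(0.0, 1.0, len(u))
    return np.interp(x, nodes, u)

u = np.linspace(0.0, 1.0, 11) ** 2   # u(x) = x^2 sampled on 11 grid nodes
print(sample(u, 0.5))                # exact at a grid node: 0.25
```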
@@ -54,4 +54,4 @@ The following table summarizes these pros and cons of physics-informed (PI) and
As a summary, both methods are definitely interesting, and have a lot of potential. There are numerous more complicated extensions and algorithmic modifications that change and improve on the various negative aspects we have discussed for both sides.
However, as of this writing, the physics-informed (PI) approach has clear limitations when it comes to performance and compatibility with existing numerical methods. Thus, when knowledge of the problem at hand is available, which typically is the case when we choose a suitable PDE model to constrain the learning process, employing a differentiable physics solver can significantly improve the training process as well as the quality of the obtained solution. So, in the following we'll focus on DP variants, and illustrate their capabilities with more complex scenarios in the next chapters. First, we'll consider a case that very efficiently computes space-time gradients for a transient fluid simulations.
However, as of this writing, the PINN approach has clear limitations when it comes to performance and compatibility with existing numerical methods. Thus, when knowledge of the problem at hand is available, which typically is the case when we choose a suitable PDE model to constrain the learning process, employing a differentiable physics solver to train neural operators can significantly improve the training process as well as the quality of the obtained solution. So, in the following we'll focus on DP variants, and illustrate their capabilities with more complex scenarios in the next chapters. First, we'll consider a case that very efficiently computes space-time gradients for transient fluid simulations.


@@ -7,7 +7,26 @@ When using DP approaches for learning applications,
there is a lot of flexibility w.r.t. the combination of DP and NN building blocks.
As some of the differences are subtle, the following section will go into more detail.
We'll especially focus on solvers that repeat the PDE and NN evaluations multiple times,
e.g., to compute multiple states of the physical system over time.
e.g., to compute multiple states of the physical system over time. In classical numerics,
this would be called an iterative time stepping method, while in the context of AI, it's
an _autoregressive_ method.
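Schematically, with a trivial stand-in for the combined solver+network update, an autoregressive rollout just feeds each output back in as the next input:

```python
def step(x):
    # Stand-in for one combined solver + network update.
    return 0.95 * x

def rollout(x0, n_steps):
    # Autoregressive evaluation: the model's output becomes its next
    # input, exactly like iterative time stepping in classical numerics.
    states = [x0]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

trajectory = rollout(1.0, 3)   # four states: the initial one plus 3 steps
```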
```{admonition} Hint: Correction vs Prediction
:class: tip
The problems that are best tackled with DP approaches are very fundamental. The combination of
an imperfect physical model and an _improvement term_ classically goes under many different names:
_closure problems_ in fluid dynamics and turbulence, _homogenization_ or _coarse-graining_
in material science, while it's called _parametrization_ in climate and weather.
In the following, we'll generically denote all these tasks containing NN+solver as **correction** task, in contrast
to pure **prediction** tasks for cases where no solver is involved at inference time.
```
To re-cap, here's the previous figure about combining NNs and DP operators.
In the figure these operators look like a loss term: they typically don't have weights,
@@ -22,7 +41,11 @@ The DP approach as described in the previous chapters. A network produces an inp
```
This setup can be seen as the network receiving information about how its output influences the outcome of the PDE solver. I.e., the gradient will provide information on how to produce an NN output that minimizes the loss.
Similar to the previously described _physical losses_ (from {doc}`physicalloss`), this can mean upholding a conservation law.
Similar to the previously described {doc}`physicalloss`, this can, e.g., mean upholding a conservation law or generally a PDE-based constraint over time.
## Switching the order
@@ -36,15 +59,15 @@ name: diffphys-switch
A PDE solver produces an output which is processed by an NN.
```
In this case the PDE solver essentially represents an _on-the-fly_ data generator. That's not necessarily always useful: this setup could be replaced by a pre-computation of the same inputs, as the PDE solver is not influenced by the NN. Hence, there's no backpropagation through $\mathcal P$, and it could be replaced by a simple "loading" function. On the other hand, evaluating the PDE solver at training time with a randomized sampling of input parameters can lead to an excellent sampling of the data distribution of the input. If we have realistic ranges for how the inputs vary, this can improve the NN training. If implemented correctly, the solver can also alleviate the need to store and load large amounts of data, and instead produce them more quickly at training time, e.g., directly on a GPU.
In this case the PDE solver essentially represents an _on-the-fly_ data generator. That's not necessarily always useful: this setup could be replaced by a pre-computation of the same inputs, as the PDE solver is not influenced by the NN. Hence, there's no backpropagation through $\mathcal P$, and it could be replaced by a simple "loading" function. On the other hand, evaluating the PDE solver at training time with a randomized sampling of input parameters can lead to an excellent sampling of the data distribution of the input. If we have realistic ranges for how the inputs vary, this can improve the NN training. If implemented correctly, the solver can also alleviate the need to store and load large amounts of data, and instead produce them more quickly at training time, e.g., directly on a GPU. Recent methods explore this direction in the context of _Active Learning_.
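A sketch of this "solver as data generator" pattern; the solver, its parameter range, and the batch layout are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def pde_solve(nu, n=64):
    # Hypothetical cheap solver: a decaying sine profile with "viscosity" nu.
    x = np.linspace(0.0, 1.0, n)
    return np.exp(-nu) * np.sin(np.pi * x)

def training_batches(n_batches, batch_size=4):
    # On-the-fly generation: sample solver inputs from a realistic range
    # instead of loading a fixed, pre-computed dataset.
    for _ in range(n_batches):
        nus = rng.uniform(0.01, 1.0, size=batch_size)
        yield nus, np.stack([pde_solve(nu) for nu in nus])

for params, solutions in training_batches(2):
    # here, (params, solutions) would be fed to the network's optimizer
    assert solutions.shape == (4, 64)
```

With a fast solver this replaces disk I/O entirely, and each epoch sees freshly sampled data.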
However, this version does not leverage the gradient information from a differentiable solver, which is why the following variant is much more interesting.
However, this version does not leverage the gradient information from a differentiable solver, which is why the following variant is more interesting.
## Recurrent evaluation
A combination that makes particular sense is to **unroll** the iterations of a time stepping process of a simulator, and let the state of a system be influenced by an NN. (In general, no combination of NN layers and DP operators is _forbidden_, as long as their dimensions are compatible.)
In the case of unrolling, we compute a (potentially very long) sequence of PDE solver steps in the forward pass. In-between these solver steps, an NN modifies the state of our system, which is then used to compute the next PDE solver step. During the backpropagation pass, we move backwards through all of these steps to evaluate contributions to the loss function (which can be evaluated in one or more places anywhere in the execution chain), and to backprop the gradient information through the DP and NN operators. This unrolling of solver iterations essentially gives feedback to the NN about how its "actions" influence the state of the physical system and the resulting loss. Here's a visual overview of this form of combination:
```{figure} resources/diffphys-multistep.jpg
---
name: diffphys-mulitstep
---
Time stepping with interleaved DP and NN operations for $k$ solver iterations. The dashed gray arrows indicate optional intermediate evaluations of loss terms (similar to the solid gray arrow for the last step $k$), and intermediate outputs of the NN are indicated with a tilde.
```
Due to the iterative nature of this process, errors start out very small, and then (for modes with eigenvalues larger than one in the Jacobian) grow exponentially over the course of the iterations. Hence they are extremely difficult to detect in a single evaluation, e.g., with a simpler supervised training setup. Rather, it is crucial to provide feedback to the NN at training time about how the errors evolve over the course of the iterations. Additionally, a pre-computation of the states is not possible for such iterative cases, as the iterations depend on the state of the NN. Naturally, the NN state is unknown before training time and changes while being trained. This is the classic ML problem of **data shift**. Hence, a DP-based training is crucial in these recurrent settings to provide the NN with gradients about how its current state influences the solver iterations, and correspondingly, how the weights should be changed to better achieve the learning objectives.
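This exponential error growth can be illustrated with a toy iterated map; a minimal numpy sketch, where the matrix `A` is a made-up stand-in for the Jacobian of one solver step with one eigenvalue slightly above one:

```python
import numpy as np

# Toy iterated map standing in for an unrolled simulator: A is a made-up
# Jacobian with eigenvalues 1.05 and 0.9.
A = np.array([[1.05, 0.1],
              [0.0,  0.9]])

u_ref = np.array([1.0, 1.0])           # reference trajectory
u_per = u_ref + np.array([1e-6, 0.0])  # perturbed trajectory, tiny initial error

errors = []
for _ in range(100):
    u_ref = A @ u_ref
    u_per = A @ u_per
    errors.append(np.linalg.norm(u_per - u_ref))

# The per-step growth of the error is tiny, but over 100 iterations it has
# grown by roughly 1.05**99, i.e., about two orders of magnitude.
```

A single step looks almost exact here, which is exactly why such errors are invisible to one-step supervised training and only show up over the unrolled chain.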
DP setups with many time steps can be difficult to train: the gradients need to backpropagate through the full chain of PDE solver evaluations and NN evaluations. Typically, each of them represents a non-linear and complex function. Hence for larger numbers of steps, the vanishing and exploding gradient problem can make training difficult. Some practical considerations for alleviating this will follow in {doc}`diffphys-code-sol`.
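To make the unrolled setup concrete, here is a minimal, hypothetical PyTorch sketch: `solver_step` stands in for one differentiable PDE solver step (a simple explicit diffusion update on a periodic 1D grid), and the 8-step unrollment and the toy objective of driving the state to zero are made up purely for illustration:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for one differentiable PDE solver step: an explicit
# diffusion update on a periodic 1D grid. Any differentiable solver works here.
def solver_step(u, nu=0.1):
    return u + nu * (torch.roll(u, 1) - 2 * u + torch.roll(u, -1))

# Small NN that proposes a per-cell correction of the state.
net = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

u0 = torch.rand(32)
target = torch.zeros(32)          # toy objective: drive the final state to zero
losses = []
for it in range(100):
    u = u0
    for k in range(8):            # unrolled chain: NN -> solver -> NN -> ...
        u = u + 0.1 * net(u)      # the NN modifies the state ...
        u = solver_step(u)        # ... then the solver advances it
    loss = ((u - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()               # backprop through all 8 solver and NN steps
    opt.step()
    losses.append(float(loss))
```

The single `backward()` call differentiates through all interleaved solver and network evaluations, which is precisely what gives the NN feedback about the long-term effect of its corrections.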
for training setups that tend to overfit. However, if possible, it is preferable
actual solver in the training loop via a DP approach to give the network feedback about the time
evolution of the system.
With the current state of affairs, generative modeling approaches (denoising diffusion or flow matching)
provide a better-founded approach for incorporating noise. We'll look into this topic in more detail in {doc}`probmodels-uncond`.
---
## Complex examples


methods and physical simulations we will target incorporating _differentiable
numerical simulations_ into the learning process. In the following, we'll shorten
these "differentiable numerical simulations of physical systems" to just "differentiable physics" (DP).
The central goal of these methods is to use existing numerical solvers
to empower and improve AI systems.
This requires equipping
them with functionality to compute gradients with respect to their inputs.
Once this is realized for all operators of a simulation, we can leverage
the autodiff functionality of DL frameworks with backpropagation to let gradient
information flow from a simulator into an NN and vice versa. This has numerous
advantages such as improved learning feedback and generalization, as we'll outline below.
In contrast to the physics-informed loss functions of the previous chapter,
it also enables handling more complex
solution manifolds instead of single inverse problems.
E.g., instead of using deep learning
to solve single inverse problems as in the previous chapter,
provide directions in the form of gradients to steer the learning process.
## Differentiable operators
With DP we build on _existing_ numerical solvers. I.e.,
the approach is strongly relying on the algorithms developed in the larger field
of computational methods for a vast range of physical effects in our world.
To start with, we need a continuous formulation as model for the physical effect that we'd like
to simulate -- if this is missing we're in trouble. But luckily, we can
tap into existing collections of model equations and established methods
for discretizing continuous models.
Let's assume we have a continuous formulation $\mathcal P^*(\mathbf{u}, \nu)$ of the physical quantity of
interest $\mathbf{u}(\mathbf{x}, t): \mathbb R^d \times \mathbb R^+ \rightarrow \mathbb R^d$,
with model parameters $\nu$ (e.g., diffusion, viscosity, or conductivity constants).
The components of $\mathbf{u}$ will be denoted by a numbered subscript, i.e.,
$\mathbf{u} = (u_1,u_2,\dots,u_d)^T$.
%and a corresponding discrete version that describes the evolution of this quantity over time: $\mathbf{u}_t = \mathcal P(\mathbf{x}, \mathbf{u}, t)$.
Typically, we are interested in the temporal evolution of such a system.
Discretization yields a formulation $\mathcal P(\mathbf{u}, \nu)$
that we re-arrange to compute a future state after a time step $\Delta t$.
The state at $t+\Delta t$ is computed via a sequence of
operations $\mathcal P_1, \mathcal P_2 \dots \mathcal P_m$ such that
$\partial \mathcal P_i / \partial \mathbf{u}$.
```
Note that we typically don't need derivatives
for all parameters of $\mathcal P(\mathbf{u}, \nu)$, e.g.,
we omit $\nu$ in the following, assuming that this is a
given model parameter with which the NN should not interact.
Naturally, it can vary within the solution manifold that we're interested in,
E.g., for two of them
$$
\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} } \Big|_{\mathbf{u}^n}
=
\frac{ \partial \mathcal P_1 }{ \partial \mathcal P_2 } \big|_{\mathcal P_2(\mathbf{u}^n)}
\
\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} } \big|_{\mathbf{u}^n} \ ,
$$
one by one.
For the details of forward and reverse mode differentiation, please check out external materials such
as this [nice survey by Baydin et al.](https://arxiv.org/pdf/1502.05767.pdf).
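The chain rule for two composed operators can also be checked numerically with the autodiff machinery of a DL framework; a small PyTorch sketch, where `P1` and `P2` are arbitrary differentiable stand-ins for solver operators:

```python
import torch

# Two stand-in solver operators P1 and P2 (arbitrary differentiable maps).
def P2(u):
    return torch.sin(u) + 0.5 * u

def P1(u):
    return u ** 2 - u

u_n = torch.tensor([0.3, -1.2, 2.0])

# Jacobian of the composition P1(P2(u)), evaluated at u^n ...
J_comp = torch.autograd.functional.jacobian(lambda v: P1(P2(v)), u_n)

# ... equals the product of the individual Jacobians, with P1's Jacobian
# evaluated at P2(u^n), exactly as in the chain rule above.
J1 = torch.autograd.functional.jacobian(P1, P2(u_n))
J2 = torch.autograd.functional.jacobian(P2, u_n)
assert torch.allclose(J_comp, J1 @ J2)
```

In practice, reverse-mode backpropagation never forms these Jacobians explicitly; it only evaluates the corresponding vector-Jacobian products, which is what makes long chains of operators tractable.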
## Learning via DP operators
Thus, once the operators of our simulator support computations of the Jacobian-vector
Informally, we'd like to find a flow that deforms $d^{~0}$ through the PDE model
The simplest way to express this goal is via an $L^2$ loss between the two states. So we want
to minimize the loss function $L=|d(t^e) - d^{\text{target}}|^2$.
Note that as described here, this inverse problem is a pure optimization task: there's no NN involved,
and our goal is to obtain $\mathbf{u}$. We do not want to apply this velocity to other, unseen _test data_,
as would be custom in a real learning task.
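As a minimal sketch of such a pure optimization (no NN involved; `simulate` below is a hypothetical, differentiable stand-in rather than a real advection solver), gradient descent on $\mathbf{u}$ could look like:

```python
import torch

# Hypothetical differentiable "solver": the control u forces an initial state
# d0, which then diffuses for a few steps. This is a stand-in, not a real
# advection scheme, just enough to make the inverse problem non-trivial.
def simulate(d, u, steps=4):
    for _ in range(steps):
        d = d + u                                              # forcing by u
        d = d + 0.2 * (torch.roll(d, 1) - 2 * d + torch.roll(d, -1))
    return d

d0 = torch.zeros(16)
x = torch.arange(16, dtype=torch.float32)
d_target = torch.sin(2 * torch.pi * x / 16)                    # desired end state

u = torch.zeros(16, requires_grad=True)   # the unknown control, no NN anywhere
opt = torch.optim.SGD([u], lr=0.05)
losses = []
for it in range(200):
    loss = ((simulate(d0, u) - d_target) ** 2).sum()  # L = |d(t^e)-d^target|^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(float(loss))
```

After the loop, pushing `u` through `simulate` approximately reproduces `d_target`; swapping the stand-in for a real differentiable solver changes nothing structurally.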



name: pbdl-logo-large
---
```
Welcome to the _Physics-based Deep Learning Book_ (v0.3, the _GenAI_ edition) 👋
**TL;DR**:
This document is a hands-on, comprehensive guide to deep learning in the realm of physical simulations. Rather than just theory, we emphasize practical application: every concept is paired with interactive Jupyter notebooks to get you up and running quickly. Beyond traditional supervised learning, we dive into physical _loss-constraints_, _differentiable_ simulations, _diffusion-based_ approaches for _probabilistic generative AI_, as well as reinforcement learning and advanced neural network architectures. These foundations are paving the way for the next generation of scientific _foundation models_.
We are living in an era of rapid transformation. These methods have the potential to redefine what's possible in computational science.
```{note}
_What's new in v0.3?_
This latest edition adds a major new chapter on generative modeling, covering powerful techniques like denoising, flow-matching, autoregressive learning, physics-integrated constraints, and diffusion-based graph networks. We've also introduced a dedicated section on neural architectures specifically designed for physics simulations. All code examples have been updated to leverage the latest frameworks.
```
---
As a _sneak preview_, the next chapters will show:
- How to train neural networks to [predict the fluid flow around airfoils with diffusion modeling](probmodels-ddpm-fm). This gives a probabilistic _surrogate model_ that replaces and outperforms traditional simulators.
- How to use model equations as residuals to train networks that [represent solutions](diffphys-dpvspinn), and how to improve upon these residual constraints by using [differentiable simulations](diffphys-code-sol).
- How to more tightly interact with a full simulator for [inverse problems](diffphys-code-control). E.g., we'll demonstrate how to circumvent the convergence problems of standard reinforcement learning techniques by leveraging [simulators in the training loop](reinflearn-code).
- We'll also discuss the importance of [choosing the right network architecture](supervised-arch): whether to consider global or local interactions, continuous or discrete representations, and structured versus unstructured graph meshes.
Throughout this text,
we will introduce different approaches for introducing physical models
Some visual examples of numerically simulated time sequences.
## Thanks!
This project would not have been possible without the help of the many people who contributed to it. A big thanks to everyone 🙏 Here's an alphabetical list:
- [Benjamin Holzschuh](https://ge.in.tum.de/about/)
- [Philipp Holl](https://ge.in.tum.de/about/philipp-holl/)
- [Georg Kohl](https://ge.in.tum.de/about/georg-kohl/)
- [Mario Lino](https://ge.in.tum.de/about/mario-lino/)
- [Qiang Liu](https://ge.in.tum.de/about/qiang-liu/)
- [Patrick Schnell](https://ge.in.tum.de/about/patrick-schnell/)
- [Felix Trost](https://ge.in.tum.de/about/)
- [Nils Thuerey](https://ge.in.tum.de/about/n-thuerey/)
Additional thanks go to
Georg Kohl for the nice divider images (cf. {cite}`kohl2020lsim`),
Li-Wei Chen,
Xin Luo,
Maximilian Mueller,
Chloe Paillard,
Kiwon Um,
and all github contributors!
## Citation
If you find this book useful, please cite it via:
```
@book{thuerey2021pbdl,
title={Physics-based Deep Learning},
  author={N. Thuerey and B. Holzschuh and P. Holl and G. Kohl and M. Lino and Q. Liu and P. Schnell and F. Trost},
url={https://physicsbaseddeeplearning.org},
year={2021},
publisher={WWW}
}
```
## Time to get started
The future of simulation is being rewritten, and with the following AI and deep learning techniques, you'll be at the forefront of these developments. Let's dive in!


# source this file with "." in a shell
# note this script assumes the following paths/versions: python3.7 , /Users/thuerey/Library/Python/3.7/bin/jupyter-book
# updated for nMBA !
# do clean git checkout for changes from json-cleanup-for-pdf.py via:
# git checkout diffphys-code-burgers.ipynb diffphys-code-ns.ipynb diffphys-code-sol.ipynb physicalloss-code.ipynb bayesian-code.ipynb supervised-airfoils.ipynb reinflearn-code.ipynb physgrad-code.ipynb physgrad-comparison.ipynb physgrad-hig-code.ipynb
echo
echo WARNING - still requires one manual quit of first pdf/latex pass, use shift-x to quit, then fix latex
echo
PYT=python3
${PYT} json-cleanup-for-pdf.py
# clean / remove _build dir ?
/Users/thuerey/Library/Python/3.9/bin/jupyter-book build .
/Users/thuerey/Library/Python/3.9/bin/jupyter-book build . --builder pdflatex
exit # sufficient for newer jupyter book versions
# manual?
#xelatex book
# old "pre" GEN
#/Users/thuerey/Library/Python/3.9/bin/jupyter-book build . --builder pdflatex
# old cleanup
cd _build/latex
#mv book.pdf book-xetex.pdf # not necessary, failed anyway
# this generates book.tex
rm -f book-in.tex sphinxmessages-in.sty book-in.aux book-in.toc
# rename book.tex -> book-in.tex (this is the original output!)
mv book.tex book-in.tex
mv sphinxmessages.sty sphinxmessages-in.sty
mv book.aux book-in.aux
mv book.toc book-in.toc
#mv sphinxmanual.cls sphinxmanual-in.cls
${PYT} ../../fixup-latex.py
# reads book-in.tex -> writes book-in2.tex
# remove unicode chars via unix iconv
# reads book-in2.tex -> writes book.tex
iconv -c -f utf-8 -t ascii book-in2.tex > book.tex
# finally run pdflatex, now it should work:
# pdflatex -recorder book
pdflatex book
pdflatex book
# unused fixup-latex.py
# for convenience, archive results in main dir
mv book.pdf ../../pbfl-book-pdflatex.pdf
tar czvf ../../pbdl-latex-for-arxiv.tar.gz *
cd ../..
ls -l ./pbfl-book-pdflatex.pdf ./pbdl-latex-for-arxiv.tar.gz
#mv book.pdf ../../pbfl-book-pdflatex.pdf
#tar czvf ../../pbdl-latex-for-arxiv.tar.gz *


| Abbreviation | Meaning |
| --- | --- |
| AI | Mysterious buzzword popping up in all kinds of places these days |
| BNN | Bayesian neural network |
| CNN | Convolutional neural network (specific NN architecture) |
| DDPM | Denoising diffusion probabilistic models (diffusion modeling variant) |
| DL | Deep Learning |
| FM | Flow matching (diffusion modeling variant) |
| FNO | Fourier neural operator (specific NN architecture) |
| GD | (steepest) Gradient Descent |
| MLP | Multi-Layer Perceptron, a neural network with fully connected layers |
| NN | Neural network (a generic one, in contrast to, e.g., a CNN or MLP) |
| PDE | Partial Differential Equation |
| PBDL | Physics-Based Deep Learning |
| SGD | Stochastic Gradient Descent |


Generative Adversarial Networks
=======================
We've dealt with generative AI techniques and diffusion modeling
in detail in {doc}`probmodels-intro`.
As outlined there, the fundamental problem to fully represent
all possible states of a variable $\mathbf{x}$ under consideration,
i.e. to capture its full distribution, is a very old topic. Hence,
even before DDPMs&Co. there were techniques to make this possible,
and _generative adversarial networks_ (GANs) were
shown to be powerful tools in this context. While they've been largely replaced
by diffusion approaches in research, GANs use a highly interesting approach,
and the following sections will give an introduction and show what's possible with GANs.
Traditionally, GANs were employed when the data has ambiguous solutions,
and no differentiable physics model is available to disambiguate the data. In such a case
a supervised learning would yield an undesirable averaging that can be prevented with
a GAN approach.
results can be highly ambiguous.
## Maximum likelihood estimation
To train a GAN we have to briefly turn to _classification problems_, which we've managed to ignore up to now.
For classification, the learning objective takes a slightly different form than the
regression objective in equation {eq}`learn-l2` of {doc}`overview-equations`:
We now want to maximize the likelihood of a learned representation
$f$ that assigns a probability to an input $\mathbf{x}_i$ given a set of weights $\theta$ for
a chosen set of distinct classes. This yields a maximization problem of the form
$$
\text{arg max}_{\theta} \Pi_i f(\mathbf{x}_i;\theta) ,
$$
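In practice, this product is maximized in log space: taking the logarithm turns the product into a sum without changing the maximizer, and flipping the sign yields a minimization problem,

$$
\text{arg max}_{\theta} \Pi_i f(\mathbf{x}_i;\theta)
= \text{arg max}_{\theta} \sum_i \log f(\mathbf{x}_i;\theta)
= \text{arg min}_{\theta} - \sum_i \log f(\mathbf{x}_i;\theta) ,
$$

which is the familiar negative log-likelihood, i.e., the cross-entropy objective commonly used to train classifiers such as the discriminator.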
This is a highly challenging solution manifold, and requires an extended "cyclic" training
that pushes the discriminator to take all the physical parameters under consideration into account.
Interestingly, the generator learns to produce realistic and accurate solutions despite
being trained purely on data, i.e. without explicit help in the form of a differentiable physics solver setup.
The figure below shows a range of example outputs of a physically-parametrized GAN {cite}`chu2021physgan`.
```{figure} resources/others-GANs-meaningful-fig11.jpg
---
name: others-GANs-meaningful-fig11
---
A range of example outputs of a physically-parametrized GAN {cite}`chu2021physgan`.
The network can successfully extrapolate to buoyancy settings beyond the
range of values seen at training time.
```


Additional Topics
=======================
The next sections will give a shorter introduction to other classic topics that are
interesting in the context of physics-based deep learning. These topics (for now) do
not come with executable notebooks, but we will still point to existing open source
implementations for each of them.


While this is straight-forward for cases such as data consisting only of integer
for continuously changing quantities such as the temperature in a room.
While the previous examples have focused on aspects beyond discretization
(and used Cartesian grids as a placeholder), the following chapter will target
scenarios where learning neural operators with dynamically changing
and adaptive discretizations has benefits.
## Types of computational meshes
As outlined in {doc}`supervised-arch`, we can distinguish three types of computational meshes (or "grids")
with which discretizations are typically performed:
- **structured** meshes: Structured meshes have a regular
for the next stage of convolutions. After expanding
the size of the latent space over the course of a few layers, it is contracted again
to produce the desired result, e.g., an acceleration.
% {cite}`prantl2019tranquil`
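The expand-then-contract layout described above can be sketched in a few lines; a hedged PyTorch example, where all channel counts and kernel sizes are made up for illustration:

```python
import torch

# Hypothetical expand-then-contract stack: channel counts grow over a few
# convolutions into a larger latent space, then shrink back to the output
# (e.g., an acceleration per grid cell).
net = torch.nn.Sequential(
    torch.nn.Conv1d(3, 32, 5, padding=2), torch.nn.ReLU(),   # expand
    torch.nn.Conv1d(32, 64, 5, padding=2), torch.nn.ReLU(),  # latent space
    torch.nn.Conv1d(64, 32, 5, padding=2), torch.nn.ReLU(),  # contract
    torch.nn.Conv1d(32, 3, 5, padding=2))                    # e.g., acceleration

out = net(torch.rand(1, 3, 128))  # batch of 1, 3 channels, 128 cells
```

The spatial resolution is preserved throughout; only the width of the latent space changes, mirroring the expansion and contraction described in the text.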
## Continuous convolutions
to reproduce such behavior.
Nonetheless, an interesting side-effect of having a trained NN for such a liquid simulation
by construction provides a differentiable solver. Based on a pre-trained network, the learned solver
then supports optimization via gradient descent, e.g., w.r.t. input parameters such as viscosity.
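A sketch of this idea follows, with an untrained `learned_solver` network standing in for a pre-trained model; all names and sizes are made up, and only the mechanics of optimizing a physical input through frozen network weights are illustrated:

```python
import torch

torch.manual_seed(42)

# Made-up stand-in for a pre-trained learned solver: a network mapping
# (initial state, viscosity) to a later state. Here it is untrained; in
# practice its weights would come from a prior training run.
learned_solver = torch.nn.Sequential(
    torch.nn.Linear(9, 32), torch.nn.Tanh(), torch.nn.Linear(32, 8))
for p in learned_solver.parameters():
    p.requires_grad_(False)   # weights stay fixed; only the input is optimized

state0 = torch.rand(8)
target = torch.rand(8)

nu = torch.tensor(0.5, requires_grad=True)  # viscosity, the input we optimize
opt = torch.optim.Adam([nu], lr=0.01)
losses = []
for it in range(100):
    out = learned_solver(torch.cat([state0, nu.reshape(1)]))
    loss = ((out - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()           # gradient flows through the frozen network to nu
    opt.step()
    losses.append(float(loss))
```

The key point is that the gradient of the loss w.r.t. the scalar `nu` comes "for free" from the network's differentiability, without any adjoint code for the original solver.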
The following image shows an exemplary _prediction_ task with continuous convolutions from {cite}`ummenhofer2019contconv`.
```{figure} resources/others-lagrangian-canyon.jpg
---
name: others-lagrangian-canyon
---
An example of a particle-based liquid spreading in a landscape scenario, simulated with
learned, continuous convolutions.
```
## Source code


Ideally, this step is furthermore unrolled over time to stabilize the evolution
The resulting training will be significantly more expensive, as more weights need to be trained at once,
and a much larger number of intermediate states needs to be processed. However, the increased
cost typically pays off with a reduced overall inference error.
The following images show several time frames of an example prediction of {cite}`wiewel2020lsssubdiv`,
which additionally couples the learned time evolution with a numerically solved advection step.
```{figure} resources/others-timeseries-lss-subdiv-prediction.jpg
---
height: 300px
name: timeseries-lss-subdiv-prediction
---
The learned prediction is shown at the top, the reference simulation at the bottom.
```


Outlook
=======================
Despite the in-depth discussions and diverse examples we've explored, we've really only begun to tap into the vast potential of physics-based deep learning. The techniques covered in the previous chapters aren't just useful, they have the power to reshape computational methods for decades to come. As we've seen in the code examples, there's no magic at play; rather, deep learning provides an incredibly powerful new tool to work with complex, non-linear functions.
Crucially, deep learning doesn't replace traditional numerical methods. Instead, it enhances them. Together, they form a groundbreaking synergy, with a huge potential to unlock new frontiers in simulation and modeling. One aspect we haven't yet touched upon is perhaps the most profound: at its core, our ultimate goal is to deepen human understanding of the world. The notion of neural networks as impenetrable "black boxes" is outdated. Instead, they should be seen as just another numerical tool, one that is as interpretable as traditional simulations when used correctly.
Looking ahead, one of the most exciting challenges is to refine our ability to analyze learned networks. By distilling the patterns and structures these networks uncover, we move closer to extracting fundamental, human-readable insights from their solution manifolds. The future of differentiable simulation isn't just about better predictions, it's about revealing the hidden order of the physical world in ways we've never imagined.
![Divider](resources/divider2.jpg)
## Some specific directions
Beyond this long term outlook, there are many interesting and immediate steps.
Beyond this long-term vision, there are plenty of exciting and immediate next steps. While our deep dives into Burgers equation and Navier-Stokes solvers have tackled non-trivial challenges, they represent just a fraction of the landscape of PDE models and operators that these techniques can improve. Here are just a few promising directions from other fields:
* Chemical reaction PDEs often exhibit intricate behaviors due to multi-species interactions. A particularly exciting avenue is training models that can rapidly predict experimental or industrial processes and dynamically adjust control parameters to stabilize them, enabling real-time, intelligent control.
* Plasma simulations share similarities with vorticity-based fluid formulations but introduce additional complexities due to electric and magnetic interactions. This makes them a prime candidate for deep learning methods, especially for plasma fusion experiments and energy generators, where differentiable physics could be a game-changer.
* Weather and climate modeling remains among the most critical scientific challenges for humanity. These highly complex, multi-scale systems involve fluid flows intertwined with countless environmental factors. Leveraging deep learning to enhance numerical simulations in this space holds immense potential, not just for more accurate forecasts, but for unlocking deeper insights into the dynamics of our planet.
![Divider](resources/divider3.jpg)
## Closing remarks
These are just a few examples, but they illustrate the incredible breadth of opportunities where differentiable physics and deep learning can make an impact. There's lots of exciting research work left to do - the next years and decades definitely won't be boring. 🤗 👍
```{figure} resources/logo.jpg
---


"source": [
"# Simple Forward Simulation of Burgers Equation with phiflow\n",
"\n",
"This chapter will give an introduction to how to run _forward_, i.e., regular simulations starting with a given initial state and approximating a later state numerically, and introduce the Φ<sub>Flow</sub> framework (in the following \"phiflow\"). Phiflow provides a set of differentiable building blocks that directly interface with deep learning frameworks, and hence is a very good basis for the topics of this book. Before going for deeper and more complicated integrations, this notebook (and the next one) will show how regular simulations can be done with phiflow. Later on, we'll show that these simulations can be easily coupled with neural networks.\n",
"\n",
"The main repository for Φ<sub>Flow</sub> (in the following \"phiflow\") is [https://github.com/tum-pbs/PhiFlow](https://github.com/tum-pbs/PhiFlow), and additional API documentation and examples can be found at [https://tum-pbs.github.io/PhiFlow/](https://tum-pbs.github.io/PhiFlow/).\n",
"The main repository for phiflow is [https://github.com/tum-pbs/PhiFlow](https://github.com/tum-pbs/PhiFlow), and additional API documentation and examples can be found at [https://tum-pbs.github.io/PhiFlow/](https://tum-pbs.github.io/PhiFlow/).\n",
"\n",
"For this jupyter notebook (and all following ones), you can find a _\"[run in colab]\"_ link at the end of the first paragraph (alternatively you can use the launch button at the top of the page). This will load the latest version from the PBDL github repo in a colab notebook that you can execute on the spot: \n",
"[[run in colab]](https://colab.research.google.com/github/tum-pbs/pbdl-book/blob/main/overview-burgers-forw.ipynb)\n",
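As a framework-free preview of what "forward simulation" means here, the following sketch advances Burgers' equation $\partial u/\partial t + u \, \partial u/\partial x = \nu \, \partial^2 u/\partial x^2$ with explicit Euler steps and periodic boundaries, assuming the $-\sin(\pi x)$ initial state used later in this notebook. Grid size, step sizes, and viscosity are illustrative choices; phiflow replaces all of this with differentiable building blocks.

```python
import math

def burgers_step(u, dt, dx, nu):
    """One explicit Euler step of 1D Burgers' equation with periodic BCs."""
    n = len(u)
    u_new = [0.0] * n
    for i in range(n):
        um, up = u[i - 1], u[(i + 1) % n]            # periodic neighbors
        adv = u[i] * (up - um) / (2 * dx)            # advection term u * u_x
        diff = nu * (up - 2 * u[i] + um) / dx ** 2   # diffusion term nu * u_xx
        u_new[i] = u[i] + dt * (-adv + diff)
    return u_new

# Illustrative setup: 128 cells on [-1, 1], u(x, 0) = -sin(pi x)
N, DX, DT, NU = 128, 2.0 / 128, 0.001, 0.01
u = [-math.sin(math.pi * (-1.0 + (i + 0.5) * DX)) for i in range(N)]
for _ in range(100):                                 # advance to t = 0.1
    u = burgers_step(u, DT, DX, NU)
```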
@@ -31,7 +31,7 @@
"source": [
"## Importing and loading phiflow\n",
"\n",
"Let's get some preliminaries out of the way: first we'll import the phiflow library, more specifically the `numpy` operators for fluid flow simulations: `phi.flow` (differentiable versions for a DL framework _X_ are loaded via `phi.X.flow` instead).\n",
"Let's get some preliminaries out of the way: first we'll import the phiflow library, more specifically the `numpy` operators for fluid flow simulations: `phi.flow` (differentiable versions for a DL framework _X_ are loaded via `phi.X.flow` instead). This makes it easy to switch between backends, e.g., phiflow solvers can run in PyTorch, TensorFlow, or JAX.\n",
"\n",
"**Note:** Below, the first command with a \"!\" prefix will install the [phiflow python package from GitHub](https://github.com/tum-pbs/PhiFlow) via `pip` in your python environment once you uncomment it. We've assumed that phiflow isn't installed, but if you have already done so, just comment out the first line (the same will hold for all following notebooks)."
]
@@ -45,15 +45,13 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Using phiflow version: 3.1.0\n"
"Using phiflow version: 3.4.0\n"
]
}
],
"source": [
"!pip install --upgrade --quiet phiflow==3.1\n",
"!pip install --upgrade --quiet phiflow==3.4\n",
"from phi.flow import *\n",
"\n",
"from phi import __version__\n",
"print(\"Using phiflow version: {}\".format(phi.__version__))"
]
},
@@ -95,7 +93,7 @@
"source": [
"\n",
"Next, we initialize a 1D `velocity` grid from the `INITIAL` numpy array that was converted into a tensor.\n",
"The extent of our domain $\\Omega$ is specifiied via the `bounds` parameter $[-1,1]$, and the grid uses periodic boundary conditions (`extrapolation.PERIODIC`). These two properties are the main difference between a tensor and a grid: the latter has boundary conditions and a physical extent.\n",
"The extent of our domain $\Omega$ is specified via the `bounds` parameter $[-1,1]$, and the grid uses periodic boundary conditions (`extrapolation.PERIODIC`). These two properties are the main difference between phiflow's tensor and grid objects: the latter has boundary conditions and a physical extent.\n",
"\n",
"Just to illustrate, we'll also print some info about the velocity object: it's a `phi.math` tensor with a size of 128. Note that the actual grid content is contained in the `values` of the grid. Below we're printing five entries by using the `numpy()` function to convert the content of the phiflow tensor into a numpy array. For tensors with more dimensions, we'd need to specify the additional dimensions here, e.g., `'y,x,vector'` for a 2D velocity field. (For tensors with a single dimension we could leave it out.)"
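To illustrate the tensor-vs-grid distinction, here is a toy, hypothetical `PeriodicGrid1D` class (not phiflow's API) that bundles values with a physical extent and periodic boundary handling:

```python
class PeriodicGrid1D:
    """Toy analogue (not phiflow's API): values plus extent and periodic BCs."""
    def __init__(self, values, lower, upper):
        self.values = list(values)             # the actual grid content
        self.lower, self.upper = lower, upper  # physical bounds, e.g. [-1, 1]

    @property
    def dx(self):
        """Cell size implied by the physical extent."""
        return (self.upper - self.lower) / len(self.values)

    def __getitem__(self, i):
        return self.values[i % len(self.values)]  # periodic boundary handling

g = PeriodicGrid1D([0.0, 1.0, 2.0, 3.0], lower=-1.0, upper=1.0)
```

Indexing past either end wraps around, e.g. `g[5]` returns the same value as `g[1]`; a plain tensor has neither this wrap-around behavior nor a `dx` derived from physical bounds.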
]

View File

@@ -1,25 +1,22 @@
Models and Equations
============================
Below we'll give a brief (really _very_ brief!) intro to deep learning, primarily to introduce the notation.
Below we'll give a _very_ brief intro to deep learning, primarily to introduce the notation.
In addition we'll discuss some _model equations_ below. Note that we'll avoid using _model_ to denote trained neural networks, in contrast to some other texts and APIs. These will be called "NNs" or "networks". A "model" will typically denote a set of model equations for a physical effect, usually PDEs.
## Deep learning and neural networks
In this book we focus on the connection with physical
models, and there are lots of great introductions to deep learning.
Hence, we'll keep it short:
the goal in deep learning is to approximate an unknown function
The goal in deep learning is to approximate an unknown function
$$
f^*(x) = y^* ,
$$ (learn-base)
where $y^*$ denotes reference or "ground truth" solutions.
$f^*(x)$ should be approximated with an NN representation $f(x;\theta)$. We typically determine $f$
where $y^*$ denotes reference or "ground truth" solutions, and
$f^*(x)$ should be approximated with an NN $f(x;\theta)$. We typically determine $f$
with the help of some variant of a loss function $L(y,y^*)$, where $y=f(x;\theta)$ is the output
of the NN.
This gives a minimization problem to find $f(x;\theta)$ such that $e$ is minimized.
This gives a minimization problem to find $f(x;\theta)$ such that $L$ is minimized.
In the simplest case, we can use an $L^2$ error, giving
$$
@@ -28,7 +25,7 @@ $$ (learn-l2)
We typically optimize, i.e. _train_,
with a stochastic gradient descent (SGD) optimizer of choice, e.g. Adam {cite}`kingma2014adam`.
We'll rely on auto-diff to compute the gradient of a _scalar_ loss $L$ w.r.t. the weights, $\partial L / \partial \theta$.
We'll rely on auto-diff to compute the gradient of the _scalar_ loss $L$ w.r.t. the weights, $\partial L / \partial \theta$.
It is crucial for the calculation of gradients that this function is scalar,
and the loss function is often also called "error", "cost", or "objective" function.
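As a minimal sketch of this setup, the following pure-Python snippet minimizes the $L^2$ loss for a one-parameter "network" $f(x;\theta)=\theta x$ with plain gradient descent (Adam and auto-diff are omitted for brevity; the gradient is written out by hand):

```python
def f(x, theta):
    """A one-parameter 'network': f(x; theta) = theta * x."""
    return theta * x

def loss(theta, data):
    """L2 loss: sum over (f(x; theta) - y*)^2."""
    return sum((f(x, theta) - y) ** 2 for x, y in data)

def grad(theta, data):
    """dL/dtheta, written out by hand (auto-diff would provide this)."""
    return sum(2.0 * (f(x, theta) - y) * x for x, y in data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # ground truth y* = 2x
theta = 0.0
for _ in range(200):
    theta -= 0.01 * grad(theta, data)        # plain gradient descent update
```

After training, `theta` has converged to the ground-truth slope of 2, and the scalar loss is (near) zero.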
@@ -38,14 +35,14 @@ introduce scalar loss, always(!) scalar... (also called *cost* or *objective* f
For training we distinguish: the **training** data set drawn from some distribution,
the **validation** set (from the same distribution, but different data),
and **test** data sets with _some_ different distribution than the training one.
The latter distinction is important. For the test set we want
The latter distinction is important. For testing, we usually want
_out of distribution_ (OOD) data to check how well our trained model generalizes.
Note that this gives a huge range of possibilities for the test data set:
from tiny changes that will certainly work,
up to completely different inputs that are essentially guaranteed to fail.
There's no gold standard, but test data should be generated with care.
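A minimal sketch of the first distinction, splitting in-distribution data into training and validation sets, using a hypothetical `split_dataset` helper; the OOD test set would be generated separately:

```python
import random

def split_dataset(samples, val_fraction=0.2, seed=0):
    """Split in-distribution data into training and validation sets.
    The test set should come from a *different* distribution (OOD),
    so it is generated separately rather than carved out here."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (training, validation)

train, val = split_dataset(range(100))
```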
Enough for now - if all the above wasn't totally obvious for you, we very strongly recommend to
If the overview above wasn't obvious for you, we strongly recommend that you
read chapters 6 to 9 of the [Deep Learning book](https://www.deeplearningbook.org),
especially the sections about [MLPs](https://www.deeplearningbook.org/contents/mlp.html)
and "Conv-Nets", i.e. [CNNs](https://www.deeplearningbook.org/contents/convnets.html).
@@ -53,7 +50,7 @@ and "Conv-Nets", i.e. [CNNs](https://www.deeplearningbook.org/contents/convnets.
```{note} Classification vs Regression
The classic ML distinction between _classification_ and _regression_ problems is not so important here:
we only deal with _regression_ problems in the following.
we only deal with _regression_ problems in the following.
```
@@ -66,8 +63,19 @@ Also interesting: from a math standpoint ''just'' non-linear optimization ...
The following section will give a brief outlook for the model equations
we'll be using later on in the DL examples.
We typically target continuous PDEs denoted by $\mathcal P^*$
whose solution is of interest in a spatial domain $\Omega \subset \mathbb{R}^d$ in $d \in {1,2,3} $ dimensions.
We typically target a continuous PDE operator denoted by $\mathcal P^*$,
which maps inputs in $\mathcal U$ to outputs in $\mathcal V$, where in the most general case $\mathcal U, \mathcal V$
are both infinite dimensional Banach spaces, i.e. $\mathcal P^*: \mathcal U \rightarrow \mathcal V$.
```{admonition} Learned solution operators vs traditional ones
:class: tip
Later on, the goal will be to learn $\mathcal P^*$ (or parts of it) with a neural network. A
variety of different names are used in research: learned surrogates, hybrid simulators or emulators,
neural operators or solvers, and autoregressive models (if timesteps are involved), to name a few.
```
In practice,
the solution of interest lies in a spatial domain $\Omega \subset \mathbb{R}^d$ in $d \in \{1,2,3\}$ dimensions.
In addition, we often consider a time evolution for a finite time interval $t \in \mathbb{R}^{+}$.
The corresponding fields are either d-dimensional vector fields, for instance $\mathbf{u}: \mathbb{R}^d \times \mathbb{R}^{+} \rightarrow \mathbb{R}^d$,
or scalar $\mathbf{p}: \mathbb{R}^d \times \mathbb{R}^{+} \rightarrow \mathbb{R}$.
@@ -79,12 +87,11 @@ To obtain unique solutions for $\mathcal P^*$ we need to specify suitable
initial conditions, typically for all quantities of interest at $t=0$,
and boundary conditions for the boundary of $\Omega$, denoted by $\Gamma$ in
the following.
$\mathcal P^*$ denotes
a continuous formulation, where we make mild assumptions about
a continuous formulation, where we need to make mild assumptions about
its continuity: we will typically assume that first and second derivatives exist.
We can then use numerical methods to obtain approximations
Traditionally, we can use numerical methods to obtain approximations
of a smooth function such as $\mathcal P^*$ via discretization.
These invariably introduce discretization errors, which we'd like to keep as small as possible.
These errors can be measured in terms of the deviation from the exact analytical solution,
@@ -127,7 +134,7 @@ and the abbreviations used in: {doc}`notation`.
%This yields $\vc{} \in \mathbb{R}^{d \times d_{s,x} \times d_{s,y} \times d_{s,z} }$ and $\vr{} \in \mathbb{R}^{d \times d_{r,x} \times d_{r,y} \times d_{r,z} }$
%Typically, $d_{r,i} > d_{s,i}$ and $d_{z}=1$ for $d=2$.
We solve a discretized PDE $\mathcal{P}$ by performing steps of size $\Delta t$.
With numerical simulations we solve a discretized PDE $\mathcal{P}$ by performing steps of size $\Delta t$.
The solution can be expressed as a function of $\mathbf{u}$ and its derivatives:
$\mathbf{u}(\mathbf{x},t+\Delta t) =
\mathcal{P}( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} )$, where
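To make this notation concrete, here is a sketch using the heat equation $u_t = \nu u_{xx}$ as $\mathcal{P}$, with periodic central finite differences standing in for the derivative terms (grid spacing and $\nu$ are illustrative choices):

```python
def ddx(u, dx):
    """u_x via periodic central differences."""
    n = len(u)
    return [(u[(i + 1) % n] - u[i - 1]) / (2 * dx) for i in range(n)]

def d2dx2(u, dx):
    """u_xx via periodic central differences."""
    n = len(u)
    return [(u[(i + 1) % n] - 2 * u[i] + u[i - 1]) / dx ** 2 for i in range(n)]

def P(u, dx, dt, nu=0.1):
    """Discretized operator P for the heat equation:
    u(x, t + dt) = u + dt * nu * u_xx."""
    uxx = d2dx2(u, dx)
    return [u[i] + dt * nu * uxx[i] for i in range(len(u))]
```

Each call to `P` advances the state by one step of size $\Delta t$; chaining calls yields the time evolution.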

File diff suppressed because one or more lines are too long

View File

@@ -2,7 +2,7 @@ Optimization and Convergence
============================
This chapter will give an overview of the derivations for different optimization algorithms.
In contrast to other texts, we'll start with _the_ classic optimization algorithm, Newton's method,
In contrast to other texts, we'll start with _the most classic_ optimization algorithm, Newton's method,
derive several widely used variants from it, before coming back full circle to deep learning (DL) optimizers.
The main goal is to put DL into the context of these classical methods. While we'll focus on DL, we will also revisit
the classical algorithms later on in this book to improve learning algorithms. Physics simulations exacerbate the difficulties caused by neural networks, which is why the topics below have a particular relevance for physics-based learning tasks.
@@ -62,7 +62,7 @@ In several instances we'll make use of the fundamental theorem of calculus, repe
$$f(x+\Delta) = f(x) + \int_0^1 \text{d}s ~ f'(x+s \Delta) \Delta \ . $$
In addition, we'll make use of Lipschitz-continuity with constant $\mathcal L$:
$|f(x+\Delta) + f(x)|\le \mathcal L \Delta$, and the well-known Cauchy-Schwartz inequality:
$|f(x+\Delta) - f(x)|\le \mathcal L |\Delta|$, and the well-known Cauchy-Schwarz inequality:
$ u^T v \le |u| \cdot |v| $.
## Newton's method
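A one-dimensional sketch of the basic update, $x \leftarrow x - L'(x)/L''(x)$, applied to a simple quadratic (function names are illustrative):

```python
def newton_minimize(grad, hess, x0, steps=20):
    """Newton's method for optimization in 1D: x <- x - L'(x) / L''(x)."""
    x = x0
    for _ in range(steps):
        x -= grad(x) / hess(x)
    return x

# Minimize L(x) = (x - 3)^2 + 1: gradient 2(x - 3), Hessian 2.
x_min = newton_minimize(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```

For a quadratic, a single Newton step lands exactly on the minimum, which is what makes the method the natural starting point for the derivations in this chapter.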

View File

@@ -2,10 +2,10 @@ Overview
============================
The name of this book, _Physics-Based Deep Learning_,
denotes combinations of physical modeling and numerical simulations with
methods based on artificial neural networks.
The general direction of Physics-Based Deep Learning represents a very
active, quickly growing and exciting field of research. The following chapter will
denotes combinations of physical modeling and **numerical simulations** with
methods based on **artificial intelligence**, i.e. neural networks.
The general direction of Physics-Based Deep Learning, also going under the name _Scientific Machine Learning_,
represents a very active, quickly growing and exciting field of research. The following chapter will
give a more thorough introduction to the topic and establish the basics
for following chapters.
@@ -15,9 +15,9 @@ height: 240px
name: overview-pano
---
Understanding our environment, and predicting how it will evolve is one of the key challenges of humankind.
A key tool for achieving these goals are simulations, and next-gen simulations
could strongly profit from integrating deep learning components to make even
more accurate predictions about our world.
A key tool for achieving these goals is computer simulation, and the next generation of these simulations
will likely profit strongly from integrating AI and deep learning components, in order to make even
more accurate predictions about the phenomena in our environment.
```
## Motivation
@@ -28,11 +28,11 @@ to the control of plasma fusion {cite}`maingi2019fesreport`,
using numerical analysis to obtain solutions for physical models has
become an integral part of science.
In recent years, machine learning technologies and _deep neural networks_ in particular,
In recent years, artificial intelligence driven by _deep neural networks_,
have led to impressive achievements in a variety of fields:
from image classification {cite}`krizhevsky2012` over
natural language processing {cite}`radford2019language`,
and more recently also for protein folding {cite}`alquraishi2019alphafold`.
and protein folding {cite}`alquraishi2019alphafold`, to various foundation models.
The field is very vibrant and quickly developing, with the promise of vast possibilities.
### Replacing traditional simulations?
@@ -45,14 +45,17 @@ for real-world, industrial applications such as airfoil flows {cite}`chen2021hig
same time outperforming traditional solvers by orders of magnitude in terms of runtime.
Instead of relying on models that are carefully crafted
from first principles, can data collections of sufficient size
be processed to provide the correct answers?
from first principles, can sufficiently large datasets
be processed instead to provide the correct answers?
As we'll show in the next chapters, this concern is unfounded.
Rather, it is crucial for the next generation of simulation systems
to bridge both worlds: to
combine _classical numerical_ techniques with _deep learning_ methods.
combine _classical numerical_ techniques with _A.I._ methods.
In addition, the latter offer exciting new possibilities in areas that
have been challenging for traditional methods, such as dealing
with complex _distributions and uncertainty_ in simulations.
One central reason for the importance of this combination is
One central reason for the importance of the combination with numerics is
that DL approaches are powerful, but at the same time strongly profit
from domain knowledge in the form of physical models.
DL techniques and NNs are novel, sometimes difficult to apply, and
@@ -70,47 +73,48 @@ developed in the field of numerical mathematics, this book will
show that it is highly beneficial to use them as much as possible
when applying DL.
### Black boxes and magic?
### Black boxes?
People who are unfamiliar with DL methods often associate neural networks
with _black boxes_, and see the training processes as something that is beyond the grasp
In the past, trained neural networks have often been associated
with _black boxes_, implying that they are something beyond the grasp
of human understanding. However, these viewpoints typically stem from
relying on hearsay and not dealing with the topic enough.
relying on hearsay and general skepticism about "hyped" topics.
Rather, the situation is a very common one in science: we are facing a new class of methods,
and "all the gritty details" are not yet fully worked out. This is pretty common
The situation is a very common one in science, though: we are facing a new class of methods,
and "all the gritty details" are not yet fully worked out. This is and has been pretty common
for all kinds of scientific advances.
Numerical methods themselves are a good example. Around 1950, numerical approximations
and solvers had a tough standing. E.g., to cite H. Goldstine,
numerical instabilities were considered to be a
"constant source of anxiety in the future" {cite}`goldstine1990history`.
By now we have a pretty good grasp of these instabilities, and numerical methods
are ubiquitous and well established.
are ubiquitous and well established. AI and neural networks are following
the same path.
Thus, it is important to be aware of the fact that -- in a way -- there is nothing
magical or otherworldly to deep learning methods. They're simply another set of
numerical tools. That being said, they're clearly fairly new, and right now
very special or otherworldly to deep learning methods. They're simply a new set of
numerical tools. That being said, they're still comparatively young, and right now
definitely the most powerful set of tools we have for non-linear problems.
Just because all the details aren't fully worked out and nicely written up,
that shouldn't stop us from including these powerful methods in our numerical toolbox.
That all the details aren't fully worked out and nicely written up yet
shouldn't stop us from including these powerful methods in our numerical toolbox.
### Reconciling DL and simulations
### Reconciling AI and simulations
Taking a step back, the aim of this book is to build on all the powerful techniques that we have
at our disposal for numerical simulations, and use them wherever we can in conjunction
with deep learning.
As such, a central goal is to _reconcile_ the data-centered viewpoint with physical simulations.
As such, a central goal is to _reconcile_ the AI viewpoint with physical simulations.
```{admonition} Goals of this document
:class: tip
The key aspects that we will address in the following are:
- explain how to use deep learning techniques to solve PDE problems,
- how to use deep learning techniques to **solve PDE** problems,
- how to combine them with **existing knowledge** of physics,
- without **discarding** our knowledge about numerical methods.
- without **discarding** numerical methods.
At the same time, it's worth noting what we won't be covering:
- introductions to deep learning and numerical simulations,
- we're neither aiming for a broad survey of research articles in this area.
- there's no in-depth **introduction** to deep learning and numerical simulations (other great works already cover this),
- nor is the aim a broad survey of research articles in this area.
```
The resulting methods have a huge potential to improve
@@ -118,26 +122,28 @@ what can be done with numerical methods: in scenarios
where a solver targets cases from a certain well-defined problem
domain repeatedly, it can for instance make a lot of sense to once invest
significant resources to train
a neural network that supports the repeated solves. Based on the
domain-specific specialization of this network, such a hybrid solver
could vastly outperform traditional, generic solvers. And despite
a neural network that supports the repeated solves.
The development of large so-called "foundation models" is especially
promising in this area.
Based on the domain-specific specialization via fine-tuning with a smaller dataset,
a hybrid solver could vastly outperform traditional, generic solvers. And despite
the many open questions, first publications have demonstrated
that this goal is not overly far away {cite}`um2020sol,kochkov2021`.
that this goal is a realistic one {cite}`um2020sol,kochkov2021`.
Another way to look at it is that all mathematical models of our nature
are idealized approximations and contain errors. A lot of effort has been
made to obtain very good model equations, but to make the next
big step forward, DL methods offer a very powerful tool to close the
big step forward, AI and DL methods offer a very powerful tool to close the
remaining gap towards reality {cite}`akkaya2019solving`.
## Categorization
Within the area of _physics-based deep learning_,
we can distinguish a variety of different
approaches, from targeting constraints, combined methods, and
optimizations to applications. More specifically, all approaches either target
approaches, e.g., targeting constraints, combined methods,
optimizations and applications. More specifically, all approaches either target
_forward_ simulations (predicting state or temporal evolution) or _inverse_
problems (e.g., obtaining a parametrization for a physical system from
problems (e.g., obtaining a parametrization or state for a physical system from
observations).
![An overview of categories of physics-based deep learning methods](resources/physics-based-deep-learning-overview.jpg)
@@ -160,17 +166,14 @@ techniques:
gradients from a PDE-based formulation. These soft constraints sometimes also go
under the name "physics-informed" training.
- _Interleaved_: the full physical simulation is interleaved and combined with
an output from a deep neural network; this requires a fully differentiable
simulator and represents the tightest coupling between the physical system and
the learning process. Interleaved differentiable physics approaches are especially important for
temporal evolutions, where they can yield an estimate of the future behavior of the
dynamics.
- _Hybrid_: the full physical simulation is interleaved and combined with
an output from a deep neural network; this usually requires a fully differentiable
simulator. It represents the tightest coupling between the physical system and
the learning process and results in a hybrid solver that combines classic techniques with AI-based ones.
Thus, methods can be categorized in terms of forward versus inverse
solve, and how tightly the physical model is integrated into the
optimization loop that trains the deep neural network. Here, especially
interleaved approaches that leverage _differentiable physics_ allow for
solve, and how tightly the physical model is integrated with the neural network.
Here, especially hybrid approaches that leverage _differentiable physics_ allow for
very tight integration of deep learning and numerical simulation methods.
@@ -186,19 +189,28 @@ In contrast, we'll focus on _physical_ simulations from now on, hence the name.
When coming from other backgrounds, other names are more common however. E.g., the differentiable
physics approach is equivalent to using the adjoint method, and coupling it with a deep learning
procedure. Effectively, it is also equivalent to applying backpropagation / reverse-mode differentiation
to a numerical simulation. However, as mentioned above, motivated by the deep learning viewpoint,
to a numerical simulation.
However, as mentioned above, motivated by the deep learning viewpoint,
we'll refer to all these as "differentiable physics" approaches from now on.
The hybrid solvers that result from integrating DL with a traditional solver can also be seen
as a classic topic: in this context, the neural network has the task to _correct_ the solver.
This correction can in turn either target numerical errors, or unresolved terms in an equation.
This is a fundamental problem in science that has been addressed under various names, e.g.,
as the _closure problem_ in fluid dynamics and turbulence, as _homogenization_ or _coarse-graining_
in material science, and _parametrization_ in climate and weather simulation. The re-invention
of this goal in the different fields points to the importance of the underlying problem,
and this text will illustrate the new ways that DL offers to tackle it.
---
## Looking ahead
_Physical simulations_ are a huge field, and we won't be able to cover all possible types of physical models and simulations.
_Physics simulations_ are a huge field, and we won't be able to cover all possible types of physical models and simulations.
```{note} Rather, the focus of this book lies on:
- _Field-based simulations_ (no Lagrangian methods)
- Dense _field-based simulations_ (no Lagrangian methods)
- Combinations with _deep learning_ (plenty of other interesting ML techniques exist, but won't be discussed here)
- Experiments are left as an _outlook_ (i.e., replacing synthetic data with real-world observations)
```
@@ -218,24 +230,17 @@ A brief look at our _notation_ in the {doc}`notation` chapter won't hurt in both
## Implementations
This text also represents an introduction to a wide range of deep learning and simulation APIs.
We'll use popular deep learning APIs such as _pytorch_ [https://pytorch.org](https://pytorch.org) and _tensorflow_ [https://www.tensorflow.org](https://www.tensorflow.org), and additionally
give introductions into the differentiable simulation framework _Φ<sub>Flow</sub> (phiflow)_ [https://github.com/tum-pbs/PhiFlow](https://github.com/tum-pbs/PhiFlow). Some examples also use _JAX_ [https://github.com/google/jax](https://github.com/google/jax). Thus after going through
these examples, you should have a good overview of what's available in current APIs, such that
This text also represents an introduction to deep learning and simulation APIs.
We'll primarily use the popular deep learning API _pytorch_ [https://pytorch.org](https://pytorch.org), but also a bit of _tensorflow_ [https://www.tensorflow.org](https://www.tensorflow.org), and additionally
give introductions into the differentiable simulation framework _Φ<sub>Flow</sub> (phiflow)_ [https://github.com/tum-pbs/PhiFlow](https://github.com/tum-pbs/PhiFlow). Some examples also use _JAX_ [https://github.com/google/jax](https://github.com/google/jax), which provides an interesting alternative.
Thus after going through these examples, you should have a good overview of what's available in current APIs, such that
the best one can be selected for new tasks.
As we're (in most Jupyter notebook examples) dealing with stochastic optimizations, many of the following code examples will produce slightly different results each time they're run. This is fairly common with NN training, but it's important to keep in mind when executing the code. It also means that the numbers discussed in the text might not exactly match the numbers you'll see after re-running the examples.
As we're dealing with stochastic optimizations in most of the Jupyter notebooks, many of the following code examples will produce slightly different results each time they're run. This is fairly common with NN training, but it's important to keep in mind when executing the code. It also means that the numbers discussed in the text might not exactly match the numbers you'll see after re-running the examples.
<!-- ## A brief history of PBDL in the context of Fluids
First:
Tompson, seminal...
First: Tompson, seminal...
Chu, descriptors, early but not used
Ling et al. isotropic turb, small FC, unused?
PINNs ... and more ... -->

View File

@@ -84,7 +84,7 @@
}
],
"source": [
"!pip install --upgrade --quiet phiflow==3.1\n",
"!pip install --upgrade --quiet phiflow==3.4\n",
"from phi.torch.flow import * # switch to TF with \"phi.tf.flow\""
]
},
@@ -254,9 +254,9 @@
" signal_prior = 0.5\n",
" expected_amp = 1. * kernel.shape.get_size('x') * inv_kernel # This can be measured\n",
" signal_likelihood = math.exp(-0.5 * (abs(amp) / expected_amp) ** 2) * signal_prior # this can be NaN\n",
" signal_likelihood = math.where(math.isfinite(signal_likelihood), signal_likelihood, math.zeros_like(signal_likelihood))\n",
" signal_likelihood = math.where(math.is_finite(signal_likelihood), signal_likelihood, math.zeros_like(signal_likelihood))\n",
" noise_likelihood = math.exp(-0.5 * (abs(amp) / f_uncertainty) ** 2) * (1 - signal_prior)\n",
" probability_signal = math.divide_no_nan(signal_likelihood, (signal_likelihood + noise_likelihood))\n",
" probability_signal = math.safe_div(signal_likelihood, (signal_likelihood + noise_likelihood))\n",
" action = math.where((0.5 >= probability_signal) | (probability_signal >= 0.68), 2 * (probability_signal - 0.5), 0.) # 1 sigma required to take action\n",
" prob_kernel = math.exp(log_kernel * action)\n",
" return prob_kernel, probability_signal\n",
@@ -310,7 +310,7 @@
"BATCH = batch(batch=128)\n",
"STEPS = 50\n",
"\n",
"math.seed(0)\n",
"#math.seed(0)\n",
"net = u_net(1, 1)\n",
"optimizer = adam(net, 0.001)"
]
@@ -347,7 +347,7 @@
"def loss_function(net, x_gt: CenteredGrid, sip: bool):\n",
" y_target = diffuse.fourier(x_gt, 8., 1)\n",
" with math.precision(32):\n",
" prediction = field.native_call(net, field.to_float(y_target)).vector[0]\n",
" prediction = field.native_call(net, field.to_float(y_target))\n",
" prediction += field.mean(x_gt) - field.mean(prediction)\n",
" x = field.stop_gradient(prediction)\n",
" if sip:\n",
@@ -459,7 +459,7 @@
}
],
"source": [
"math.seed(0)\n",
"#math.seed(0)\n",
"net_gd = u_net(1, 1)\n",
"optimizer_gd = adam(net_gd, 0.001)\n",
"\n",
@@ -648,4 +648,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}

View File

@@ -126,7 +126,7 @@ The value of $\xi$ determines the conditioning of $\mathcal P$ with large $\xi$
Here's an example of the resulting loss landscape for $y^*=(0.3, -0.5)$, $\xi=1$, $\phi=15^\circ$ that shows the entangling of the sine function for $x_1$ and linear change for $x_2$:
```{figure} resources/physgrad-sin-loss.png
```{figure} resources/physgrad-sin-loss.jpg
---
height: 200px
name: physgrad-sin-loss
@@ -137,7 +137,7 @@ Next we train a fully-connected neural network to invert this problem via equati
We'll compare SIP training using a saddle-free Newton solver to various state-of-the-art network optimizers.
For fairness, the best learning rate is selected independently for each optimizer.
When choosing $\xi=1$ the problem is perfectly conditioned. In this case all network optimizers converge, with Adam having a slight advantage. This is shown in the left graph:
```{figure} resources/physgrad-sin-time-graphs.png
```{figure} resources/physgrad-sin-time-graphs.jpg
---
height: 180px
name: physgrad-sin-time-graphs
@@ -154,7 +154,7 @@ While the evaluation of the Hessian inherently requires more computations, the p
By increasing $\xi$ while keeping $\phi=0$ fixed we can show how the conditioning continually influences the different methods,
as shown on the left here:
```{figure} resources/physgrad-sin-add-graphs.png
```{figure} resources/physgrad-sin-add-graphs.jpg
---
height: 180px
name: physgrad-sin-add-graphs

File diff suppressed because one or more lines are too long

View File

@@ -3,12 +3,13 @@ Discussion of Physical Losses
The good news so far is - we have a DL method that can include
physical laws in the form of soft constraints by minimizing residuals.
However, as the very simple previous example illustrates, this is just a conceptual
starting point.
However, as the very simple previous example illustrates, this causes
new difficulties, and is just a conceptual starting point.
On the positive side, we can leverage DL frameworks with backpropagation to compute
the derivatives of the model. At the same time, this puts us at the mercy of the learned
representation regarding the reliability of these derivatives. Also, each derivative
the derivatives of the model. At the same time, this makes the loss landscape more
complicated, and ties the reliability of the derivatives to the learned
representation. Also, each derivative
requires backpropagation through the full network. This can be very expensive, especially
for higher-order derivatives.
@@ -16,16 +17,12 @@ And while the setup is relatively simple, it is generally difficult to control.
has flexibility to refine the solution by itself, but at the same time, tricks are necessary
when it doesn't focus on the right regions of the solution.
## Is it "Machine Learning"?
## Generalization?
One question that might also come to mind at this point is: _can we really call it machine learning_?
Of course, such denomination questions are superficial - if an algorithm is useful, it doesn't matter
what name it has. However, here the question helps to highlight some important properties
that are typically associated with algorithms from fields like machine learning or optimization.
One main reason _not_ to call the optimization of the previous notebook machine learning (ML), is that the
One aspect to note with the previous PINN optimization is that the
positions where we test and constrain the solution are the final positions we are interested in.
As such, there is no real distinction between training, validation and test sets.
As such, from a classic ML standpoint, there is no real distinction between training, validation and test sets.
Computing the solution for a known and given set of samples is much more akin to classical optimization,
where inverse problems like the previous Burgers example stem from.
@@ -33,7 +30,8 @@ For machine learning, we typically work under the assumption that the final perf
model will be evaluated on a different, potentially unknown set of inputs. The _test data_
should usually capture such _out of distribution_ (OOD) behavior, so that we can make estimates
about how well our model will generalize to "real-world" cases that we will encounter when
we deploy it in an application.
we deploy it in an application. The v1 version, using a prescribed discretization, actually
had this property, and could generalize to new inputs.
In contrast, for the PINN training as described here, we reconstruct a single solution in a known
and given space-time region. As such, any samples from this domain follow the same distribution
@@ -47,26 +45,27 @@ have to start training the NN from scratch.
## Summary
Thus, the physical soft constraints allow us to encode solutions to
PDEs with the tools of NNs.
An inherent drawback of this variant 2 is that it yields single solutions,
and that it does not combine with traditional numerical techniques well.
PDEs with the tools of NNs. As they're more widely used, we'll focus on PINNs (v2) here:
An inherent drawback is that they typically yield single solutions or very narrow solution manifolds,
and that they do not combine with traditional numerical techniques well.
In comparison to the Neural surrogates/operators from {doc}`supervised` we've made a step backwards in some way.
E.g., the learned representation is not suitable to be refined with
a classical iterative solver such as the conjugate gradient method.
This means many
powerful techniques that were developed in the past decades cannot be used in this context.
Bringing these numerical methods back into the picture will be one of the central
goals of the next sections.
✅ Pro:
- Uses physical model.
- Derivatives can be conveniently computed via backpropagation.
- Uses physical model
- Derivatives can be conveniently computed via backpropagation
❌ Con:
- Quite slow ...
- Physical constraints are enforced only as soft constraints.
- Largely incompatible with _classical_ numerical methods.
- Accuracy of derivatives relies on learned representation.
- Problematic convergence
- Physical constraints are enforced only as soft constraints
- Largely incompatible with _classical_ numerical methods
- Usefulness of derivatives relies on learned representation
To address these issues,
we'll next look at how we can leverage existing numerical methods to improve the DL process

869
physicalloss-div.ipynb Normal file

File diff suppressed because one or more lines are too long


@@ -2,9 +2,9 @@ Physical Loss Terms
=======================
The supervised setting of the previous sections can quickly
yield approximate solutions with a fairly simple training process. However, what's
quite sad to see here is that we only use physical models and numerical methods
as an "external" tool to produce a big pile of data 😢.
yield approximate solutions with a simple and stable training process. However, it's
unfortunate that we only use physical models and numerical methods
as an "external" tool to produce lots of data 😢.
We as humans have a lot of knowledge about how to describe physical processes
mathematically. As the following chapters will show, we can improve the
@@ -23,20 +23,21 @@ $$
\mathbf{u}_t = \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} ) ,
$$
where the $_{\mathbf{x}}$ subscripts denote spatial derivatives with respect to one of the spatial dimensions
of higher and higher order (this can of course also include mixed derivatives with respect to different axes). $\mathbf{u}_t$ denotes the changes over time.
In this context, we can approximate the unknown $\mathbf{u}$ itself with a neural network. If the approximation, which we call $\tilde{\mathbf{u}}$, is accurate, the PDE should be satisfied naturally. In other words, the residual R should be equal to zero:
where the $_{\mathbf{x}}$ subscripts denote spatial derivatives with respect to the spatial dimensions
(this could of course also include mixed derivatives with respect to different axes). $\mathbf{u}_t$ denotes the changes over time.
Given a solution $\mathbf{u}$, we can compute the residual R, which naturally should be equal to zero for a correct solution:
$$
R = \mathbf{u}_t - \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} ) = 0 .
$$
In this context, we can approximate the unknown $\mathbf{u}$ itself with a neural network.
If the approximation is accurate, the PDE residual should likewise be zero.
This nicely integrates with the objective for training a neural network: we can train for
minimizing this residual in combination with direct loss terms.
Similar to before, we can use pre-computed solutions
$[x_0,y_0], ...[x_n,y_n]$ for $\mathbf{u}$ with $\mathbf{u}(\mathbf{x})=y$ as constraints
in addition to the residual terms.
In addition to relying on the residual, we can use pre-computed solutions
$[x_0,y_0], ...[x_n,y_n]$ for $\mathbf{u}$ with $\mathbf{u}(\mathbf{x})=y$ as targets.
This is typically important, as most practical PDEs do not have unique solutions
unless initial and boundary conditions are specified. Hence, if we only consider $R$ we might
get solutions with random offset or other undesirable components. The supervised sample points
@@ -51,19 +52,22 @@ where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the s
the residual term, respectively. We could of course add additional residual terms with suitable scaling factors here.
It is instructive to note what the two different terms in equation {eq}`physloss-training` mean: The first term is a conventional, supervised L2-loss. If we were to optimize only this loss, our network would learn to approximate the training samples well, but might average multiple modes in the solutions, and do poorly in regions in between the sample points.
If we, instead, were to optimize only the second term (the physical residual), our neural network might be able to locally satisfy the PDE, but still could produce solutions that are still far away from our training data. This can happen due to "null spaces" in the solutions, i.e., different solutions that all satisfy the residuals.
If we, instead, were to optimize only the second term (the physical residual), our neural network might be able to locally satisfy the PDE, but
could have large difficulties finding a solution that fits globally.
This can happen due to "null spaces" in the solutions, i.e., different solutions that all satisfy the residuals. Then different local regions can converge
to different solutions, which in combination yield a very suboptimal one.
Therefore, we optimize both objectives simultaneously such that, in the best case, the network learns to approximate the specific solutions of the training data while still capturing knowledge about the underlying PDE.
Note that, similar to the data samples used for supervised training, we have no guarantees that the
residual terms $R$ will actually reach zero during training. The non-linear optimization of the training process
will minimize the supervised and residual terms as much as possible, but there is no guarantee. Large, non-zero residual
contributions can remain. We'll look at this in more detail in the upcoming code example, for now it's important
to remember that physical constraints in this way only represent _soft constraints_, without guarantees
to keep in mind that the physical constraints formulated this way only represent _soft constraints_, without guarantees
of minimizing these constraints.
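As a minimal sketch, the combined objective of equation {eq}`physloss-training` is just a weighted sum of the supervised term and the residual term. The arrays below are placeholders standing in for network outputs, pre-computed targets, and residual values, and the weights are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the combined objective: a supervised L2 term on sample
# points plus a physical residual term. Data, residuals, and the weights
# alpha0/alpha1 are placeholder assumptions for illustration.
rng = np.random.default_rng(1)
u_pred   = rng.normal(size=16)          # network outputs at sample points
u_target = rng.normal(size=16)          # pre-computed solution values y_i
R        = rng.normal(size=64) * 0.1    # PDE residual at collocation points

alpha0, alpha1 = 1.0, 0.1               # hyperparameters scaling the two terms
loss = alpha0 * np.mean((u_pred - u_target)**2) + alpha1 * np.mean(R**2)
print(loss)
```

In practice both terms are minimized jointly by the optimizer; the balance between $\alpha_0$ and $\alpha_1$ is a tuning choice.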
The previous overview did not really make clear how an NN produces $\mathbf{u}$.
We can distinguish two different approaches here:
via a chosen explicit representation of the target function (v1 in the following), or via using fully-connected neural networks to represent the solution (v2).
via a chosen explicit representation of the target function (v1 in the following), or with a _Neural field_ based on fully-connected neural networks to represent the solution (v2).
E.g., for v1 we could set up a _spatial_ grid (or graph, or a set of sample points), while in the second case no explicit representation exists, and the NN instead receives the _spatial coordinate_ to produce the solution at a query position.
We'll outline these two variants in more detail in the following.
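To make the v2 idea concrete, here is a minimal, untrained sketch of such a coordinate-based network in NumPy. Layer sizes and the random weights are illustrative assumptions; the actual examples later use a DL framework so that derivatives can be queried via autodiff:

```python
import numpy as np

# Sketch of a v2 "neural field": a small fully-connected network mapping a
# space-time coordinate (x, t) to the solution value u(x, t). Weights are
# random here; training them is the subject of the following sections.
rng = np.random.default_rng(0)

def init_mlp(sizes):
    return [(rng.normal(0, 1/np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def u_field(coords, params):
    h = coords                      # shape (batch, 2) for (x, t)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)          # smooth activation -> smooth field
    return h

params = init_mlp([2, 32, 32, 1])   # illustrative layer sizes
xt = np.stack([np.linspace(-1, 1, 5), np.full(5, 0.5)], axis=-1)
u = u_field(xt, params)             # query u at 5 positions for t = 0.5
print(u.shape)                      # one value per query coordinate
```

Note that no grid exists anywhere: the network is simply evaluated at whatever coordinates we query.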
@@ -96,30 +100,28 @@ To learn this decomposition, we can approximate $p$ with a CNN on our computatio
$\nabla \cdot \big( \mathbf{u}(0) - \nabla f(\mathbf{u}(0);\theta) \big)$.
To implement this residual, all we need to do is provide the divergence operator $(\nabla \cdot)$ of $\mathbf u$ on our computational mesh. This is typically easy to do via
a convolutional layer in the DL framework that contains the finite difference weights for the divergence.
Nicely enough, in this case we don't even need additional supervised samples, and can typically purely train with this residual formulation. Also, in contrast to variant 2 below, we can directly handle fairly large spaces of solutions here (we're not restricted to learning single solutions)
Nicely enough, in this case we don't even need additional supervised samples, and can typically purely train with this residual formulation. Also, in contrast to variant 2 below, we can directly handle fairly large spaces of solutions here (we're not restricted to learning single solutions).
An example implementation can be found in this [code repository](https://github.com/tum-pbs/CG-Solver-in-the-Loop).
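As a small illustration of the key ingredient of variant 1, the following NumPy sketch applies a central-difference divergence stencil (the same fixed weights a convolutional layer would carry) on a regular grid. The divergence-free test field built from a stream function is an assumption chosen for this check:

```python
import numpy as np

# Divergence via periodic central differences, equivalent to a fixed
# 3-point convolution stencil. Checked on an analytically divergence-free
# field u = (d psi/dy, -d psi/dx) with psi = sin(x) sin(y).
n = 64
x = np.linspace(0, 2*np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing='ij')
dx = x[1] - x[0]

ux =  np.sin(X) * np.cos(Y)
uy = -np.cos(X) * np.sin(Y)

def divergence(ux, uy, dx):
    dudx = (np.roll(ux, -1, 0) - np.roll(ux, 1, 0)) / (2*dx)
    dvdy = (np.roll(uy, -1, 1) - np.roll(uy, 1, 1)) / (2*dx)
    return dudx + dvdy

div = divergence(ux, uy, dx)
print(np.abs(div).max())  # numerically zero for this divergence-free field
```

In the variant-1 training, exactly such an operator is applied to $\mathbf{u}(0) - \nabla f(\mathbf{u}(0);\theta)$ to obtain the residual.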
Overall, this variant 1 has a lot in common with _differentiable physics_ training (it's basically a subset). As we'll discuss differentiable physics in a lot more detail
in {doc}`diffphys` and after, we'll focus on direct NN representations (variant 2) from now on.
Overall, this variant 1 has a lot in common with _differentiable physics_ training (it's basically a subset) that will be covered with a lot more detail in {doc}`diffphys`. Hence, we'll focus a bit more on direct NN representations (variant 2) in this chapter.
---
## Variant 2: Derivatives from a neural network representation
The second variant of employing physical residuals as soft constraints
instead uses fully connected NNs to represent $\mathbf{u}$. This _physics-informed_ approach was popularized by Raissi et al. {cite}`raissi2019pinn`, and has some interesting pros and cons that we'll outline in the following. We will target this physics-informed version (variant 2) in the following code examples and discussions.
instead uses fully connected NNs to represent $\mathbf{u}$. This _physics-informed_ (PINN) approach was popularized by Raissi et al. {cite}`raissi2019pinn`, and has some interesting pros and cons that we'll outline in the following. By now, this approach can be seen as part of the _Neural field_ representations that e.g. also include NeRFs and learned signed distance functions.
The central idea here is that the aforementioned general function $f$ that we're after in our learning problems
The central idea with Neural fields is that the aforementioned general function $f$ that we're after
can also be used to obtain a representation of a physical field, e.g., a field $\mathbf{u}$ that satisfies $R=0$. This means $\mathbf{u}(\mathbf{x})$ will
be turned into $\mathbf{u}(\mathbf{x}, \theta)$ where we choose the NN parameters $\theta$ such that a desired $\mathbf{u}$ is
represented as precisely as possible.
represented as precisely as possible, and $\mathbf{u}$ simply returns the right value at spatial location $\mathbf{x}$.
One nice side effect of this viewpoint is that NN representations inherently support the calculation of derivatives.
One nice side effect of this viewpoint is that NN representations inherently support the calculation of derivatives w.r.t. inputs.
The derivative $\partial f / \partial \theta$ was a key building block for learning via gradient descent, as explained
in {doc}`overview`. Now, we can use the same tools to compute spatial derivatives such as $\partial \mathbf{u} / \partial x$,
in {doc}`overview`. Now, we can use the same tools to compute spatial derivatives such as $\partial \mathbf{u} / \partial x = \partial f / \partial x$.
Note that above for $R$ we've written this derivative in the shortened notation as $\mathbf{u}_{x}$.
For functions over time this of course also works for $\partial \mathbf{u} / \partial t$, i.e. $\mathbf{u}_{t}$ in the notation above.
For functions over time this of course also works by adding $t$ as input to compute $\partial \mathbf{u} / \partial t$, i.e. $\mathbf{u}_{t}$ in the notation above.
```{figure} resources/physloss-overview-v2.jpg
---
@@ -139,14 +141,22 @@ To pick a simple example, Burgers equation in 1D,
$\frac{\partial u}{\partial{t}} + u \nabla u = \nu \nabla \cdot \nabla u $ , we can directly
formulate a loss term $R = \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2}$ that should be minimized as much as possible at training time. For each of the terms, e.g. $\frac{\partial u}{\partial x}$,
we can simply query the DL framework that realizes $u$ to obtain the corresponding derivative.
For higher order derivatives, such as $\frac{\partial^2 u}{\partial x^2}$, we can simply query the derivative function of the framework multiple times. In the following section, we'll give a specific example of how that works in tensorflow.
For higher order derivatives, such as $\frac{\partial^2 u}{\partial x^2}$, we can query the derivative function of the framework multiple times.
In the following section, we'll give a specific example of how that works in TensorFlow.
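As a sanity check of the residual itself, independent of any network, the following NumPy sketch evaluates $R$ with finite differences in place of autodiff. The exact traveling-wave solution of Burgers equation and the step sizes are assumptions for this illustration only:

```python
import numpy as np

# Burgers residual R = u_t + u u_x - nu u_xx, evaluated numerically for the
# exact traveling-wave solution u(x,t) = 1 - tanh((x - t)/(2 nu)).
# R should vanish for this exact solution.
nu = 0.1
u = lambda x, t: 1.0 - np.tanh((x - t) / (2.0 * nu))

x = np.linspace(-2.0, 2.0, 41)
t, h = 0.5, 1e-4
u_t  = (u(x, t + h) - u(x, t - h)) / (2 * h)       # time derivative
u_x  = (u(x + h, t) - u(x - h, t)) / (2 * h)       # first spatial derivative
u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2  # second derivative
R = u_t + u(x, t) * u_x - nu * u_xx
print(np.abs(R).max())  # close to zero for the exact solution
```

For a network-based $u$, each of these finite-difference quotients is replaced by a query to the framework's derivative function.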
## Summary so far
The approach above gives us a method to include physical equations into DL learning as a soft constraint: the residual loss.
Typically, this setup is suitable for _inverse problems_, where we have certain measurements or observations
for which we want to find a PDE solution. Because of the high cost of the reconstruction (to be
demonstrated in the following), the solution manifold shouldn't be overly complex. E.g., it is typically not possible
to capture a wide range of solutions, such as with the previous supervised airfoil example, by only using a physical residual loss.
While v1 relies on an inductive bias in the form of a discretization, v2 relies on derivatives computed via autodiff.
Typically, v2 is especially suitable for _inverse problems_, where we have certain measurements or observations
for which we want to find a PDE solution.
Because of the ill-posedness of the optimization and learning problem,
and the high cost of the reconstruction (to be
demonstrated in the following), the solution manifold shouldn't be overly complex for these PINN approaches.
E.g., it is typically very difficult to capture time dependence or a wide range of solutions,
such as with the previous supervised airfoil example.
Next, we'll demonstrate these concepts with code: first, we'll show how learning the Helmholtz decomposition works out in
practice with a **v1**-approach. Afterwards, we'll illustrate the **v2** PINN-approaches with a practical example.

1180
probmodels-ddpm-fm.ipynb Normal file

File diff suppressed because one or more lines are too long

743
probmodels-diffusion.ipynb Normal file

File diff suppressed because one or more lines are too long

26
probmodels-discuss.md Normal file

@@ -0,0 +1,26 @@
Discussion of Probabilistic Learning
=======================
As the previous sections have demonstrated, probabilistic learning offers a wide range of very exciting possibilities in the context of physics-based learning. First, these methods come with a highly interesting and well developed theory. Surprisingly, some parts of this theory are actually better developed than basic questions about simpler learning approaches.
At the same time, they enable a fundamentally different way to work with simulations: they provide a simple way to work with complex distributions of solutions. This is of huge importance for inverse problems, e.g. in the context of obtaining likelihood-based estimates for _simulation-based inference_.
![Divider](resources/divider-gen1.jpg)
That being said, diffusion-based approaches show relatively few advantages for deterministic settings: they are not more accurate, and typically induce slightly larger computational costs. An interesting exception is the long-term stability, as discussed in {doc}`probmodels-uncond`. To summarize the key aspects of probabilistic deep learning approaches:
✅ Pro:
- Enable training and inference for distributions
- Well developed theory
- Stable training
❌ Con:
- (Slightly) increased inference cost
- No real advantage for deterministic settings
One more concluding recommendation: if your problem contains ambiguities, diffusion modeling in the form of _flow matching_ is the method of choice. If your data contains reliable input-output pairs, go with simpler _deterministic training_ instead.
![Divider](resources/divider-gen3.jpg)
Next, we can turn to a new viewpoint on learning problems, the field of _reinforcement learning_. As the next sections will point out, it is actually not so different from the topics of the previous chapters despite the new viewpoint.

0
probmodels-dppm-fm.ipynb Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

201
probmodels-graph.md Normal file

@@ -0,0 +1,201 @@
Graph-based Diffusion Models
=======================
Similar to classical numerics, regular grids are ideal for certain situations, but sub-optimal for others. Diffusion models are no different, but luckily the concepts of the previous sections do carry over when replacing the regular grids with graphs. Importantly, denoising and flow matching work similarly well on unstructured Eulerian meshes, as will be demonstrated below. This test case will illustrate another important aspect: diffusion models excel at _completing_ data distributions. I.e., even when the training data has an incomplete distribution for a single example (defined by the geometry of the physical domain, boundary conditions and physical parameters), the "global" view of learning from different examples lets the networks _complete_ the posterior distribution over the course of seeing partial data for many different examples.
Simulation problems like fluid flows are often poorly represented by a single mean solution. E.g., for many practical applications involving turbulence, it is crucial to **access the full distribution of possible flow states**, from which relevant statistics (e.g., RMS and two-point correlations) can be derived. This is where diffusion models can leverage their strengths: instead of having to simulate a lengthy transient phase to converge towards an equilibrium state, diffusion models can completely skip the transient warm-up, and directly produce the desired samples. Hence, this allows for computing the relevant flow statistics very efficiently compared to classic solvers.
## Diffusion Graph Net (DGN)
In the following, we'll demonstrate these capabilities based on the _diffusion graph net_ (DGN) approach {cite}`lino2025dgn`, the full source code for which [can be found here](https://github.com/tum-pbs/dgn4cfd/).
To learn the probability distribution of dynamical states of physical systems, defined by their discretization mesh and their physical parameters, the DDPM and flow matching frameworks can directly be applied to the mesh nodes. Additionally, DGN introduces a second model variant, which operates in a pre-trained semantic _latent space_ rather than directly in the physical space (this variant will be called LDGN).
In contrast to relying on regular grid discretizations as in previous sections, the system's geometry is now represented using a mesh with nodes $\mathcal{V}_M$ and edges ${\mathcal{E}}_M$, where each node $i$ is located at ${x}_i$. The system's state at time $t$, ${Y}(t)$, is defined by $F$ continuous fields sampled at the mesh nodes: ${Y}(t) := \{ {y}_i(t) \in \mathbb{R}^{F} \ | \ i \in {\mathcal{V}}_M \}$, with the short form ${y}_i(t) \equiv {y}({x}_i,t)$. Simulators evolve the system through a sequence of states, $\mathcal{Y} = \{{Y}(t_0), {Y}(t_1), \dots, {Y}(t_n), \dots \}$, starting from an initial state ${Y}(t_0)$.
We assume that after an initial transient phase, the system reaches a statistical equilibrium. In this stage, statistical measures of ${Y}$, computed over sufficiently long time intervals, are time-invariant, even if the dynamics display oscillatory or chaotic behavior. The states in the equilibrium stage, ${\mathcal{Z}} \subset {\mathcal{Y}}$, depend only on the system's geometry and physical parameters, and not on its initial state. This is illustrated in the following picture.
```{figure} resources/probmodels-graph-over.jpg
---
height: 180px
name: probmodels-graph-over
---
(a) DGN learns the probability distribution of the systems' converged states provided only a short trajectory of length $\delta \ll T$ per system. (b) An example with a turbulent wing experiment. The distribution learned by the LDGN model accurately captures the variance of all states (bottom right), despite seeing only an incomplete distribution for each wing during training (top right).
```
In many engineering applications, such as aerodynamics and structural vibrations, the primary focus is not on each individual state along the trajectory, but rather on the statistics that characterize the system's dynamics. However, simulating a trajectory of converged states $\mathcal{Z}$ long enough to accurately capture these statistics can be very expensive, especially for real-world problems involving 3D chaotic systems. The DGN approach described in the following aims to directly sample converged states ${Z}(t) \in \mathcal{Z}$ without simulating the initial transient phase. Subsequently, we can analyze the system's dynamics by drawing enough samples.
Given a dataset of short trajectories from $N$ systems, $\mathfrak{Z} = \{\mathcal{Z}_1, \mathcal{Z}_2, ..., \mathcal{Z}_N\}$, the goal in the following is to learn a probabilistic model of $\mathfrak{Z}$ that enables sampling of a converged state ${Z}(t) \in \mathcal{Z}$, conditioned on the system's mesh, boundary conditions, and physical parameters. This model must capture the underlying probability distributions even when trained on trajectories that are too short to fully characterize their individual statistics. Although this is an ill-posed problem, given sufficient training trajectories, diffusion models on graphs manage to uncover the statistical correlations and shared patterns, enabling interpolation across the condition space.
## Diffusion on Graphs
We'll use DDPM (and later flow matching) to generate states ${Z}(t)$ by denoising a sample ${Z}^R \in \mathbb{R}^{|\mathcal{V}_M| \times F}$ drawn from an isotropic Gaussian distribution. The system's conditional information is encoded in a directed graph ${\mathcal{G}} :=({\mathcal{V}}, {\mathcal{E}})$, where ${\mathcal{V}} \equiv {\mathcal{V}}_M$ and the mesh edges ${\mathcal{E}}_M$ are represented as bi-directional graph edges ${\mathcal{E}}$. Node attributes ${V}_c = \{{v}_{i}^c \ | \ i \in {\mathcal{V}} \}$ and edge attributes ${E}_c = \{{e}_{ij}^c \ | \ (i,j) \in {\mathcal{E}} \}$ encode the conditional features, including the relative positions between adjacent nodes, ${x}_j - {x}_i$.
In the *diffusion* (or *forward*) process, node features from ${Z}^1 \in \mathbb{R}^{|\mathcal{V}| \times F}$ to ${Z}^R \in \mathbb{R}^{|\mathcal{V}| \times F}$ are generated by sequentially adding Gaussian noise:
$$
q({Z}^r|{Z}^{r-1})=\mathcal{N}({Z}^r; \sqrt{1-\beta_r} {Z}^{r-1}, \beta_r \mathbf{I}),
$$
where $\beta_r \in (0,1)$, and $Z^0 \equiv Z(t)$. Any ${Z}^r$ can be sampled directly via:
$$
{Z}^r = \sqrt{\bar{\alpha}_r} {Z}^0 + \sqrt{1-\bar{\alpha}_r} {\epsilon},
$$
with $\alpha_r := 1 - \beta_r$, $\bar{\alpha}_r := \prod_{s=1}^r \alpha_s$ and ${\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
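A minimal NumPy sketch of this closed-form sampling, with an assumed linear $\beta_r$ schedule and illustrative node/feature counts:

```python
import numpy as np

# Closed-form forward process on mesh-node features:
# Z^r = sqrt(abar_r) Z^0 + sqrt(1 - abar_r) eps.
# Schedule, step count, and sizes are illustrative assumptions.
rng = np.random.default_rng(0)
R_steps = 100
beta = np.linspace(1e-4, 0.02, R_steps)   # beta_r in (0, 1)
alpha = 1.0 - beta
abar = np.cumprod(alpha)                  # abar_r = prod_s alpha_s

Z0  = rng.normal(size=(500, 3))           # |V| = 500 nodes, F = 3 fields
eps = rng.normal(size=Z0.shape)
r = 60
Zr = np.sqrt(abar[r]) * Z0 + np.sqrt(1.0 - abar[r]) * eps

# Given the true eps, the closed form can be inverted exactly:
Z0_rec = (Zr - np.sqrt(1.0 - abar[r]) * eps) / np.sqrt(abar[r])
print(np.abs(Z0_rec - Z0).max())  # ~0: forward map is invertible given eps
```

The denoising network of course never sees the true ${\epsilon}$; it has to predict it, as described next.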
The denoising process removes noise through learned Gaussian transitions:
$$
p_\theta({Z}^{r-1}|{Z}^r) =\mathcal{N} ({Z}^{r-1}; {\mu}_\theta^r, {\Sigma}_\theta^r),
$$
where the mean and variance are parameterized as:
$$
\begin{aligned}
{\mu}_\theta^r = \frac{1}{\sqrt{\alpha_r}} \left( {Z}^r - \frac{\beta_r}{\sqrt{1-\bar{\alpha}_r}} {\epsilon}_\theta^r \right),
\qquad
{\Sigma}_\theta^r = \exp\left( \mathbf{v}_\theta^r \log \beta_r + (1-\mathbf{v}_\theta^r)\log \tilde{\beta}_r \right),
\end{aligned}
$$
with $\tilde{\beta}_r := (1 - \bar{\alpha}_{r-1}) / (1 - \bar{\alpha}_r) \beta_r$. Here, ${\epsilon}_\theta^r \in \mathbb{R}^{|\mathcal{V}| \times F}$ predicts the noise ${\epsilon}$ from the forward-process sampling equation above, and $\mathbf{v}_\theta^r \in \mathbb{R}^{|\mathcal{V}| \times F}$ interpolates between the two bounds of the process' entropy, $\beta_r$ and $\tilde{\beta}_r$.
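The mean parameterization can be checked numerically: if the true noise ${\epsilon}$ is plugged in for ${\epsilon}_\theta^r$, then ${\mu}_\theta^r$ coincides with the mean of the closed-form posterior $q({Z}^{r-1}|{Z}^r,{Z}^0)$. A sketch with an assumed schedule and illustrative sizes:

```python
import numpy as np

# Consistency check of the denoising mean: with the true eps standing in
# for eps_theta, mu_theta^r equals the posterior mean of q(Z^{r-1}|Z^r,Z^0).
rng = np.random.default_rng(1)
beta = np.linspace(1e-4, 0.02, 100)       # assumed linear schedule
alpha = 1.0 - beta
abar = np.cumprod(alpha)

Z0  = rng.normal(size=(200, 2))
eps = rng.normal(size=Z0.shape)
r = 40                                    # some intermediate step, r >= 1
Zr = np.sqrt(abar[r]) * Z0 + np.sqrt(1 - abar[r]) * eps

# mu_theta^r with the true noise plugged in
mu_theta = (Zr - beta[r] / np.sqrt(1 - abar[r]) * eps) / np.sqrt(alpha[r])
# closed-form posterior mean of q(Z^{r-1} | Z^r, Z^0)
mu_post = (np.sqrt(abar[r-1]) * beta[r] * Z0
           + np.sqrt(alpha[r]) * (1 - abar[r-1]) * Zr) / (1 - abar[r])
print(np.abs(mu_theta - mu_post).max())  # ~0: the parameterizations agree
```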
DGNs predict ${\epsilon}_\theta^r$ and $\mathbf{v}_\theta^r$ using a regular message-passing-based GNN {cite}`sanchez2020learning`. This takes ${Z}^{r-1}$ as input, and it is conditioned on the graph ${\mathcal{G}}$, its node and edge features, and the diffusion step $r$:
$$
\begin{aligned}
[{\epsilon}_\theta^r, \mathbf{v}_\theta^r] \leftarrow \text{{DGN}}_\theta({Z}^{r-1}, {\mathcal{G}}, {V}_c, {E}_c, r).
\end{aligned}
$$
The _DGN_ network is trained using the hybrid loss function proposed in *"Improved Denoising Diffusion Probabilistic Models"* by Nichol et al. The full denoising process requires $R$ evaluations of the DGN to transition from ${Z}^R$ to ${Z}^0$.
DGN follows the widely used encoder-processor-decoder GNN architecture. In addition to the node and edge encoders, the encoder includes a diffusion-step encoder, which generates a vector ${r}_\text{emb} \in \mathbb{R}^{F_\text{emb}}$ that embeds the diffusion step $r$. The node encoder processes the conditional node features ${v}_i^c$, alongside ${r}_\text{emb}$. Specifically, the diffusion-step encoder and the node encoder operate as follows:
$$
\begin{aligned}
{r}_\text{emb} \leftarrow
\phi \circ {\small \text{Linear}} \circ {\small \text{SinEmb}} (r),
\quad
{v}_i \leftarrow {\small \text{Linear}} \left( \left[ \phi \circ {\small \text{Linear}} ({v}_i^c) \ | \ {r}_\text{emb}
\right] \right),
\quad \forall i \in \mathcal{V},
\end{aligned}
$$
where $\phi$ denotes the activation function and ${\small \text{SinEmb}}$ is the sinusoidal embedding function. The edge encoder applies a linear layer to the conditional edge features ${e}_{ij}^c$.
The encoded node and edge features are $\mathbb{R}^{F_{h}}$-dimensional vectors ($F_\text{emb} = 4 \times F_h$). We condition each message-passing layer on $r$ by projecting ${r}_\text{emb}$ to an $F_{h}$-dimensional space and adding the result to the node features before each of these layers — i.e., ${v}_i \leftarrow {v}_i + {\small \text{Linear}}({r}_\text{emb})$. Each message-passing layer follows:
$$
\begin{aligned}
\mathbf{e}_{ij} &\leftarrow W_e \mathbf{e}_{ij} + \text{MLP}^e \left( \text{LN} \left([\mathbf{e}_{ij}|\mathbf{v}_{i}|\mathbf{v}_{j}] \right) \right), \qquad \forall (i,j) \in \displaystyle \mathcal{E},\\
\bar{\mathbf{e}}_{j} &\leftarrow \sum_{i \in \mathcal{N}^-_j} \mathbf{e}_{ij}, \qquad \forall j \in \displaystyle \mathcal{V},\\
\mathbf{v}_j &\leftarrow W_v \mathbf{v}_j + \text{MLP}^v \left( \text{LN} \left( [\bar{\mathbf{e}}_{j} | \mathbf{v}_j]\right) \right), \qquad \forall j \in \displaystyle \mathcal{V}.
\end{aligned}
$$
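A compact NumPy sketch of one such message-passing layer on a toy graph. Single linear layers with a ReLU stand in for the MLPs, a simple normalization stands in for LN, and all weights are random, illustrative assumptions:

```python
import numpy as np

# One message-passing layer following the three update equations:
# edge update from [e_ij | v_i | v_j], sum-aggregation over incoming
# edges, node update from [ebar_j | v_j].
rng = np.random.default_rng(0)
F = 8                                     # hidden feature size F_h
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # bi-directional graph edges
V = rng.normal(size=(3, F))               # node features v_i
E = rng.normal(size=(len(edges), F))      # edge features e_ij

ln = lambda z: (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + 1e-6)
relu = lambda z: np.maximum(z, 0.0)
We, Wv = rng.normal(size=(F, F)), rng.normal(size=(F, F))
Me, Mv = rng.normal(size=(3 * F, F)), rng.normal(size=(2 * F, F))
src = [i for i, j in edges]
dst = [j for i, j in edges]

# edge update: e_ij <- W_e e_ij + MLP^e(LN([e_ij | v_i | v_j]))
E = E @ We + relu(ln(np.concatenate([E, V[src], V[dst]], axis=-1))) @ Me
# aggregation: ebar_j <- sum of e_ij over incoming edges (i, j)
Ebar = np.zeros_like(V)
np.add.at(Ebar, dst, E)
# node update: v_j <- W_v v_j + MLP^v(LN([ebar_j | v_j]))
V = V @ Wv + relu(ln(np.concatenate([Ebar, V], axis=-1))) @ Mv
print(V.shape)  # updated node features, one row per node
```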
Previous work on graph-based diffusion models has used sequential message passing to propagate node features across the graph. However, this approach fails for large-scale phenomena, such as the flows studied in the context of DGN, as denoising of global features becomes bottlenecked by the limited reach of message passing.
To address this, a multi-scale GNN is adopted for the processor, applying message passing on ${\mathcal{G}}$ and multiple coarsened versions of it in a U-Net fashion. This design leverages the U-Net's effectiveness in removing both high- and low-frequency noise. To obtain each lower-resolution graph from its higher-resolution counterpart, we use Guillard's coarsening algorithm, originally developed for fast mesh coarsening in CFD applications. As in the conventional U-Net, pooling and unpooling operations, now based on message passing, are used to transition between higher- and lower-resolution graphs.
```{figure} resources/probmodels-graph-pooling.jpg
---
height: 200px
name: probmodels-graph-pooling
---
Message passing is applied on ${\mathcal{G}}$ and multiple coarsened versions of it in a U-Net fashion. The lower-resolution graphs are obtained using a mesh coarsening algorithm popularized in CFD applications.
```
## Diffusion in Latent Space
Diffusion models can also operate in a lower-dimensional graph-based representation that is perceptually equivalent to $\mathfrak{Z}$. This space is defined as the latent space of a Variational Graph Auto-Encoder (VGAE) trained to reconstruct ${Z}(t)$. We'll refer to a DGN trained on this latent space as a Latent DGN (LDGN).
```{figure} resources/probmodels-graph-arch.jpg
---
height: 220px
name: probmodels-graph-arch
---
(a) The VGAE consists of a condition encoder, a (node) encoder, and a (node) decoder. The multi-scale latent features from the condition encoder serve as conditioning inputs to both the encoder and the decoder. (b) During LDGN inference, Gaussian noise is sampled in the VGAE latent space and, after multiple denoising steps conditioned on the low-resolution outputs from the VGAE's condition encoder, transformed into the physical space by the VGAE's decoder.
```
In this configuration, the VGAE captures high-frequency information (e.g., spatial gradients and small vortices), while the LDGN focuses on modeling mid- to large-scale patterns (e.g., the wake and vortex street). By decoupling these two tasks, the generative learning process is simplified, allowing the LDGN to concentrate on more meaningful latent representations that are less sensitive to small-scale fluctuations. Additionally, during inference, the VGAE's decoder helps remove residual noise from the samples generated by the LDGN. This approach significantly reduces sampling costs since the LDGN operates on a smaller graph rather than directly on ${\mathcal{G}}$.
For the VGAE, an encoder-decoder architecture is used with an additional condition encoder to handle conditioning inputs. The condition encoder processes ${V}_c$ and ${E}_c$, encoding these into latent node features ${V}^\ell_c$ and edge features ${E}^\ell_c$ across $L$ graphs $\{{\mathcal{G}}^\ell := ({\mathcal{V}}^\ell, {\mathcal{E}}^\ell) \mid 1 \leq \ell \leq L\}$, where ${\mathcal{G}}^1 \equiv {\mathcal{G}}$ and the size of the graphs decreases progressively, i.e., $|{\mathcal{V}}^1| > |{\mathcal{V}}^2| > \dots > |{\mathcal{V}}^L|$. This transformation begins by linearly projecting ${V}_c$ and ${E}_c$ to a $F_\text{ae}$-dimensional space and applying two message-passing layers to yield ${V}^1_c$ and ${E}^1_c$. Then, $L-1$ encoding blocks are applied sequentially:
$$
\begin{aligned}
\left[{V}^{\ell+1}_c, {E}^{\ell+1}_c \right] \leftarrow {\small MP} \circ {\small MP} \circ {\small \text{GraphPool}} \left({V}^\ell_c, {E}^\ell_c \right), \quad \text{for} \ \ell = 1, 2, \dots, L-1,
\end{aligned}
$$
where _MP_ denotes a message-passing layer and _GraphPool_ denotes a graph-pooling layer.
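To make the structure of these blocks concrete, here is a minimal numpy sketch of one sum-aggregation message-passing step and a mean-based graph pooling. The aggregation rules, feature sizes, and the toy cluster assignment are illustrative assumptions, not the actual (L)DGN layers:

```python
import numpy as np

def message_passing(V, E, edges):
    # V: (num_nodes, F) node features; E: (num_edges, F) edge features
    # edges: (num_edges, 2) array of (sender, receiver) node indices
    senders, receivers = edges[:, 0], edges[:, 1]
    # edge update from the two incident nodes (sum as a simple aggregator)
    E_new = E + V[senders] + V[receivers]
    # node update: aggregate incoming edge messages at each receiver
    V_new = V.copy()
    np.add.at(V_new, receivers, E_new)
    return V_new, E_new

def graph_pool(V, cluster):
    # cluster[i] = index of the coarse node that fine node i maps to
    num_coarse = cluster.max() + 1
    V_coarse = np.zeros((num_coarse, V.shape[1]))
    counts = np.zeros(num_coarse)
    np.add.at(V_coarse, cluster, V)
    np.add.at(counts, cluster, 1.0)
    return V_coarse / counts[:, None]   # mean pooling

# toy graph: 4 nodes with 2 features, 3 edges, pooled into 2 coarse nodes
V = np.ones((4, 2))
E = np.zeros((3, 2))
edges = np.array([[0, 1], [1, 2], [2, 3]])
V1, E1 = message_passing(V, E, edges)
Vc = graph_pool(V1, cluster=np.array([0, 0, 1, 1]))
```

Stacking `message_passing` twice followed by `graph_pool` mirrors the structure of one encoding block above, just without learned weights.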
The encoder produces two $F_L$-dimensional vectors for each node $i \in {\mathcal{V}}^L$, the mean ${\mu}_i$ and standard deviation ${\sigma}_i$ that parametrize a Gaussian distribution over the latent space. It takes as input a state ${Z}(t)$, which is linearly projected to a $F_\text{ae}$-dimensional vector space and then passed through $L-1$ sequential down-sampling blocks (message passing + graph pooling), each conditioned on the outputs of the condition encoder:
$$
\begin{aligned}
{V} \leftarrow {\small \text{GraphPool}} \circ {\small MP} \circ {\small MP} \left( {V} + {\small \text{Linear}}\left({V}^\ell_c \right), {\small \text{Linear}}\left({E}^\ell_c \right) \right), \ \text{for} \ \ell = 1, 2, \dots, L-1;
\end{aligned}
$$
and a bottleneck block:
$$
\begin{aligned}
{V} \leftarrow {\small MP} \circ {\small MP} \left( {V} + {\small \text{Linear}}\left({V}^L_c \right), {\small \text{Linear}}\left({E}^L_c \right) \right).
\end{aligned}
$$
The output features are passed through a node-wise MLP that returns ${\mu}_i$ and ${\sigma}_i$ for each node $i \in {\mathcal{V}}^L$. The latent variables are then computed as ${\zeta}_i = {\small \text{BatchNorm}}({\mu}_i + {\sigma}_i {\epsilon}_i)$, where ${\epsilon}_i \sim \mathcal{N}(0, {I})$. Finally, the decoder mirrors the encoder, employing a symmetric architecture (replacing graph pooling with graph unpooling layers) to upsample the latent features back to the original graph ${\mathcal{G}}$. Its blocks are also conditioned on the outputs of the condition encoder. The message passing and the graph pooling and unpooling layers in the VGAE are the same as in the (L)DGN.
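The reparameterization step above can be sketched in a few lines of numpy; the feature sizes are arbitrary, and the simple per-feature normalization is an illustrative stand-in for the actual BatchNorm layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latents(mu, sigma):
    # reparameterization trick: zeta = mu + sigma * eps with eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    zeta = mu + sigma * eps
    # per-feature normalization, a stand-in for the BatchNorm layer
    return (zeta - zeta.mean(axis=0)) / (zeta.std(axis=0) + 1e-6)

mu = np.zeros((100, 4))       # per-node means on the coarse graph
sigma = np.ones((100, 4))     # per-node standard deviations
zeta = sample_latents(mu, sigma)
```

Sampling via `mu + sigma * eps` instead of drawing from $\mathcal{N}(\mu, \sigma^2)$ directly keeps the operation differentiable w.r.t. $\mu$ and $\sigma$, which is what allows the VGAE to be trained end-to-end.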
The VGAE is trained to reconstruct states ${Z}(t) \in \mathfrak{Z}$ with a KL-penalty towards a standard normal distribution on the learned latent space. Once trained, the LDGN can be trained following the DGN approach from the previous sections. However, the objective is now to learn the distribution of the latent states ${\zeta}$, defined on the coarse graph ${\mathcal{G}}^L$, conditioned on the outputs ${V}^L_c$ and ${E}^L_c$ from the condition encoder.
During inference, the condition encoder generates the conditioning features ${V}^\ell_c$ and ${E}^\ell_c$ (for $\ell = 1, 2, \dots, L$), and after the LDGN completes its denoising steps, the decoder transforms the generated ${\zeta}_0$ back into the physical feature-space defined on ${\mathcal{G}}$.
Unlike in conventional VGAEs, the condition encoder is necessary because, at inference time, an encoding of ${V}_c$ and ${E}_c$ is needed on graph ${\mathcal{G}}^L$, where the LDGN operates. This encoding cannot be directly generated by the encoder, as it also requires ${Z}(t)$ as input, which is unavailable during inference. An alternative approach would be to define the conditions directly in the coarse representation of the system provided by ${\mathcal{G}}^L$, but this representation lacks fine-grained details, leading to sub-optimal results.
![Divider](resources/divider-gen4.jpg)
## Turbulent Flows around Wings in 3D
Let's directly turn to a complex case to illustrate the capabilities of DGN. (A more basic case will be studied in the Jupyter notebook on the following page.)
The Wing experiments of the DGN project target wings in 3D turbulent flow, characterized by detailed vortices that form and dissipate on the wing surface. This task is particularly challenging due to the high-dimensional, chaotic nature of turbulence and its inherent multi-scale interactions across a wide range of scales.
The geometry of the wings varies in terms of relative thickness, taper ratio, sweep angle, and twist angle.
These simulations are computationally expensive, and using GNNs allows us to concentrate computational effort on the wing's surface, avoiding the need for costly volumetric fields. A regular grid around the wing would require over $10^5$ cells, in contrast to approximately 7,000 nodes for the surface mesh representation. The surface pressure can be used to determine both the aerodynamic performance of the wing and its structural requirements.
Fast access to the probabilistic distribution of these quantities would be highly valuable for aerodynamic modeling tasks.
The training dataset for this task was generated using Detached Eddy Simulation (DES) with OpenFOAM's PISO solver,
using 250 consecutive states shortly after the data-generating simulator reached statistical equilibrium.
This represents about **10%** of the states needed to achieve statistically stationary variance, thus the models are trained with a very partial view of each case.
## Distributional accuracy
A high accuracy for each sample does not necessarily imply that a model is learning the true distribution. In fact, these properties often conflict. For instance, in VGAEs, the KL-divergence penalty allows control over whether to prioritize sample quality or mode coverage.
To evaluate how well models capture the probability distribution of system states, we use the Wasserstein-2 distance. This metric can be computed in two ways: (i) by treating the distribution at each node independently and averaging the result across all nodes, or (ii) by considering the joint distribution across all nodes in the graph. These metrics are denoted as $W_2^\text{node}$ and $W_2^\text{graph}$, respectively. The node-level measure ($W_2^\text{node}$) provides insights into how accurately the model estimates point-wise statistics, such as the mean and standard deviation at each node. However, it does not penalize inaccurate spatial correlations, whereas the graph-wise measure ($W_2^\text{graph}$) does.
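For one-dimensional marginals, $W_2^\text{node}$ is straightforward to estimate from sorted samples (exact for empirical distributions with equal sample counts); $W_2^\text{graph}$, in contrast, requires a full optimal-transport solve over the joint distribution. A small numpy sketch of the node-level variant, with purely illustrative sample data:

```python
import numpy as np

def w2_node(samples_a, samples_b):
    # samples: (num_samples, num_nodes); the per-node 1D Wasserstein-2
    # distance is the RMS difference of the sorted samples (the empirical
    # quantile functions), averaged over all nodes
    a = np.sort(samples_a, axis=0)
    b = np.sort(samples_b, axis=0)
    per_node = np.sqrt(np.mean((a - b) ** 2, axis=0))
    return per_node.mean()

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=(2500, 50))   # "target" states
generated = rng.normal(0.0, 1.0, size=(2500, 50))   # "model" samples
dist = w2_node(reference, generated)
```

For unequal sample counts, the sorted arrays would need to be interpolated to a common set of quantiles first.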
To ensure stable results when computing these metrics, the target distribution is represented by 2,500 consecutive states, and the predicted one by 3,000 samples.
While the trajectories in the training data are long enough to capture the mean flow, they fall short of capturing the standard deviation, spatial correlations, or higher-order statistics. Despite these challenges, the DGN, and especially the LDGN, are capable of accurately learning the complete probability distributions of the training trajectories and of generating new distributions for both in- and out-of-distribution physical settings. The figure below shows a qualitative evaluation together with correlation measurements. Both DGN variants also fare much better than the _Gaussian-Mixture model_ baseline denoted as GM-GNN.
```{figure} resources/probmodels-graph-wing.jpg
---
height: 220px
name: probmodels-graph-wing
---
(a) The _Wing_ task targets pressure distributions on a wing in 3D turbulent flow. (b) The standard deviation of the distribution generated by the LDGN is the closest to the ground truth (shown here in terms of correlation).
```
In terms of Wasserstein distance $W_2^\text{graph}$, the latent-space diffusion model also outperforms the others, with a distance of $\textbf{1.95 ± 0.89}$, while the DGN follows with $2.12 ± 0.90$, and the Gaussian mixture model gives $4.32 ± 0.86$.
## Computational Performance
Comparisons between runtimes of different implementations should always be taken with a grain of salt.
Nonetheless, for the Wing experiments, the ground-truth simulator, running on 8 CPU threads, required 2,989 minutes to simulate the initial transient phase plus 2,500 equilibrium states. This duration is just enough to obtain a well converged variance. In contrast, the LDGN model took only 49 minutes on 8 CPU threads and 2.43 minutes on a single GPU to generate 3,000 samples.
If we consider the generation of a single converged state (for use as an initial condition in another simulator, for example), the speedup is four orders of magnitude on the CPU, and five orders of magnitude on the GPU.
Thanks to its latent space, the LDGN model is not only more accurate, but also $8\times$ faster than the DGN model, while requiring only about 55% more training time.
These significant efficiency advantages suggest that graph-based diffusion models can be particularly valuable in scenarios where computational costs are otherwise prohibitive.
These results indicate that diffusion modeling in the context of unstructured simulations represents a significant step towards leveraging probabilistic methods in real-world engineering applications.
To highlight the aspects of DGN and its implementation, we now turn to a simpler test case that can be analyzed in detail within a Jupyter notebook.

Introduction to Probabilistic Learning
=======================
So far we've treated the target function $f(x)=y$ as being deterministic, with a unique solution $y$ for every input. That's certainly a massive simplification: in practice, solutions can be ambiguous, our learned model might mix things up, and both effects could show up in combination. This all calls for moving towards a probabilistic setting, which we'll address here. The machinery from previous sections will come in handy, as the probabilistic viewpoint essentially introduces another dimension for the problem. Instead of a single $y$, we now have a multitude of solutions drawn from a distribution $Y$, each with a probability $p_Y(y)$, often shortened to $p(y)$.
Samples $y \sim p(y)$ drawn from the distribution should follow this probability, so that we can distinguish rare and frequent cases.
To summarize, instead of individual solutions $y$ we're facing a large number of samples $y \sim p(y)$.
![Divider](resources/divider-gen-full.jpg)
## Uncertainty
All measurements, models, and discretizations that we are working with exhibit uncertainties. For measurements and observations, they typically appear in the form of measurement errors. Model equations, on the other hand, usually encompass only parts of a system we're interested in (leaving the remainder as an uncertainty), while for numerical simulations we inevitably introduce discretization errors. In the context of machine learning, we additionally have errors introduced by the trained model. All these errors and unclear aspects make up the uncertainties of the predicted outcomes, the _predictive uncertainty_. For practical applications it's crucial to have means for quantifying this uncertainty. This is a central motivation for working with probabilistic models, and for adjacent fields such as "uncertainty quantification" (UQ).
```{note} Aleatoric vs. Epistemic Uncertainty.
The predictive uncertainty in many cases can
be distinguished in terms of two types of uncertainty:
- _Aleatoric_ uncertainty denotes uncertainty within the data, e.g., noise in measurements.
- _Epistemic_ uncertainty, on the other hand, describes uncertainties within a model such as a trained neural network.
A word of caution is important here:
while this distinction seems clear cut, both effects overlap and can be difficult to tell apart. E.g., when facing discretization errors, uncertain outcomes could be caused by unknown ambiguities in the data, or by a suboptimal discrete representation.
These aspects can be very difficult to disentangle in practice.
```
Closely aligned, albeit taking a slightly different perspective, are so-called _simulation-based inference_ (SBI) methods. Here the main motivation is to estimate likelihoods in computer-based simulations, so that reliable probability distributions for the solutions can be obtained. The SBI viewpoint provides a methodological approach for working with computer simulations and uncertainties, and will provide a red thread for the following sections.
## Forward or Backward?
At this point it's important to revisit the central distinction between forward and inverse ("backward") problems: most classic numerical methods target ➡️ **forward** ➡️ problems to compute solutions for steady-state or future states of a system.
Forward problems arise in many settings, but across the board, at least as many problems are ⬅️ **inverse** ⬅️ problems, where a forward simulation plays a central role, but the main question is not a state that it generates, but rather the value of a simulator parameter that explains a certain measurement or observation. To formalize this, our simulator $f$ is parametrized by a set of inputs $\nu$, e.g., a viscosity, and takes states $x$ to produce a modified state $y$. We have an observation $\tilde{y}$ and are interested in the value of $\nu$ that produces the observation. In the easiest case this inverse problem can be tackled as the minimization problem
$\text{arg min}_{\nu} | f(x;\nu) - \tilde{y} |_2^2$. Solving it would tell us the viscosity of an observed material, and similar problems arise in pretty much all fields, from material science to cosmology. To simplify the notation, we'll merge $\nu$ into $x$, and minimize for $x$ correspondingly, but it's important to keep in mind that $x$ can encompass any set of parameters or state samples that we'd like to solve for with our inverse problem.
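As a tiny, self-contained illustration of this minimization view (with a made-up scalar "simulator" $f(x;\nu) = x e^{-\nu}$, not one from this book), gradient descent on $\nu$ recovers the parameter that explains the observation:

```python
import numpy as np

# toy differentiable "simulator": exponential decay of a state x with rate nu
def f(x, nu):
    return x * np.exp(-nu)

x0, nu_true = 2.0, 0.7
y_obs = f(x0, nu_true)        # the observation we'd like to explain

# solve arg min_nu |f(x; nu) - y_obs|^2 with analytic gradient descent
nu = 0.0
for _ in range(500):
    residual = f(x0, nu) - y_obs
    grad = residual * (-x0 * np.exp(-nu))   # d/dnu of 0.5 * residual**2
    nu -= 0.5 * grad
```

This deterministic solve yields a single point estimate of $\nu$; the probabilistic viewpoint of the following sections replaces it with a full distribution over plausible parameters.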
In the following, we will focus on inverse problems, as these best illustrate the capabilities of the probabilistic modeling, but the algorithms discussed are not exclusively applicable to inverse problems (an example will follow).
## Simulation-based Inference
For inverse problems, it is in practice not sufficient to match a single observation $\tilde{y}$. Rather, we'd like to ensure that the parameter we obtain explains a wide range of observations, and we might be interested in the possibility of multiple values explaining our observations. Similarly, quantifying the uncertainty of the estimate is important in real world settings: is the observation explained by only a very narrow range of parameters, or could the parameter vary by orders of magnitude without really influencing the observation? These questions require a statistical analysis, typically called _inference_, to draw conclusions about the results obtained from the inverse problem solve. To connect this viewpoint with the distinction regarding epistemic and aleatoric uncertainties above, we're primarily addressing the latter here: which uncertainties lie in our observations, given a scientific hypothesis in the form of a simulator.
To formalize these inverse problems let's consider a vector-valued input $x$ that can contain states and / or the aforementioned parameters (like $\nu$). We also have a distribution of latent variables $z \sim p(z|x)$ that describes the unknown part of our system. Examples for $z$ are unobservable stochastic variables, intermediate simulation steps, or the control flow of the simulator.
```{note} Bayes' theorem is fundamental for all of the following. For completeness, here it is: $p(x|y)~p(y) = p(y|x)~p(x)$. And it's worth keeping in mind that both sides are equivalent to the joint probabilities, i.e. $... = p(x,y) = p(y,x)$.
```
For $x$ there is a prior distribution $X$ with a probability density $p(x)$ for the inputs,
and the simulator produces an observation or output $y \sim p(y | x, z)$. Thus, $x$ can take different values, maybe it contains some noise, and $z$ is out of our control, and can likewise influence the $y$ that is produced.
The function for the conditional probability $p(y|x)$ is called the **likelihood** function, and is a crucial value in the following. Note that it does not depend on $z$, as these latent states are out of our control.
So we actually need to
compute the marginal likelihood $p(y|x) = \int p(y, z | x) dz$ by integrating over all possible $z$.
This is necessary because the likelihood function shouldn't depend on $z$, otherwise we'd need to know the exact values of $z$ before being able to calculate the likelihood.
Unfortunately, this is often intractable, as $z$ can be difficult to sample, and in some cases we can't even control it in a reasonable way.
Several algorithms have been proposed to compute likelihoods, a popular one being Approximate Bayesian Computation (ABC), but all approaches are highly expensive and require a lot of expert knowledge to set up. They suffer from the _curse of dimensionality_, i.e. become very expensive when facing larger numbers of degrees of freedom. Thus,
obtaining good approximations of the likelihood will be a topic that we'll revisit below.
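A brute-force way to see what this integral means: for a toy simulator $y = x + z + \text{noise}$ (an assumption purely for illustration), the marginal likelihood can be estimated by Monte Carlo, averaging $p(y|x,z)$ over samples of $z$:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy simulator: y = x + z + noise, with latent z ~ N(0, 1) out of our control
def likelihood_given_z(y, x, z, noise_std=0.1):
    # Gaussian density p(y|x,z) of the observation noise
    return np.exp(-0.5 * ((y - x - z) / noise_std) ** 2) / (
        noise_std * np.sqrt(2.0 * np.pi))

def marginal_likelihood(y, x, num_z=200_000):
    # p(y|x) = integral of p(y|x,z) p(z) dz, estimated via Monte Carlo over z
    z = rng.standard_normal(num_z)
    return likelihood_given_z(y, x, z).mean()

p = marginal_likelihood(y=1.0, x=0.0)
```

For this linear-Gaussian toy the marginal is available analytically ($y|x \sim \mathcal{N}(x, 1 + 0.1^2)$), which makes it easy to check the estimate; in realistic simulators neither the density of $z$ nor the integral is this convenient, which is exactly the intractability discussed above.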
![Divider](resources/divider-gen4.jpg)
With a function for the likelihood we can compute the
**distribution of the posterior**, the main quantity we're after,
in the following way:
$p(x|y) = \frac{p(y|x)p(x)}{\int p(y|x') p(x') dx'}$,
where the denominator
$\int p(y|x') p(x') dx'$ is called the _evidence_.
The evidence is just $p(y)$, which shows
that the equation for the posterior follows directly from Bayes' theorem $p(x|y) = p(y|x) p(x) / p(y)$.
The evidence can be computed with stochastic methods such as Markov Chain Monte Carlo (MCMC).
It primarily "normalizes" our posterior distribution and is typically easier to obtain than the likelihood, but nonetheless still a challenging term.
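On a discretized parameter space, all three ingredients (prior, likelihood, evidence) can be computed directly, which is a useful mental model before turning to learned estimators. A small numpy example with a Gaussian prior and a Gaussian likelihood, both chosen purely for illustration:

```python
import numpy as np

# discretized prior over the parameter x: an unnormalized N(0, 1) density
x_grid = np.linspace(-3.0, 3.0, 601)
prior = np.exp(-0.5 * x_grid**2)
prior /= prior.sum()

# toy likelihood p(y|x): Gaussian observation noise with std 0.5
y_obs = 1.0
likelihood = np.exp(-0.5 * ((y_obs - x_grid) / 0.5) ** 2)

# Bayes: posterior is proportional to likelihood * prior;
# the evidence p(y) is the sum that normalizes it
evidence = (likelihood * prior).sum()
posterior = likelihood * prior / evidence
```

The evidence here is just a sum over the grid; in high-dimensional spaces this is exactly the quantity that requires MCMC or learned estimators.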
```{admonition} Leveraging Deep Learning
:class: tip
This is where deep learning turns out to be extremely useful: we can use it to train a conditional density estimator $q_\theta(x|y)$ for the posterior $p(x|y)$ that allows sampling, and can be trained from simulations $y \sim p(y|x)$ alone.
```
Deep learning has been instrumental in providing new ways of addressing the classic challenges of obtaining accurate estimates of posterior distributions, and this is what we'll focus on in this chapter. Previously, we called our neural networks $f_\theta$, but in the following we'll use $q_\theta = f_\theta$ to make clear we're dealing with a learned probability. Specifically, we'll target neural networks that learn a probability density, i.e. $\int q_\theta(x) dx = 1$.
We'll often first target unconditional densities, and then show how they can be modified to learn conditional versions $q_\theta(x|y)$.
Looking ahead, the learned SBI methods, i.e. approaches for computing posterior distributions, have the following properties:
✅ Pro:
* Fast inference (once trained)
* Less affected by curse of dimensionality
* Can represent arbitrary priors
❌ Con:
* Require costly upfront training
* Lack rigorous theoretical guarantees
In the following we'll explain how to obtain and derive a very popular and powerful family of methods that can be summarized as **diffusion models**. We could simply provide the final algorithm (which will turn out to be surprisingly simple), but it's actually very interesting to see where it all comes from.
We'll focus on the basics, and leave the _physics-based extensions_ (i.e. including differentiable simulators) for a later section. The path towards diffusion models also introduces a few highly interesting concepts from machine learning along the way, and provides a nice "red thread" for discussing seminal papers from the past few years. Here we go...
<br>
![Divider](resources/divider-gen6.jpg)
```{note} Historic Alternative: Bayesian Neural Networks
A classic variant that should be mentioned here are "Bayesian Neural Networks". They
follow Bayes more closely, and prescribe a prior distribution on the neural network
parameters to learn the posterior distribution. Every weight and bias in the NN is assumed to be Gaussian with its own mean and variance, which are adjusted at training time. For inference, we can then "sample" a network, and use it like any regular NN.
Despite being a very good idea on paper, this method turned out to have problems with learning complex distributions, and requires careful tuning of the hyperparameters involved. Hence, these days, it's strongly recommended to use flow matching (or at least a diffusion model) instead.
If you're interested in details, BNNs with a code example can be found, e.g., in v0.3 of PBDL: https://arxiv.org/abs/2109.05237v3 .
```

Incorporating Physical Constraints
=======================
Despite the powerful capabilities of diffusion- and flow-based networks for generative modeling that we discussed in the previous sections, there is no direct feedback loop between the network, the observation and the sample at training time. This means there is no direct mechanism to include **physics-based constraints** such as priors from PDEs. As a consequence, it's very difficult to produce highly accurate samples based on learning alone: for scientific applications, we often want to make sure the errors can be driven below any chosen threshold.
In this chapter, we will outline two strategies to remedy this shortcoming. Building on the content of previous chapters, the central goal of both is to get **differentiable simulations** back into the training and inference loop. The previous chapters have shown that they're very capable tools, so the main question is how to best employ them in the context of diffusion modeling.
```{note}
Below we'll focus on the inverse problem setting from {doc}`probmodels-intro`. I.e., we have a system $y=f(x)$ (with numerical simulator $y=\mathcal P(x)$) and given an observation $y$, we'd like to obtain the posterior distribution for the distributional solution $x \sim p(x|y)$ of the inverse problem.
```
## Guiding Diffusion Models
Having access to a physical model with a differentiable simulation $\mathcal{P}(x)=y$ means we can obtain gradients $\nabla_x$ through the simulation. As before, we aim for solving _inverse_ problems where, given an output $y$ we'd like to sample from the conditional posterior distribution $p(x|y)$ to obtain samples $x$ that explain $y$. The previous chapter demonstrated learning such distributions with diffusion models, and given a physics prior $\mathcal{P}$, there's a first fundamental choice: should we use the gradient at _training time_, i.e., trying to improve the learned distribution $p_\theta$, or at _inference time_, to improve sampling $x \sim p_\theta(x|y)$?
**Training with physics priors:** Incorporating physics-based signals in the form of gradients at training time aims to improve the state of $p_\theta$ after training. While this could, e.g., compensate for sparse training data, there is little hope for substantially improving the accuracy of the learned distribution. The training process for diffusion and flow matching models typically yields very capable neural networks that are excellent at producing approximate samples from the posterior. They're typically limited in terms of their accuracy by model and training data size, and it's difficult to fundamentally improve the capabilities of a model at this stage. Rather, in this context it is more interesting to obtain higher accuracies at inference time.
**Inference with physics priors:** For scientific applications, classic simulations typically yield control knobs that allow for choosing a level of accuracy. E.g., iterative solvers for linear systems provide iteration counts and residual thresholds, and if a solution is not accurate enough, a user can simply reduce the residual threshold to obtain a more accurate output. In contrast, neural networks typically come without such controls, and even the iteration count of denoising or velocity integration (for flow matching) are bounded in terms of final accuracy. More steps typically reduce noise, and correspondingly the error, but will plateau at a level of accuracy given by the capabilities of the trained model. This is exactly where the gradients of a physics solver show promise: they provide an external process that can guide and improve the output of a diffusion model. As we'll show below, this makes it possible to push the levels of accuracy beyond those of pure learning, and can yield inverse problem solvers that outperform traditional solvers.
Recall that for denoising, we train a noise estimator $\epsilon_\theta$, and at inference time iterate denoising steps of the form
$x_{\text{new}} = x - \hat \alpha_t \epsilon_\theta(x, t) + \hat \sigma_t \mathcal N(0,I)$, where $\hat \alpha,\hat \sigma$ denote the merged scaling factors for both terms.
The most straightforward approach for including gradients is to additionally take a step in the direction of the gradient $\nabla_x || \mathcal P(x) - y||_2$. For simplicity, we take an $L^2$ distance towards the observation $y$ here. This was shown to direct sampling even when the posterior is not conditional, i.e., if we only have access to $x \sim p_\theta(x)$, and is known as _diffusion posterior sampling_ {cite}`chung2023dps`.
While this approach manages to include $\mathcal P$, there are two challenges: $x$ is typically noisy, and the gradient step can distort the distributional sampling of the denoising process. The first point is handled quite easily with an _extrapolation step_ (more details below), while the second one is more difficult to address: the gradient descent steps via $\nabla_x \mathcal P$ are akin to a classic optimization for the inverse problem and could strongly distort the outputs of the diffusion model. E.g., in the worst case they could pull the different points of the posterior distribution towards a single case favored by the simulator $\mathcal P$. Hence, the following paragraphs will outline a strategy that merges simulator and learning, while preserving the distribution of the posterior.
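The structure of such a guided denoising step can be sketched as follows; note that `eps_theta` and `P` below are trivial placeholders (a shrinkage rule and a linear map) standing in for a trained noise estimator and a real differentiable simulator:

```python
import numpy as np

rng = np.random.default_rng(3)

def eps_theta(x, t):
    return 0.1 * x                    # placeholder for a trained noise estimator

def P(x):
    return 2.0 * x                    # placeholder differentiable simulator

def dps_step(x, t, y, alpha_t, sigma_t, guidance_scale=0.05):
    # plain denoising step: x - alpha_t * eps_theta + sigma_t * noise
    x_new = x - alpha_t * eps_theta(x, t) + sigma_t * rng.standard_normal(x.shape)
    # additional step along the gradient of ||P(x) - y||^2
    # (analytic here since P is linear: d/dx ||2x - y||^2 = 2 * 2 * (2x - y))
    grad = 2.0 * 2.0 * (P(x) - y)
    return x_new - guidance_scale * grad

out = dps_step(np.ones(4), t=0.5, y=np.zeros(4), alpha_t=0.1, sigma_t=0.0)
```

The `guidance_scale` hyperparameter controls exactly the trade-off discussed above: larger values pull samples more strongly towards matching $y$, at the risk of collapsing the posterior.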
We'll focus on flow matching as a state-of-the-art approach next, and afterwards discuss a variant that treats the diffusion steps themselves as a physical process.
![Divider](resources/divider-genA.jpg)
## Physics-Guided Flow Matching
To reintroduce control signals using simulators into the flow matching algorithm we'll follow {cite}`holzschuh2024fm`. The goal is to augment an existing pretrained flow-based network, as outlined in {doc}`probmodels-intro`, with a flexible control signal, aggregating the learned flow and the control signals into a _controlled flow_. This is the task of a second neural network, the _control network_, which makes sure that the posterior distribution is not negatively affected by the signals from the simulator. This second network is small compared to the pretrained flow network, and freezing the weights of the pretrained network works very well; thus, the refinement for control needs only a fairly small amount of additional parameters and computing resources.
```{figure} resources/probmodels-phys-overview.jpg
---
height: 240px
name: probmodels-phys-overview
---
An overview of the control framework. We will consider a pretrained flow network $v_\theta$ and use the predicted flow for the trajectory point $x_t$ at time $t$ to estimate $\hat{x}_1$.
On the right, we show a gradient-based control signal with a differentiable simulator and cost function $C$ for improving $\hat{x}_1$.
An additional network learns to combine the predicted flow with feedback via the control signal to give a new controlled flow.
By combining learning-based updates with suitable controls, we avoid local optima and obtain high-accuracy samples with low inference times.
```
The control signals can be based on gradients and a cost function, if the simulator is differentiable, but they can also be learned directly from the simulator output.
Below, we'll show that performance gains due to simulator feedback are substantial and cannot be achieved by training on larger datasets alone.
Specifically, we'll show that flow matching with simulator feedback is competitive with MCMC baselines for a problem from gravitational lensing in terms of accuracy, and it beats them significantly regarding inference time. This indicates that it provides a very attractive tool for practical applications.
**Controlled flow $v_\theta^C$** First, it's a good idea to pretrain a regular, conditional flow network $v_\theta(x,y,t)$ without any control signals to make sure that we can realize the best performance achievable based on learning alone.
Then, in a second training phase, a control network $v_\theta^C(v, c,t)$ is introduced. It receives the pretrained flow $v$ and control signal $c$ as input. Based on these additional inputs, it can use, e.g., the gradient of a PDE to produce an improved flow matching velocity. At inference time, we integrate
$dx/dt = v^C_\theta(v,c,t)$ just like before, only now each step means evaluating $v_\theta(x,y,t)$ and then $c$ beforehand. (We'll focus on the details of $c$ in a moment.)
First, the control network is much smaller in size than the regular flow network, making up ca. $10\%$ of the weights $\theta$. The network weights of $v_\theta$ can be frozen, to train with the conditional flow matching loss {eq}`conditional-flow-matching` for a small number of additional steps. This reduces training time and compute since we do not need to backpropagate gradients through $v_\theta(x, y,t)$. Freezing the weights of $v_\theta$ typically does not negatively affect the performance, although a joint end-to-end training could provide some additional improvements.
**1-step prediction** The conditional flow matching networks $v_\theta(x,y,t)$ from {doc}`probmodels-intro` gradually transform samples from $p_0$ to $p_1$ during inference via integrating the simple ODE $dx_t/dt = v_\theta(x_t,y,t)$ step by step. There is no direct feedback loop between the current point on the trajectory $x_t$, the observation $y$, and a physical model that we could bring into the picture. An important first issue is that the current trajectory point $x_t$ is often not close to a good estimate of a posterior sample $x_1$.
This is especially severe at the beginning of inference, where $x_0$ is drawn from the source distribution (typically a Gaussian), and hence $x_t$ will be very noisy. Most simulators really don't like very noisy inputs, and trying to compute gradients on top of them is clearly a bad idea.
This issue is alleviated by extrapolating $x_t$ forward in time to obtain an estimated $\hat{x}_1$
$$
\begin{align}
\hat{x}_1 = x_t + (1-t) v_\theta(x_t, y, t).
\end{align}
$$ (eq:1_step_prediction)
and then performing subsequent operations for control and guidance on $\hat{x}_1$ instead of the current, potentially noisy $x_t$.
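Putting the extrapolation and the controlled flow together, inference can be sketched as an explicit Euler loop; all functions below are trivial placeholders (the real $v_\theta$ and $v_\theta^C$ are trained networks, and the real control signal comes from a simulator):

```python
import numpy as np

# trivial placeholders: the real v_theta and v_theta^C are trained networks
def v_theta(x, y, t):
    return y - x                      # stand-in flow pointing at the condition

def control_signal(x_hat, y):
    return y - x_hat                  # stand-in control signal

def v_controlled(v, c, t):
    return v + 0.1 * c                # stand-in controlled flow

def sample(y, num_steps=100):
    # integrate dx/dt = v^C(v, c, t) from t=0 to t=1 with explicit Euler
    x = np.zeros_like(y)              # source sample (zeros for a deterministic demo)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = v_theta(x, y, t)
        x_hat = x + (1.0 - t) * v     # 1-step prediction of x_1
        c = control_signal(x_hat, y)
        x = x + dt * v_controlled(v, c, t)
    return x

x1 = sample(np.full(3, 2.0))
```

The key structural point is that the control signal is always evaluated on the extrapolated $\hat{x}_1$, never on the noisy intermediate $x_t$.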
Note that this 1-step prediction is also conceptually related to diffusion sampling using [_likelihood-guidance_](https://dblp.org/rec/conf/nips/WuTNBC23). In diffusion models, sampling is based on the conditional score $\nabla_{x_t} \log p(x_t|y)$, which can be decomposed into
$$
\begin{align}
\nabla_{x_t} \log p(x_t|y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y|x_t).
\end{align}
$$
The first expression can be estimated using a pretrained diffusion network, whereas the latter is usually intractable, but can be approximated using
$p(y|x_t) \approx p_{y|x_0}(y|\hat{x}(x_t))$,
where the denoising estimate $\hat{x}(x_t) = \mathbb{E}_q[x_0|x_t]$ is usually obtained via Tweedie's formula $(\mathbb{E}_q[x_0|x_t] - x_t) / t\sigma^2$. In practice, the estimate $\hat{x}(x_t)$ is very poor when $x_t$ is still noisy, making inference difficult in the early stages. In contrast, flows based on linear conditional transport paths have empirically been shown to have trajectories with less curvature compared to, for example, denoising-based networks. This property of flow matching enables inference in fewer steps and provides better estimates for $\hat{x}_1$.
### Physics-based Controls
Now we focus on the content of the control signal $c$ that was already used above. We extend the idea of self-conditioning via physics-based control signals to include an additional feedback loop between the network output and an underlying physics-based prior. We'll distinguish between two types of controls in the following: a gradient-based control from a differentiable simulator, and one from a learned estimator network.
```{figure} resources/probphys02-control.jpg
---
height: 240px
name: probphys02-control
---
Types of control signals. (a) From a differentiable simulator, and (b) from a learned encoder.
```
**Gradient-based control signal** In the first case, we make use of a differentiable simulator $\mathcal{P}$ to construct a cost function $C$. Naturally, $C$ will likewise be differentiable such that we can compute a gradient for a predicted solution. Also, we will rely on the stochasticity of diffusion/flow matching, and as such the simulator can be deterministic.
Given an observation $y$ and the estimated 1-step prediction $\hat{x}_1$, the control signal quantifies how well $\hat{x}_1$ explains $y$ via the cost function $C$. Good choices for the cost are, e.g., an $L^2$ loss or a likelihood $p(y|\hat{x}_1)$. We define the control signal $c$ to consist of two components: the cost itself, and the gradient of the cost w.r.t. $\hat{x}_1$:
$$
\begin{align}
c(\hat{x}_1, y) := [C(\mathcal{P}(\hat{x}_1), y); \nabla_{\hat{x}_1} C(\mathcal{P}(\hat{x}_1), y)].
\end{align}
$$
As this information is passed to a network, the network can freely make use of the current distance to the target (the value of $C$) and the direction towards lowering it in the form of $\nabla_{\hat{x}_1} C$.
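To make this concrete, here is a small sketch with a toy linear operator standing in for the differentiable simulator $\mathcal{P}$; the names `simulator`, `cost`, and `control_signal` are illustrative, and a real setup would obtain the gradient from the solver via automatic differentiation rather than analytically:

```python
import numpy as np

def simulator(x, A):
    # toy differentiable "physics" P(x); a stand-in for a real solver
    return A @ x

def cost(x, y, A):
    # L2 cost C(P(x), y)
    r = simulator(x, A) - y
    return 0.5 * float(r @ r)

def cost_grad(x, y, A):
    # analytic gradient of C w.r.t. x (what autodiff would provide for a real solver)
    return A.T @ (simulator(x, A) - y)

def control_signal(x1_hat, y, A):
    # c = [C; grad_x C], concatenated as an additional network input
    return np.concatenate(([cost(x1_hat, y, A)], cost_grad(x1_hat, y, A)))
```

The resulting vector has size $1+\mathrm{dim}(x)$: the scalar distance to the target followed by the direction towards lowering it.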
**Learning-based control signal** When the simulator is non-differentiable, the second variant of using a learned estimator comes in handy.
To combine the simulator output with the observation $y$, a learnable encoder network _Enc_ with parameters $\theta_E$ can be introduced to judge the similarity of the simulation and the observation. The output of the encoder is small and of size $O(\mathrm{dim}(x))$.
The control signal is then defined as
$$
\begin{align}
c(\hat{x}_1, y) := Enc(\mathcal{P}(\hat{x}_1), y).
\end{align}
$$
The gradient backpropagation is stopped at the output of the simulator $\mathcal{P}$, as shown in {numref}`figure {number} <probphys02-control>`.
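A minimal sketch of this variant follows, with a toy black-box simulator and a randomly initialized (untrained) linear encoder; in an actual training framework, the stop-gradient would be realized via, e.g., `detach()` in PyTorch or `jax.lax.stop_gradient`:

```python
import numpy as np

def simulator(x):
    # black-box, possibly non-differentiable simulator (toy stand-in)
    return np.tanh(x)

def encoder(sim_out, y, W):
    # learned encoder Enc(P(x1_hat), y) with (hypothetical) parameters W;
    # in a training framework, sim_out would be wrapped in a stop-gradient
    # so that no gradients are backpropagated into the simulator
    return W @ np.concatenate([sim_out, y])

rng = np.random.default_rng(0)
dim_x = 4
W = rng.normal(size=(dim_x, 2 * dim_x))  # output stays of size O(dim(x))
x1_hat, y = rng.normal(size=dim_x), rng.normal(size=dim_x)
c = encoder(simulator(x1_hat), y, W)
```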
Before showing some examples of the capabilities of these two types of control, we'll discuss some of their properties.
![Divider](resources/divider-genB.jpg)
### Additional Considerations
**Stochastic simulators** Many Bayesian inference problems have a stochastic simulator. For simplicity, we assume that all stochasticity within such a simulator can be controlled via a variable $z \sim \mathcal{N}(0, I)$, which is an additional input. Motivated by the equivalence of exchanging expectation and gradient
$$
\begin{align}
\nabla_{\hat{x}_1} \mathbb{E}_{z\sim \mathcal{N}(0,I)} [ C(\mathcal P_z(\hat{x}_1), y)] = \mathbb{E}_{z\sim \mathcal{N}(0,I)} [ \nabla_{\hat{x}_1} C(\mathcal P_z(\hat{x}_1), y)],
\end{align}
$$
when calling the simulator, we draw a random realization of $z$. During training, we randomly draw $z$ for each sample and step while during inference we keep the value of $z$ fixed for each trajectory.
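The different handling of $z$ during training and inference can be sketched as follows (with a hypothetical toy simulator):

```python
import numpy as np

def stochastic_simulator(x, z):
    # toy stochastic simulator; all randomness is controlled by z ~ N(0, I)
    return x + 0.1 * z

rng = np.random.default_rng(0)
x = np.zeros(4)

# training: a fresh z is drawn for every sample and every step
for step in range(3):
    z = rng.normal(size=x.shape)
    out = stochastic_simulator(x, z)

# inference: z is drawn once and kept fixed along the whole trajectory
z_fixed = rng.normal(size=x.shape)
for step in range(3):
    out = stochastic_simulator(x, z_fixed)
```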
**Time-dependence**
If the estimate $\hat{x}_1$ is bad and the corresponding cost $C(\hat{x}_1, y)$ is high, gradients and control signals can become unreliable. It turns out that the estimates $\hat{x}_1$ become more reliable for later times in the flow matching process.
In practice, $t \geq 0.8$ is a good threshold. Therefore, we only train the control network $v_\theta^C$ in this range, which allows it to focus on control signals containing more useful information to, e.g., fine-tune the solutions with the accurate gradients of a differentiable simulator. For $t < 0.8$, we directly output the pretrained flow $v_\theta(x_t, y, t)$.
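Putting the threshold together with the 1-step prediction, the velocity used during inference could be assembled as follows (all function handles are hypothetical placeholders for trained networks and the control computation):

```python
import numpy as np

def controlled_velocity(x_t, y, t, v_theta, v_theta_C, control, t_switch=0.8):
    # below the threshold, output the pretrained flow; above it, form the
    # 1-step prediction and feed the control signal to the controlled network
    if t < t_switch:
        return v_theta(x_t, y, t)
    x1_hat = x_t + (1.0 - t) * v_theta(x_t, y, t)
    return v_theta_C(x_t, y, t, control(x1_hat, y))
```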
**Theoretical correctness**
In the formulation above, the approximation $\hat{x}_1$ only influences the control signal, which is an input to the controlled flow network $v_\theta^C$. In the case of a deterministic simulator, this makes the control signal a function of $x_t$. The controlled flow network is trained with the same loss as vanilla flow matching. This has the nice consequence that the theoretical properties are preserved.
This is in contrast to, e.g., "likelihood-based guidance", which uses an approximation of $\nabla_{x_t} \log p(y|x_t)$ as a guidance term during inference; this approximation is not covered by the original flow matching theory.
### An Example from Astrophysics
To demonstrate how this guidance from a physics solver affects the accuracy of samples and the posterior, we show an example from strong gravitational lensing: a challenging inverse problem in astrophysics that requires precise posteriors for accurate modeling of observations. In galaxy-scale strong lenses, light from a source galaxy is deflected by the gravitational potential of a galaxy between the source and the observer, causing multiple images of the source to be seen. Traditional computational approaches require several minutes to many hours or days to model a single lens system. Therefore, there is an urgent need to reduce the computational cost of inference with learning-based methods. In this experiment, it's shown that using flow matching and control signals with feedback from a simulator gives posterior distributions for lens modeling that are competitive with the posteriors obtained by MCMC-based methods, while being much faster at inference.
```{figure} resources/probmodels-astro.jpg
---
height: 240px
name: probmodels-astro
---
Results from flow matching for reconstructing gravitational lenses. Left: flow matching with a differentiable simulator (bottom) clearly outperforms pure flow matching (top). Right: comparisons against classic baselines. The FM+simulator variant is more accurate while being faster.
```
The image above shows an example reconstruction and the residual errors. While flow matching and the physics-based variant are both very accurate (it's hard to visually make out differences), the FM version is just on par with classic inverse solvers. The version with the simulator, however, provides a substantial boost in accuracy that is very difficult to achieve even for classic solvers. The quantitative results are shown in the table on the right: the best classic baseline is AIES with an average $\chi^2$ statistic of 1.74, while FM with simulator yields 1.48. Given that the best possible result due to noisy observations is 1.17 for this scenario, the FM+simulator version is highly accurate.
At the same time, the performance numbers for _modeling time_ in the right column show that the FM variant clearly outperforms the classic solvers. While the simulator increases inference time compared to only the neural network (10s to 19s), the classic baselines require more than $50\times$ longer reconstruction times. Interestingly, this example also highlights the problems of "simpler" physics combinations in the form of DPS. The DPS version does not manage to keep up with the classic solvers in terms of accuracy. To conclude, the _FM+simulator_ variant is not only substantially more accurate, but also ca. $35\times$ faster than the best classic solver above (AIES). (Source code for this approach will be available soon [in this repository](https://github.com/tum-pbs/sbi-sim).)
---
A summary of the physics-based flow matching is given by the following bullet points:
✅ Pro:
* Improved accuracy over purely learned diffusion models
* Gives control over residual accuracy
* Reduced runtime compared to traditional inverse solvers
❌ Con:
* Requires differentiable physical process
* Increased computational resources
![Divider](resources/divider-gen1.jpg)
## Score Matching with Differentiable Physics
So far we have treated the _diffusion time_ of denoising and flow matching as a process that is purely virtual and orthogonal to the time of the physical process to be represented by the forward and inverse problems. This is the most generic viewpoint, and works nicely, as demonstrated above. However, it's interesting to think about the alternative: merging the two processes, i.e., treating the diffusion process as an inherent component of the physics system.
```{figure} resources/probmodels-smdp-1trainB.jpg
---
height: 240px
name: probmodels-smdp-trainB
---
The physics process (heat diffusion as an example, left) perturbs and "destroys" the initial state. At inference time (right, Buoyancy flow as an example), the solver is used to compute inverse steps and produce solutions by combining steps along the score and the gradient of the solver.
```
The following sections will explain such a combined approach, following the paper "Solving Inverse Physics Problems with Score Matching" {cite}`holzschuh2023smdp`, for which [code is available in this repository](https://github.com/tum-pbs/SMDP).
This approach solves inverse physics problems by leveraging the ideas of score matching. The system's current state is moved backward in time step by step by combining an approximate inverse physics simulator and a learned correction function. A central insight of this work is that training the learned correction with a single-step loss is equivalent to a score matching objective, while recursively predicting longer parts of the trajectory during training relates to maximum likelihood training of a corresponding probability flow. The resulting inverse solver exhibits good accuracy and temporal stability. In line with diffusion modeling, and in contrast to classic learned solvers, it allows for sampling the posterior of the solutions. The method will be called _SMDP_ (for _Score Matching with Differentiable Physics_) in the following.
### Training and Inference with SMDP
For training, SMDP fits a neural ODE, the probability flow, to the set of perturbed training trajectories. The probability flow is comprised of an approximate reverse physics simulator $\tilde{\mathcal{P}}^{-1}$ as well as a correction function $s_\theta$. For inference, we simulate the system backward in time from $\mathbf{x}_T$ to $\mathbf{x}_0$ by combining $\tilde{\mathcal{P}}^{-1}$, the trained $s_\theta$ and Gaussian noise in each step.
For optimizing $s_\theta$, our approach moves a sliding window of size $S$ along the training trajectories and reconstructs the current window. Gradients for $\theta$ are accumulated and backpropagated through all prediction steps. This process is illustrated in the following figure:
```{figure} resources/probmodels-smdp-1train.jpg
---
height: 240px
name: probmodels-smdp-train
---
Overview of the score matching training process, incorporating a physics solver $\mathcal{P}$ and its approximate inverse $\tilde{\mathcal{P}}^{-1}$.
```
A differentiable solver or a learned surrogate model is employed for $\tilde{\mathcal{P}}^{-1}$.
The neural network $s_\theta(\mathbf{x}, t)$ parameterized by $\theta$ is trained such that
$$
\mathbf{x}_{m} \approx \mathbf{x}_{m+1} + \Delta t \left[ \tilde{\mathcal{P}}^{-1}(\mathbf{x}_{m+1}) + s_\theta(\mathbf{x}_{m+1}, t_{m+1}) \right].
$$
In this equation, the term $s_\theta(\mathbf{x}_{m+1}, t_{m+1})$ corrects approximation errors and resolves uncertainties from the stochastic forcing $F_{t_m}(z_m)$. Potentially, this process can be unrolled over multiple steps at training time to improve accuracy and stability. At inference time, the stochastic differential equation
$$
d\mathbf{x} = \left[ -\tilde{\mathcal{P}}^{-1}(\mathbf{x}) + C \, s_\theta(\mathbf{x},t) \right] dt + g(t) dW
$$
is integrated via the Euler-Maruyama method to obtain a solution for the inverse problem.
Setting $C=1$ and excluding the noise gives the probability flow ODE: a unique, deterministic solution. This deterministic variant is no longer probabilistic, but has other interesting properties.
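A sketch of this inference loop, implementing the discrete reverse update above with an optional Euler-Maruyama noise term (`p_inv`, `s_theta`, and `g` are placeholder callables for the reverse physics step, the trained correction, and the noise schedule):

```python
import numpy as np

def smdp_reverse(x_T, p_inv, s_theta, g, t_grid, C=1.0, stochastic=True, seed=0):
    """Move x_T backward to x_0: x_m = x_{m+1} + dt * [P~^{-1} + C * s_theta],
    plus an optional Euler-Maruyama noise term g(t) * sqrt(dt) * N(0, I)."""
    rng = np.random.default_rng(seed)
    x = np.array(x_T, dtype=float)
    for m in range(len(t_grid) - 1, 0, -1):
        dt = t_grid[m] - t_grid[m - 1]
        x = x + dt * (p_inv(x) + C * s_theta(x, t_grid[m]))
        if stochastic:  # omit the noise for the deterministic probability flow ODE
            x = x + g(t_grid[m]) * np.sqrt(dt) * rng.normal(size=x.shape)
    return x
```

With `stochastic=False`, repeated calls give the unique ODE solution; with the noise term enabled, each call samples a different posterior candidate.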
```{figure} resources/probmodels-smdp-2infer.jpg
---
height: 148px
name: probmodels-smdp-infer
---
An overview of SMDP at inference time.
```
### SMDP in Action
This section shows experiments for the stochastic heat equation $\frac{\partial u}{\partial t} = \alpha \Delta u + g(t)\,\xi$, where $\xi$ is space-time white noise. It slightly perturbs the deterministic heat diffusion process, which plays a fundamental role in many physical systems. For the experiments, we fix the diffusivity constant to $\alpha = 1$ and sample initial conditions at $t=0$ from Gaussian random fields with $n=4$ at resolution $32 \times 32$. We simulate the heat diffusion with noise from $t=0$ until $t=0.2$ using the Euler-Maruyama method and a spectral solver $\mathcal{P}_h$ with a fixed step size and $g \equiv 0.1$. Given a simulation end state $\mathbf{x}_T$, we want to recover a possible initial state $\mathbf{x}_0$.
In this experiment, the forward solver cannot be used to infer $\mathbf{x}_0$ directly, since high frequencies due to noise are amplified, leading to physically implausible solutions. Instead, the reverse physics step $\tilde{\mathcal{P}}^{-1}$ is implemented by using the forward step of the solver $\mathcal{P}_h(\mathbf{x})$, i.e. $\tilde{\mathcal{P}}^{-1}(\mathbf{x}) \approx - \mathcal{P}_h (\mathbf{x})$.
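As an illustration, a toy spectral step for the heat equation on a square periodic grid, together with the corresponding sign-flipped reverse physics step, could look as follows (unit grid spacing is assumed; the solver used in the paper differs in its details):

```python
import numpy as np

def heat_step(x, dt=0.01, alpha=1.0):
    # one forward step of the heat equation via a (toy) spectral solver:
    # each Fourier mode decays exactly by exp(-alpha * k^2 * dt)
    k = 2.0 * np.pi * np.fft.fftfreq(x.shape[0])
    kx, ky = np.meshgrid(k, k, indexing="ij")
    decay = np.exp(-alpha * (kx**2 + ky**2) * dt)
    return np.real(np.fft.ifft2(np.fft.fft2(x) * decay))

def reverse_physics_step(x, dt=0.01, alpha=1.0):
    # P~^{-1}(x) ~ -P_h(x): reuse the forward update direction with flipped sign
    # (here expressed as a rate, i.e. the per-time update of the forward solver)
    return -(heat_step(x, dt, alpha) - x) / dt
```

The forward step damps all nonzero frequencies while preserving the mean, which is exactly why noisy high frequencies explode when it is naively inverted.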
A small ResNet-like architecture based on an encoder and a decoder part is used as representation for the score function $s_\theta(\mathbf{x}, t)$. The spectral solver is implemented via differentiable programming in _JAX_. As baseline methods, a supervised training of the same architecture as $s_\theta(\mathbf{x}, t)$, a Bayesian neural network (BNN), as well as an FNO network are considered. An $L_2$ loss is used for all these methods, i.e., the training data consists of pairs of initial state $\mathbf{x}_0$ and end state $\mathbf{x}_T$. Additionally, a variant of the SMDP method is included for which the reverse physics step $\tilde{\mathcal{P}}^{-1}$ is removed, such that the inversion of the dynamics has to be learned entirely by $s_\theta$, denoted by "$s_\theta$ only".
```{figure} resources/probmodels-smdp-3heat.jpg
---
name: probmodels-smdp-heat
---
While the ODE trajectories provide smooth solutions with the lowest reconstruction MSE, the SDE solutions synthesize high-frequency content, significantly improving spectral error.
The "$s_\theta$ only" version without the reverse physics step exhibits a significantly larger spectral error. Metrics (right) are averaged over three runs.
```
SMDP and the baselines are evaluated by considering the _reconstruction MSE_ on a test set of $500$ initial conditions and end states. For the reconstruction MSE, the prediction of the network is simulated forward in time with the solver $\mathcal{P}_h$ to obtain a corresponding end state, which is compared to the ground truth via the $L_2$ distance. This metric has the disadvantage that it does not measure how well the prediction matches the training data manifold, i.e., for this case, whether the prediction resembles the properties of the initial Gaussian random field. For that reason, the power spectral density of the states is shown as a _spectral loss_. An evaluation and visualization of the reconstructions are given in {numref}`figure {number} <probmodels-smdp-heat>`, which shows that the ODE inference performs best regarding the reconstruction MSE. However, its solutions are smooth and do not contain the necessary small-scale structures. This is reflected in a high spectral error. The SDE variant, on the other hand, performs very well in terms of spectral error and yields visually convincing solutions with only a slight increase in the reconstruction MSE.
This highlights the role of noise as a source of entropy in the inference process for diffusion models, such as the SDE in SMDP, which is essential for synthesizing small-scale structures. Note that there is a natural tradeoff between both metrics, and the ODE and SDE inference perform best for each of the cases while using an identical set of weights. This heat diffusion example highlights the advantages and properties of treating the physical process as part of the diffusion process. This, of course, extends to other physics. E.g., [the SMDP repository](https://github.com/tum-pbs/SMDP) additionally shows a case with an inverse Navier-Stokes solve.
## Summary of Physics-based Diffusion Models
Overall, the sections above have explained two methods to incorporate physics-based constraints and models in the form of PDEs into diffusion modeling. Interestingly, the inclusion is largely in line with {doc}`diffphys`, i.e. gradients of the physics solver are a central quantity, and concepts like unrolling play an important role. On the other hand, the probabilistic modeling introduces additional complexity on the training and inference sides. It provides powerful tools and access to distributions of solutions (we haven't even touched follow-up applications such as uncertainty quantification above), but this comes at a cost.
As a rule of thumb 👍, diffusion modeling should only be used if the solution is a distribution that is _not_ well represented by the mean of the solutions. If the mean is acceptable, "regular" neural networks offer substantial advantages in terms of reduced complexity for training and inference.
However, if the solutions are a distribution 🌦️, diffusion models are powerful tools to work with complex and varied solutions. Given its capabilities, deep learning with diffusion models arguably introduces surprisingly _little_ additional complexity. E.g., training flow matching models is surprisingly robust, can be built on top of deterministic training, and introduces only a mild computational overhead.
To show how the combination of physics solvers and diffusion models turns out in terms of an implementation, the next section shows source code for an SMDP use case.

probmodels-sbisim.ipynb (new file; diff suppressed because one or more lines are too long)

probmodels-score.ipynb (new file; diff suppressed because one or more lines are too long)

probmodels-time.ipynb (new file; diff suppressed because one or more lines are too long)

probmodels-uncond.md (new file)
Unconditional Stability
=======================
The results of the previous section, for time predictions with diffusion models, and earlier ones ({doc}`diffphys-discuss`)
make it clear that unconditionally stable networks are definitely possible.
This has also been reported in various other works. However, there is still a fair number of approaches that seem to have trouble with long-term stability.
This poses a very interesting question: which ingredients are necessary to obtain _unconditional stability_?
Unconditional stability here means obtaining trained networks that are stable for arbitrarily long rollouts. Are inductive biases or special training methodologies necessary, or is it simply a matter of training enough different initializations? Our setup provides a very good starting point to shed light on this topic.
The "success stories" from earlier chapters, some with fairly simple setups, indicate that unconditional stability is “nothing special” for neural network based predictors. I.e., it does not require special loss functions or tricks beyond a proper learning setup (suitable hyperparameters, sufficiently large model plus enough data).
As errors will accumulate over time, we can expect that network size and the total number of update steps in training are important. Interestingly, it seems that the neural network architecture doesn't really matter: we can obtain stable rollouts with pretty much “any” architecture once it's sufficiently large.
Note that we'll focus on time steps with a **fixed length** in the following. "Unconditional stability" here refers to being stable over an arbitrary number of iterative steps. The following networks could potentially be trained for variable time step sizes as well, but we will focus on the "dimension" of stability over multiple, iterative network calls below.
![Divider](resources/divider-gen2.jpg)
## Main Considerations for an Evaluation
As shown in the previous chapter, diffusion models perform extremely well. This can be attributed to the underlying task of working with pure noise as input (e.g., for denoising or flow matching tasks). Likewise, the network architecture has only a minor influence: the network simply needs to be large enough to provide a converging iteration. For supervised or unrolled training, we can leverage a variety of discrete and continuous neural operators. CNNs, Unets, FNOs and Transformers are popular approaches here.
Interestingly, FNOs, due to their architecture, _project_ the solution onto a subspace of the frequencies in the discretization. This inherently removes high frequencies that primarily drive instabilities. As such, they're influenced by unrolling to a lesser extent [(details can be found, e.g., here)](https://tum-pbs.github.io/apebench-paper/).
Operators that better preserve small-scale details, such as convolutions, can strongly benefit from unrolling. This will be a focus of the following ablations.
Interestingly, it turns out that the batch size and the length of the unrolling horizon play a crucial but conflicting role: small batches are preferable, but in the worst case under-utilize the hardware and require long training runs. Unrolling on the other hand significantly stabilizes the rollout, but leads to increased resource usage due to the longer computational graph for each NN update. Thus, our experiments show that a “sweet spot” along the Pareto-front of batch size vs unrolling horizon can be obtained by aiming for as-long-as-possible rollouts at training time in combination with a batch size that sufficiently utilizes the available GPU memory.
**Learning task:** To analyze the temporal stability of autoregressive networks on long rollouts, two flow prediction tasks from the [ACDM benchmark](https://github.com/tum-pbs/autoreg-pde-diffusion) are considered: an easier incompressible cylinder flow (denoted by _Inc_), and a complex transonic wake flow (denoted by _Tra_) at Reynolds number 10 000. For Inc, the networks are trained on flows with Reynolds numbers between 200 and 900 and are required to extrapolate to Reynolds numbers of 960, 980, and 1000 during inference (_Inc-high_). For Tra, the training data consists of flows with Mach numbers between 0.53 and 0.9, and networks are tested on the Mach numbers 0.50, 0.51, and 0.52 (denoted by _Tra-ext_). This regime is tough, as it contains a substantial amount of shocks that interact with the flow.
For each sequence in both data sets, three training runs of each architecture are unrolled over 200 000 steps. This unrolling length is no proof that these networks yield infinitely long stable rollouts, but it indicates an extremely small probability for blowups.
## Comparing Architectures
As a first comparison, we'll train three networks with an identical U-Net backbone that use different stabilization techniques. This comparison shows that it is possible to successfully achieve "unconditional stability" in different ways:
- Unrolled training (_U-Net-ut_) where gradients are backpropagated through multiple time steps during training.
- Networks trained on a single prediction step with added training noise (_U-Net-tn_). This technique is known to improve stability by reducing data shift, as the added noise emulates errors that accumulate during inference.
- Autoregressive conditional diffusion models (ACDM). A denoising diffusion model is conditioned on the previous time step and iteratively refines noise to create a prediction for the next step, as shown in {doc}`probmodels-time`.
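To illustrate what unrolled training means in the simplest possible setting, here is a toy example: a scalar linear system $x_{n+1} = a\,x_n$, where the unrolled loss over multiple predicted steps and its gradient (which backpropagation through time would otherwise compute) are written out analytically. All names and the setup are illustrative; the experiments above apply the same principle to full neural operators via automatic differentiation:

```python
import numpy as np

def unrolled_loss_grad(a, x0, traj):
    # loss and analytic gradient for a rollout of the toy model x_{n+1} = a * x_n,
    # unrolled over len(traj) steps; gradients flow through every predicted step
    loss, grad = 0.0, 0.0
    for m, target in enumerate(traj, start=1):
        pred = a**m * x0
        loss += (pred - target) ** 2
        grad += 2.0 * (pred - target) * m * a ** (m - 1) * x0
    return loss, grad

a_true, x0 = 0.9, 1.0
traj = [a_true**m * x0 for m in range(1, 9)]  # ground-truth rollout, 8 steps

a = 0.5  # initial model parameter
for _ in range(500):
    _, grad = unrolled_loss_grad(a, x0, traj)
    a -= 0.005 * grad  # plain gradient descent
```

After the descent loop, the parameter recovers the true dynamics $a \approx 0.9$, since errors from all unrolled steps contribute to the gradient.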
```{figure} resources/probmodels-uncond01.png
---
height: 240px
name: probmodels-uncond-inc
---
Vorticity predictions for an incompressible flow with a Reynolds number of 1000 over 200 000 time steps (Inc-high).
```
The figure above illustrates the resulting predictions. All methods and training runs remain unconditionally stable over the entire rollout on Inc-high. Since this flow is unsteady but fully periodic, the results of all networks are simple, periodic trajectories that prevent error accumulation. This example serves to show that for simpler tasks, long-term stability is less of an issue. Networks have a relatively easy time keeping their predictions within the manifold of the solutions. Let's consider a tougher example: the transonic flows with shock waves in Tra.
```{figure} resources/probmodels-uncond02.png
---
height: 240px
name: probmodels-uncond-tra
---
Vorticity predictions for transonic flows with a Mach number 0.52 (Tra-ext, outside the training data range) over 200 000 time steps.
```
For the test sequences from Tra-ext, one of the three trained U-Net-tn networks has stability issues within the first few thousand steps. This network deteriorates to a simple, mean flow prediction without vortices. Unrolled training (U-Net-ut) and diffusion models (ACDM), on the other hand, are fully stable across sequences and training runs for this case, indicating a higher resistance to rollout errors which normally cause instabilities. The autoregressive diffusion models turn out to be unconditionally stable across the board [(details here)](https://arxiv.org/abs/2309.01745), so we'll drop them in the following evaluations and focus on models where stability is more difficult to achieve: the U-Nets, as representatives of convolutional, discrete neural operators.
## Stability Criteria
Focusing on the U-Net networks with unrolled training, we will next focus on training multiple models (3 each time), and measure the percentage of stable runs they achieve. This provides more thorough statistics compared to the single, qualitative examples above.
We'll first investigate the key criterion of rollout length, to show how it influences fully stable rollouts over extremely long horizons.
The figure below lists the percentage of stable runs for a range of ablation networks on the Tra-ext data set with rollouts over 200 000 time steps. Results on the individual Mach numbers, as well as an average (top row), are shown.
```{figure} resources/probmodels-uncond03-ma.png
---
height: 210px
name: probmodels-uncond03-ma
---
Percentage of stable runs on the Tra-ext data set for different ablations of unrolled training.
```
The different generalization tests over Mach numbers make no difference.
The most important criterion for stability is the number of unrolling steps $m$: while networks with $m \leq 4$ consistently do not achieve stable rollouts, using $m \geq 8$ is sufficient for stability across different Mach numbers.
**Negligible Aspects:**
Three factors that did not substantially impact rollout stability in experiments are the prediction strategy, the amount of training data, and the backbone architecture. We'll only briefly summarize the results here. First, using residual predictions, i.e., predicting the difference to the previous time step instead of the full state itself, did not impact stability. Second, the stability is not affected when reducing the amount of available training data by a factor of 8, from 1000 time steps per Mach number to 125 steps (while training with 8× more epochs to ensure a fair comparison). This training data reduction still retains the full physical behavior, i.e., complete vortex shedding periods. Third, it is possible to train other backbone architectures with unrolling to achieve fully stable rollouts as well, such as dilated ResNets. For ResNets without dilation, only one trained network is stable, most likely due to the reduced receptive field. However, we expect achieving full stability is also possible with longer training rollout horizons.
------
## Batch Size vs Rollout
Interestingly, the batch size turns out to be an important factor:
it can substantially impact the stability of autoregressive networks. This is similar to the image domain, where smaller batches are known to improve generalization (this is the motivation for using mini-batching instead of gradients over the full data set). The impact of the batch size on stability and training time is shown in the figures below, for both investigated data sets. Networks that only come close to the ideal rollout length at a large batch size can be stabilized with smaller batches. However, this effect does not completely remove the need for unrolled training, as networks without unrolling were unstable across all tested batch sizes. For the Inc case, the U-Net width was reduced by a factor of 8 across layers (in comparison to above) to artificially increase the difficulty of this task. Otherwise, all parameter configurations would already be stable, which would obscure the effect of varying the batch size.
```{figure} resources/probmodels-uncond04a.png
---
height: 210px
name: probmodels-uncond04a
---
Percentage of stable runs and training time for different combinations of rollout length and batch size for the Tra-ext data set. Grey configurations are omitted due to memory limitations (mem) or due to high computational demands (-).
```
```{figure} resources/probmodels-uncond04b.png
---
height: 210px
name: probmodels-uncond04b
---
Percentage of stable runs and training time for rollout length and batch size for the Inc-high data set. Grey again indicates out-of-memory (mem) or overly high computational demands (-).
```
This shows that increasing the batch size is more expensive in terms of training time on both data sets, due to less memory-efficient computations. Using longer rollouts during training does not necessarily induce longer training times, as we compensate for longer rollouts with a smaller number of updates per epoch. E.g., we use either 250 batches with a rollout of 4, or 125 batches with a rollout of 8. Thus the number of simulation states that each network sees over the course of training remains constant. However, we did in practice observe additional computational costs for training the larger U-Net network on Tra-ext. This leads to the central question in these ablations: which combination of rollout length and batch size is most efficient?
```{figure} resources/probmodels-uncond05.png
---
height: 180px
name: probmodels-uncond05
---
Training time for different combinations of rollout length and batch size on the Tra-ext data set (left) and the Inc-high data set (right). Only configurations that lead to highly stable networks (stable run percentage >= 89%) are shown.
```
The figure above answers this question by showing the central tradeoff between rollout length and batch size (only stable versions are included here).
To achieve _unconditionally stable_ networks and neural operators, it is consistently beneficial to choose configurations where large rollout lengths are paired with a batch size that is big enough to sufficiently utilize the available GPU memory. This means improved stability is achieved more efficiently with longer training rollouts rather than smaller batches, as indicated by the green dots with the lowest training times.
## Summary
To conclude the results above: with a suitable training setup, unconditionally stable predictions with extremely long rollouts are clearly possible, even for complex flows. According to the experiments, the most important factor that impacts stability is the decision for or against diffusion-based training.
Without diffusion, several factors need to be considered:
- Long rollouts at training time
- Small batch sizes
- Comparing these two factors: longer rollouts are preferable, and result in faster training times than smaller batch sizes
- At the same time, sufficiently large networks are necessary (this depends on the complexity of the learning task).
Factors that did not substantially impact long-term stability are:
- Prediction paradigm during training, i.e., residual and direct prediction are viable
- Additional training data without new physical behavior
- Different network architectures, although the ideal number of unrolling steps might vary for each architecture
This concludes the topic of "unconditional stability".
Further details of these experiments can be found in the [ACDM paper](https://arxiv.org/abs/2309.01745).

@@ -13,6 +13,68 @@
@STRING{NeurIPS = "Advances in Neural Information Processing Systems"}
@article{braun2025msbg,
title ={{Adaptive Phase-Field-FLIP for Very Large Scale Two-Phase Fluid Simulation}},
author = {Braun, Bernhard and Bender, Jan and Thuerey, Nils},
journal = {{ACM} Transactions on Graphics},
volume = {44 (3)},
year = {2025},
publisher = {ACM},
}
@inproceedings{lino2025dgn,
title={Learning Distributions of Complex Fluid Simulations with Diffusion Graph Networks},
author={Mario Lino and Tobias Pfaff and Nils Thuerey},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{liu2025config,
title={ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks},
author={Qiang Liu and Mengyu Chu and Nils Thuerey},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{bhatia2025prdp,
title={Progressively Refined Differentiable Physics},
author={Kanishk Bhatia and Felix Koehler and Nils Thuerey},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{koehler2024ape,
title={APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs},
author={Felix Koehler and Simon Niedermayr and Ruediger Westermann and Nils Thuerey},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2024}
}
@article{list2025differentiability,
title={Differentiability in unrolled training of neural physics simulators on transient dynamics},
author={List, Bjoern and Chen, Li-Wei and Bali, Kartik and Thuerey, Nils},
journal={Computer Methods in Applied Mechanics and Engineering},
volume={433},
pages={117441},
year={2025},
publisher={Elsevier}
}
@inproceedings{shehata2025trunc,
title={Truncation Is All You Need: Improved Sampling Of Diffusion Models For Physics-Based Simulations},
author={Youssef Shehata and Benjamin Holzschuh and Nils Thuerey},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{schnell2025td,
title={Temporal Difference Learning: Why It Can Be Fast and How It Will Be Faster},
author={Patrick Schnell and Luca Guastoni and Nils Thuerey},
booktitle={International Conference on Learning Representations},
year={2025}
}
@inproceedings{holl2024phiflow,
title={phiflow: Differentiable Simulations for PyTorch, TensorFlow and Jax},
@@ -21,7 +83,6 @@
year={2024}
}
@inproceedings{liu2024airfoils,
title={Uncertainty-aware Surrogate Models for Airfoil Flow Simulations with Denoising Diffusion Probabilistic Models},
author={Liu, Qiang and Thuerey, Nils},
@@ -51,35 +112,59 @@
url={https://joss.theoj.org/papers/10.21105/joss.06171},
}
@article{kohl2023benchmarking,
title={Benchmarking autoregressive conditional diffusion models for turbulent flow simulation},
author={Kohl, Georg and Chen, Li-Wei and Thuerey, Nils},
journal={arXiv:2309.01745},
year={2023}
}
@article{brahmachary2024unsteady,
title={Unsteady cylinder wakes from arbitrary bodies with differentiable physics-assisted neural network},
author={Brahmachary, Shuvayan and Thuerey, Nils},
journal={Physical Review E},
volume={109},
number={5},
year={2024},
publisher={APS}
}
@article{holzschuh2024fm,
title={Solving Inverse Physics Problems with Score Matching},
author={Benjamin Holzschuh and Nils Thuerey},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
volume={36},
year={2023}
}
@article{holzschuh2023smdp,
title={Solving Inverse Physics Problems with Score Matching},
author={Benjamin Holzschuh and Simona Vegetti and Nils Thuerey},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
volume={36},
year={2023}
}
@inproceedings{franz2023nglobt,
title={Learning to Estimate Single-View Volumetric Flow Motions without 3D Supervision},
author={Erik Franz and Barbara Solenthaler and Nils Thuerey},
booktitle={ICLR},
year={2023},
url={https://github.com/tum-pbs/Neural-Global-Transport},
}
@inproceedings{kohl2023volSim,
title={Learning Similarity Metrics for Volumetric Simulations with Multiscale CNNs},
author={Kohl, Georg and Chen, Li-Wei and Thuerey, Nils},
booktitle={AAAI Conference on Artificial Intelligence},
year={2022},
url={https://github.com/tum-pbs/VOLSIM},
}
@inproceedings{list2022piso,
title={Learned Turbulence Modelling with Differentiable Fluid Solvers},
author={Bjoern List and Liwei Chen and Nils Thuerey},
booktitle={Journal of Fluid Mechanics (929/25)},
year={2022},
url={https://ge.in.tum.de/publications/},
}
@@ -120,8 +205,8 @@
}
@article{chu2021physgan,
title ={{Learning Meaningful Controls for Fluids}},
author = {Chu, Mengyu and Thuerey, Nils and Seidel, Hans-Peter and Theobalt, Christian and Zayer, Rhaleb},
journal = ACM_TOG,
volume = {40(4)},
year = {2021},
@@ -1032,5 +1117,81 @@
year={2019}
}
# archs & prob mod
@article{goodfellow2014gan,
title={Generative adversarial networks},
author={Goodfellow, Ian and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron and Bengio, Yoshua},
journal={Advances in Neural Information Processing Systems},
volume={27},
year={2014}
}
@inproceedings{ronneberger2015unet,
title={U-net: Convolutional networks for biomedical image segmentation},
author={Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
year={2015},
}
@article{yu2015dilate,
title={Multi-scale context aggregation by dilated convolutions},
author={Yu, Fisher and Koltun, Vladlen},
journal={arXiv preprint arXiv:1511.07122},
year={2015}
}
@article{li2021fno,
title={Fourier neural operator for parametric partial differential equations},
author={Z. Li and N. B. Kovachki and K. Azizzadenesheli and B. Liu and K. Bhattacharya and A. M. Stuart and A. Anandkumar},
journal={ICLR}, year={2021}
}
@article{chen2019node,
title={Neural Ordinary Differential Equations},
author={Ricky T. Q. Chen and Yulia Rubanova and Jesse Bettencourt and David Duvenaud},
journal={arXiv:1806.07366}, year={2019}
}
@article{vincent2011dsm,
title={A connection between score matching and denoising autoencoders},
author={Vincent, Pascal},
journal={Neural computation},
volume={23},
number={7},
pages={1661--1674},
year={2011},
publisher={MIT Press}
}
@article{kobyzev2020nf,
title={Normalizing flows: An introduction and review of current methods},
author={Kobyzev, Ivan and Prince, Simon JD and Brubaker, Marcus A},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={43}, number={11},
year={2020},
publisher={IEEE}
}
@article{lipman2022flow,
title={Flow matching for generative modeling},
author={Lipman, Yaron and Chen, Ricky TQ and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
journal={arXiv:2210.02747}, year={2022}
}
@article{liu2022rect,
title={Flow straight and fast: Learning to generate and transfer data with rectified flow},
author={Liu, Xingchao and Gong, Chengyue and Liu, Qiang},
journal={arXiv:2209.03003}, year={2022}
}
@inproceedings{chung2023dps,
title={Diffusion posterior sampling for general noisy inverse problems},
author={Chung, Hyungjin and Kim, Jeongsol and Mccann, Michael and Klasky, Marc and Ye, Jong Chul},
booktitle={International Conference on Learning Representations},
year={2023}
}

@@ -1,7 +1,8 @@
Introduction to Reinforcement Learning
=======================
Deep reinforcement learning, which we'll just call _reinforcement learning_ (RL) from now on, is a class of methods in the larger field of deep learning that takes a different viewpoint from the classic "train with data" one:
RL effectively lets an AI agent learn from interactions with an environment. While performing actions, the agent receives reward signals and tries to discern which actions contribute to higher rewards, to adapt its behavior accordingly. RL has been very successful at playing games such as Go {cite}`silver2017mastering`, and it bears promise for engineering applications such as robotics.
The setup for RL generally consists of two parts: the environment and the agent. The environment receives actions $a$ from the agent while supplying it with observations in the form of states $s$, and rewards $r$. The observations represent the fraction of the information from the respective environment state that the agent is able to perceive. The rewards are given by a predefined function, usually tailored to the environment, and might contain, e.g., a game score, a penalty for wrong actions, or a bounty for successfully finished tasks.
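The agent-environment interaction loop described above can be sketched as follows. This is a deliberately tiny, hypothetical environment (the `ToyEnv` class, its state, and its reward rule are all made up for illustration; real RL code would use a library such as Gymnasium and a learned policy):

```python
import random

class ToyEnv:
    """Minimal environment: the state counts successful actions,
    and the episode ends after five successes."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # Reward +1 for action 1, 0 otherwise; done after 5 successes.
        r = 1.0 if a == 1 else 0.0
        self.s += int(a == 1)
        done = self.s >= 5
        return self.s, r, done

env = ToyEnv()
s = env.reset()
total, done = 0.0, False
while not done:
    a = random.choice([0, 1])  # a real agent would choose a via its policy
    s, r, done = env.step(a)   # environment returns state, reward, done flag
    total += r
print("episode return:", total)
```

The loop makes the division of labor explicit: the environment owns the state transition and the reward function, while the agent only sees observations and rewards and supplies actions.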
