fixed typos Georg
This commit is contained in:
parent
5cb92b4943
commit
3e694b217c
@@ -2,9 +2,9 @@
# Learn more at https://jupyterbook.org/customize/config.html

title: Physics-based Deep Learning
author: N. Thuerey, P. Holl, M. Mueller, P. Schnell, F. Trost, K. Um
author: N. Thuerey, B. Holzschuh, P. Holl, G. Kohl, P. Schnell, F. Trost
logo: resources/logo.jpg
copyright: "2021,2022"
copyright: "2021 - 2025"
only_build_toc_files: true

launch_buttons:
@@ -6,17 +6,21 @@
"id": "d5oh9eQZLx9c"
},
"source": [
"# Diffsion-based Time Prediction\n",
"# Diffusion-based Time Prediction\n",
"\n",
"Simulating partial differential equations (PDEs), for example turbulent fluid flows, often requires resolving solutions over time. I.e., we're not insterested\n",
"in a time-averaged or long-term equilibrium state, but the actual changes over time. This requires iterative solvers that are called _auto-regressively_, \n",
"Simulating partial differential equations (PDEs), for example turbulent fluid flows, often requires resolving solutions over time. I.e., we're not interested\n",
"in a time-averaged or long-term equilibrium state, but the actual changes of our system over time. This requires iterative solvers that are called _auto-regressively_, \n",
"one step after the other, to produce a solution over time. \n",
"Despite all advancements in this area, it is still a critical challenge to achieve stable and accurate predictions for extended temporal horizons. Many dynamical systems are inherently complex and chaotic, making it difficult to faithfully capture intricate physical phenomena over long timeframes. \n",
"At the same time, uncertainties also play a role for time series prediction:\n",
"Even minor ambiguities in the spatially averaged states used for simulations can lead to very different outcomes over time. Moreover, most traditional solvers and learning-based methods process simulation trajectories in a determinstic way. \n",
"They produce a single solution without accounting for the probabilistic nature of turbulent flows. This motivates - as in the previous sections - to view the steps of a time series as a probabilistic distribution over time rather than a deterministic series of states.\n",
"\n",
"The following notebook introduces an approach for temporal predictions:\n",
"At the same time, uncertainties also play a role for time series prediction:\n",
"Even minor ambiguities in the spatially averaged states used for simulations can lead to very different outcomes over time. Moreover, most traditional solvers and learning-based methods process simulation trajectories in a deterministic way, treating them as being first-order Markovian (one state fully determines the next one). \n",
"Instead, a more realistic viewpoint of many systems is given by the [Mori-Zwanzig formalism](https://en.wikipedia.org/wiki/Mori-Zwanzig_formalism): we observe a part of our system, but at the same time an \"unobserved\" (or un-simulated) part of the state can influence its evolution over time.\n",
"Deterministic simulators produce a single solution without accounting for a potentially probabilistic underlying process. \n",
"This motivates - as in the previous sections - to view the steps of a time series as a probabilistic distribution over time rather than a deterministic series of states.\n",
"A probabilistic simulator can learn to take into account the influence of the un-observed state, and infer solutions from variations of this un-observed part of the system. Worst case, if this un-observed state has a negligible influence, we should see a mean state with a variance that's effectively zero. So there's nothing to lose! \n",
"\n",
"The following notebook introduces an effective, distribution-based approach for temporal predictions:\n",
"* conditional diffusion models are used to compute autoregressive rollouts to obtain a \"probabilistic simulator\"; \n",
"* it is of course highly interesting to compare this diffusion-based predictor to the deterministic baselines and neural operators from the previous chapters;\n",
"* in order to evaluate the results w.r.t. their accuracy, we'll employ a transonic fluid flow for which we can compute statistics from a simulated reference.\n",
@@ -24,15 +28,20 @@
"Problem formulation: while we've previously often focused on training networks for the task $f(x)=y$, we now focus on \n",
"tasks of the form $f(x_{t})=x_{t+1}$ \n",
"to indicate that any subsequent step, e.g., $f(x_{t+1})=x_{t+2}$, is a problem of the same importance as the first one.\n",
"We still have ground truth values $x^*_{t+1}$, e.g., from an expensive high-fidelity simulation, and aim for a minimzation problem\n",
"We still have ground truth values $x^*_{t+1}$, e.g., from an expensive high-fidelity simulation, and aim for a minimization problem\n",
"\n",
"$$\n",
" \\text{arg min}_{\\theta} | f(x_{t};\\theta) - x^*_{t+1} |_2^2 .\n",
"$$ (learn-autoreg-l2)\n",
"\n",
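"A minimal, self-contained sketch of this one-step objective, followed by an autoregressive rollout (with a small stand-in network and random tensors instead of the actual simulation data), could look like this:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"# stand-in for a full backbone; shapes and data are purely illustrative\n",
"net = torch.nn.Sequential(\n",
"    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.GELU(),\n",
"    torch.nn.Conv2d(32, 3, 3, padding=1))\n",
"opt = torch.optim.Adam(net.parameters(), lr=1e-4)\n",
"\n",
"x_t, x_next = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # dummy training pairs\n",
"loss = ((net(x_t) - x_next) ** 2).mean()  # the L2 objective from above\n",
"opt.zero_grad(); loss.backward(); opt.step()\n",
"\n",
"# at inference time the network is called auto-regressively, one step after the other\n",
"state = x_t[:1]\n",
"with torch.no_grad():\n",
"    for _ in range(10):\n",
"        state = net(state)\n",
"```\n",
"\n",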
"```{note} Outlook\n",
"\n",
"One of the most interesting aspects of using diffusion-based time predictors is their temporal stability. It seems that the diffusion process forces the networks to learn handling perturbations and accumulated errors in the states without being completely thrown off track. This is crucial for _unconditional stability_, i.e., neural networks that can be called autoregressively any number of times without blowing up. The training process below yields unconditionally stable networks with a surprisingly simple approach for training (we'll use DDPM below, but flow matching would likewise work).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{note} \n",
"**Unconditional stability**: One of the most interesting aspects of using diffusion-based time predictors is their temporal stability. It seems that the diffusion process forces the networks to learn handling perturbations and accumulated errors in the states without being easily thrown off track. This is crucial for _unconditional stability_, i.e., neural networks that can be called autoregressively any number of times without blowing up. The training process below yields unconditionally stable networks with a surprisingly simple approach for training (we'll use DDPM below, but flow matching would likewise work).\n",
"\n",
"A more detailed evaluation of the long-term stability of diffusion-based predictions [can be found here](https://ge.in.tum.de/2024/08/05/how-to-train-unconditionally-stable-autoregressive-neural-operators/).\n",
"```\n"
@@ -45,16 +54,22 @@
"## Conditioning\n",
"\n",
"Previously, for the inverse problem setting we only briefly mentioned that the inference task of producing the posterior distribution depends\n",
"on a set of hyperparameters such as a chosen set of boundary conditions. Let's consider $x=(c,d)$, i.e. a datum $x$ is made up of a\n",
"conditioning part $c$ and the target data $d$.\n",
"For time predictions, we additionally have a strong conditioning on the current time step. Hence, this is a good occasion to explain\n",
"on a set of hyperparameters such as a chosen set of boundary conditions. Let's consider $x=(c,d)$, i.e. a data point $x$ is made up of a\n",
"component for conditioning $c$ and the target data $d$.\n",
"For time predictions, we additionally have a strong conditioning on the current time step, i.e. $c$ will contain $x_t$ in addition to, e.g., a Reynolds number. \n",
"Hence, this is a good occasion to explain\n",
"some of the subtleties of implementing the conditioning. The central take-away message here is: all inputs for conditioning should be treated\n",
"in the same way as the outputs of the diffusion process.\n",
"\n",
"This seems somewhat counter-intuitive at first: after all, the conditioning is more similar to an input than an output. \n",
"However, it was shown that \"forcing\" the network to denoise the conditioning alongside the target at training time pushes\n",
"it to fully consider the conditioning variables. This leads to a tight entangling of features learned for the output with \n",
"the conditioning. Thus for training we consider both parts $c$ and $d$ in the same way. This is illustrated on the left side\n",
"the conditioning. \n",
"Removing the conditioning information at high noise levels in this way has the additional benefit of reducing error accumulation: \n",
"Since the initial steps of the reverse process $p_\\theta$ are mostly unconditional due to the very noisy conditioning, accumulated \n",
"errors in $c$ from the previous denoising steps are not immediately included in the prediction $d$.\n",
"\n",
"Thus for training we consider both parts $c$ and $d$ in the same way. This is illustrated on the left side\n",
"of the following picture. Conditioning $c$ and data components $d$ are treated the same at training time. This illustration \n",
"denotes denoising time by $r$, to distinguish it from the time of the physical process $t$.\n",
"\n",
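"The following minimal sketch (assumed shapes and a tiny stand-in network, not the benchmark code) illustrates this joint treatment at training time: conditioning $c$ and data $d$ are concatenated, noised with the same schedule, and the network is trained to denoise both parts:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"betas = torch.linspace(1e-4, 0.02, 20)          # 20 diffusion steps, as used below\n",
"alphas_bar = torch.cumprod(1.0 - betas, dim=0)\n",
"net = torch.nn.Conv2d(6, 6, 3, padding=1)       # stand-in; a real model also embeds r\n",
"\n",
"c = torch.randn(4, 3, 32, 32)                   # conditioning, e.g. the previous state\n",
"d = torch.randn(4, 3, 32, 32)                   # target data, e.g. the next state\n",
"x0 = torch.cat([c, d], dim=1)                   # treat both parts the same\n",
"\n",
"r = torch.randint(0, 20, (4,))                  # random denoising time r per sample\n",
"ab = alphas_bar[r].view(-1, 1, 1, 1)\n",
"eps = torch.randn_like(x0)\n",
"x_r = ab.sqrt() * x0 + (1 - ab).sqrt() * eps    # noise c and d together\n",
"\n",
"loss = ((net(x_r) - eps) ** 2).mean()           # loss covers conditioning and data\n",
"```\n",
"\n",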
@@ -70,11 +85,11 @@
"\n",
"Once the model is trained, we make use of the fact that we know the exact value of $c$. As the noise $\\epsilon$ is likewise\n",
"an input to the model that we have under full control, we can ensure that the conditioning $c_r$ at denoising time $r$ has\n",
"exactly the right content according to the chosen noise field and noising schedule. Hence we invoke $p_\\theta(x_{r-1}; x_r)$\n",
"exactly the right content according to the chosen noise field and noising schedule. Hence we invoke $p_\\theta(x_{r-1}| x_r)$\n",
"yielding $c_{r-1}$ as well as $d_{r-1}$, both contained in $x_{r-1}$. The predicted conditioning $c_{r-1}$ will be good\n",
"if the model $p_\\theta$ was trained well, but to make sure there is zero drift we simply recompute $c_{r-1}$ from the known ground truth \n",
"$c$ and the right noise amount $\\epsilon_{r-1}$. We then invoke $p_\\theta(x_{r-2}; x_{r-1})$ with $x_{r-1}$ containing\n",
"the re-commputed $c$ and the $d_{r-1}$ component previously inferred in the previous denoising step.\n",
"$c$ and the right noise amount $\\epsilon_{r-1}$. We then invoke $p_\\theta(x_{r-2}| x_{r-1})$ with $x_{r-1}$ containing\n",
"the re-computed $c$ and the $d_{r-1}$ component inferred in the previous denoising step.\n",
"\n",
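"A short sketch of this sampling loop (with a hypothetical stand-in denoiser and schedule, continuing the notation from the training sketch above) shows the conditioning reset in code: at every denoising step, the $c$ part is re-created from the known ground truth with the correct amount of noise, while only the $d$ part is kept from the network prediction:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"betas = torch.linspace(1e-4, 0.02, 20)\n",
"alphas = 1.0 - betas\n",
"alphas_bar = torch.cumprod(alphas, dim=0)\n",
"net = torch.nn.Conv2d(6, 6, 3, padding=1)       # stand-in denoiser predicting the noise\n",
"\n",
"c_clean = torch.randn(1, 3, 32, 32)             # known conditioning, e.g. x_t\n",
"d = torch.randn(1, 3, 32, 32)                   # d starts from pure noise\n",
"\n",
"for r in reversed(range(20)):\n",
"    ab = alphas_bar[r]\n",
"    c_r = ab.sqrt() * c_clean + (1 - ab).sqrt() * torch.randn_like(c_clean)  # re-noise ground truth c\n",
"    x_r = torch.cat([c_r, d], dim=1)\n",
"    eps_hat = net(x_r)[:, 3:]                    # keep only the d part of the prediction\n",
"    mean = (x_r[:, 3:] - betas[r] / (1 - ab).sqrt() * eps_hat) / alphas[r].sqrt()\n",
"    noise = torch.randn_like(d) if r > 0 else torch.zeros_like(d)\n",
"    d = mean + betas[r].sqrt() * noise\n",
"```\n",
"\n",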
"For time prediction tasks, the situation is no more complicated, but slightly more confusing in terms of notation:\n",
"here, the conditioning is the previous time step in _physical_ time $x^t$. (There could be additional global parameters in $c$, but we'll\n",
@@ -109,34 +124,7 @@
"\n",
"Specifically, this notebook will explain _autoregressive conditional diffusion models_ ([ACDM](https://github.com/tum-pbs/autoreg-pde-diffusion)), following an existing [benchmark and paper](https://arxiv.org/abs/2309.01745). The goal is the creation of a diffusion-based architecture that can probabilistically and accurately predict the next simulation step of a turbulent flow simulation. \n",
"\n",
"As a first step, we checkout the benchmark code, and download a pre-trained model checkpoint. This might take a few minutes. (Note: the command will not re-download files, if already downloaded successfully)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fatal: destination path 'autoreg-pde-diffusion' already exists and is not an empty directory.\n",
"/home/thuerey/jupyter/autoreg-pde-diffusion\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/thuerey/anaconda3/envs/torch24/lib/python3.12/site-packages/IPython/core/magics/osm.py:417: UserWarning: This is now an optional IPython functionality, setting dhist requires you to install the `pickleshare` library.\n",
" self.shell.db['dhist'] = compress_dhist(dhist)[-100:]\n"
]
}
],
"source": [
"!git clone https://github.com/tum-pbs/autoreg-pde-diffusion.git\n",
"%cd autoreg-pde-diffusion/"
"As a first step, we'll download a pre-trained model checkpoint to shorten the training time later on. This might take a few minutes. (Note: the command will not re-download files if already downloaded successfully)"
]
},
{
@@ -275,10 +263,10 @@
"source": [
"## Backbone Network Definition\n",
"\n",
"Of course, we also need a neural network architecture. Here, we will rely on a [\"modern\" U-Net](https://arxiv.org/abs/2006.11239). In contrast to classic U-Net from {doc}`supervised-airfoils`, this moderinzed version differes in a few important places:\n",
"Of course, we also need a neural network architecture. Here, we will rely on a [\"modern\" U-Net](https://arxiv.org/abs/2006.11239). In contrast to the classic U-Net from {doc}`supervised-airfoils`, this modernized version differs in a few important places:\n",
"the skip connections are replaced by attention mechanisms, _GELU_ activations replace _ReLU_, and group normalizations are employed at each layer of the U-Net. This architecture is the backbone of many popular diffusion models, and typically yields at least a few percent improvements over simpler architectures (in some cases also much more).\n",
"\n",
"We start with a residual block that defines the skip connections, as well as the up- and downsampling operations of the U-Net. We will also make use of sinusoidal position embeddings from the [transformer architectures](https://arxiv.org/abs/1706.03762), to integrate the diffusion step with a time embedding MLP throught the U-Net layers."
"We start with a residual block that defines the skip connections, as well as the up- and downsampling operations of the U-Net. We will also make use of sinusoidal position embeddings from the [transformer architectures](https://arxiv.org/abs/1706.03762), to integrate the diffusion step with a time embedding MLP throughout the U-Net layers."
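"As a reference, the standard sinusoidal embedding of the diffusion step can be written as follows (a generic sketch, independent of the exact implementation used below):\n",
"\n",
"```python\n",
"import math, torch\n",
"\n",
"def sinusoidal_embedding(r, dim=64):\n",
"    # maps integer diffusion steps r to a (len(r), dim) embedding\n",
"    half = dim // 2\n",
"    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / (half - 1))\n",
"    args = r.float()[:, None] * freqs[None, :]\n",
"    return torch.cat([args.sin(), args.cos()], dim=-1)\n",
"\n",
"emb = sinusoidal_embedding(torch.randint(0, 20, (4,)))  # then fed through a small MLP per U-Net block\n",
"```\n",
"\n",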
]
},
{
@@ -815,7 +803,7 @@
"id": "wWVCwnkrYdTH"
},
"source": [
"The most important function above is the `forward` step of the `DiffusionModel` class. It switches between training and inference via the `self.training` flag, and correspondingly either evalautes a single step in the Markov chain for backpropagation, or the full chain to obtain a sample at inference time. The `c` and `d` prefixes of the variables indicate the distinction between _conditioning_ and _data_ components in $x$, as explained above. E.g., an important line in the inference code is `dNoiseCond=concat(condNoisy, dNoise)` this gixes an $x$ by concatenating conditioning and data. Both parts of the $x$ are jointly denoised, and the conditioning part is removed after network execution via `modelMean[:, cond.shape[1]:modelMean.shape[1]]`. It's overwritten with the ground truth conditioning, so that the diffusion model can focus on producing an accurate `d` part.\n",
"The most important function above is the `forward` step of the `DiffusionModel` class. It switches between training and inference via the `self.training` flag, and correspondingly either evaluates a single step in the Markov chain for backpropagation, or the full chain to obtain a sample at inference time. The `c` and `d` prefixes of the variables indicate the distinction between _conditioning_ and _data_ components in $x$, as explained above. E.g., an important line in the inference code is `dNoiseCond=concat(condNoisy, dNoise)`, which gives an $x$ by concatenating conditioning and data. Both parts of the $x$ are jointly denoised, and the conditioning part is removed after network execution via `modelMean[:, cond.shape[1]:modelMean.shape[1]]`. It's overwritten with the ground truth conditioning, such that the diffusion model can focus on producing an accurate `d` part.\n",
"\n",
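"The channel slicing used there can be illustrated in isolation (a toy example with made-up channel counts; `cond` and `dPred` stand in for the conditioning and data parts):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"cond  = torch.randn(2, 3, 16, 16)           # c part: 3 channels\n",
"dPred = torch.randn(2, 3, 16, 16)           # d part: 3 channels\n",
"x = torch.cat([cond, dPred], dim=1)         # joint tensor with 6 channels\n",
"\n",
"dOnly = x[:, cond.shape[1]:x.shape[1]]      # strip the conditioning channels again\n",
"assert torch.equal(dOnly, dPred)\n",
"```\n",
"\n",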
"---\n",
"\n",
@@ -916,14 +904,14 @@
"source": [
"## Training\n",
"\n",
"With all these building blocks and data available now, its time put them together and train the diffusion model. You can choose to either continue training for a few epochs from the provided model checkpoint (takes less than a minute for 10 epochs), or train the model from scratch on the small examplary data set (takes about half an hour for 1000 epochs). Note that the prediction quality and diversity for training the model from scratch will be noticeably worse, due to limited amount of data available here.\n",
"With all these building blocks and data available now, it's time to put them together and train the diffusion model. You can choose to either continue training for a few epochs from the provided model checkpoint (takes less than a minute for 10 epochs), or train the model from scratch on the small exemplary data set (this will take a few hours). Note that the prediction quality and diversity for training the model from scratch will be noticeably worse, due to the limited amount of data available here.\n",
"\n",
"Feel free to skip this step entirely if you don't want to train the network. You can directly continue with the sampling below by using the pre-trained checkpoint."
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -1048,7 +1036,7 @@
" epochs = 10 # finetune only for a small number of epochs\n",
" lr = 0.00001 # since the model is already trained, a conservatively low learning rate\n",
"else:\n",
" epochs = 100 # train from scratch for large number of epochs\n",
" epochs = 100000 # train from scratch for large number of epochs, this will take a while\n",
" lr = 0.0001 # larger learning rate for training from scratch\n",
"\n",
"diffusionSteps = 20 # the provided model checkpoint was pretrained on 20 diffusion steps\n",
@@ -1116,7 +1104,7 @@
"source": [
"## Test Dataset\n",
"\n",
"Next we download a test dataset to make sure we can evaluate the trained network on new data. Here, we only use sequences from the data set with a different mach number and physical time than the training data above (the simulation with ID 22).\n"
"Next we download a test dataset to make sure we can evaluate the trained network on new data. Here, we only use sequences from the data set with a different Mach number and physical time than the training data above (the simulation with ID 22, which should correspond to a larger Mach number than the ones used for training with $\\text{Ma}=0.72$).\n"
]
},
{
@@ -1181,7 +1169,7 @@
"source": [
"## Test Inference\n",
"\n",
"We can now sample the trained diffusion model to create probabilistic predictions for a test set of flow trajectories. We store both predictions and ground truth in tensors with shape $(samples \\times sequences \\times sequenceLength \\times channels \\times sizeX \\times sizeY)$, that are used for the visualization below."
"We can now sample the trained diffusion model to create probabilistic predictions for a test set of flow trajectories. We store both predictions and ground truth in tensors with shape $(samples \\times sequences \\times sequenceLength \\times channels \\times sizeX \\times sizeY)$, which are used for the evaluations and visualization below."
]
},
{
@@ -1358,7 +1346,7 @@
"id": "Gx7w16FQTFrd"
},
"source": [
"The trained time operator closely matches early states, but you should be able to see variations for the last states at $t=60$. The shock waves for the cylinder are highly unstable, and hence give the network to create realistic but varying predicitions. Re-running inference with different noise values will produce additional variations.\n",
"The trained time operator closely matches early states, but you should be able to see variations for the last states at $t=60$. The shock waves for the cylinder are highly unstable, and hence allow the network to create realistic but varying predictions. Re-running inference with different noise values will produce additional variations, and longer rollouts will further increase differences.\n",
"\n",
"### Temporal Stability\n",
"\n",
@@ -1559,26 +1547,19 @@
"\n",
"To conclude the results from above, this code has yielded a probabilistic model for time predictions of PDEs. The great thing about it is that it estimates the changes and uncertainties in the dataset in order to reproduce them at inference time. Hence it provides posterior sampling over time, and can be run multiple times to infer different possible solutions.\n",
"\n",
"The flipside here is that diffusion models are generally not better at predicting the mean solution than classic methods [(see the ACDM benchmark for detailed evaluations)](https://github.com/tum-pbs/autoreg-pde-diffusion). Thus, if the input-output relationship in your data is unique, diffusion models will not pay off, and only incur higher inference computations. This holds for the networks above: they are more expensive, and are run repeatedly to produce a single sample. This could be sped up more (e.g. with flow matching, the model above uses denoising), but a certain (small) factor will remain.\n",
"The flip side here is that diffusion models are generally not better at predicting the mean solution than classic methods [(see the ACDM benchmark for detailed evaluations)](https://github.com/tum-pbs/autoreg-pde-diffusion). Thus, if the input-output relationship in your data is unique, diffusion models will not pay off, and only incur higher inference computations. This holds for the networks above: they are more expensive, and are run repeatedly to produce a single sample. This could be sped up more (e.g. with flow matching, the model above uses denoising), but a certain (small) factor will remain.\n",
"\n",
"Nonetheless, for most non-trivial datasets diffusion models will pay off: ambiguities in the data will **not be averaged out**, but treated (and reproduced) as a **distribution**.\n",
"In addition, as hinted at above, a highly interesting aspect of the diffusion-based time prediction is its **unconditional stability**.\n",
"The trained models do not blow up over time or transform the input into trivial steady states. Both cases are common \n",
"in models trained with other training methodologies. Rather, the diffusion-based networks retain the statistics of\n",
"the reference data over arbitrarily long rollouts (it's difficult to prove that they _never_ diverge, but in our\n",
"tests stable networks did not diverge over the course of several hundred thousand rollout steps). This is a highly\n",
"Given an appropriate learning task, the trained models do not blow up over time or transform the input into trivial steady states. Both cases are common \n",
"in models trained with other training methodologies. Rather, the diffusion-based networks can retain the statistics of\n",
"the reference data over arbitrarily long rollouts. It's difficult to prove that they _never_ diverge, but in our\n",
"tests stable networks did not diverge over the course of several hundred thousand rollout steps. This is a highly\n",
"attractive property, and indicates a fundamentally different behavior of diffusion-based models. In the next chapter\n",
"we'll provide more details, and investigate it in comparison with temporal _unrolling_. \n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -1,35 +1,35 @@
Unconditional Stablility
Unconditional Stability
=======================

The results of the previous section, for time predictions with diffusion models, and earilier ones ({doc}`diffphys-discuss`)
The results of the previous section, for time predictions with diffusion models, and earlier ones ({doc}`diffphys-discuss`)
make it clear that unconditionally stable networks are definitely possible.
This has also been reported in various other works. However, there's still a fair number of approaches that seem to have trouble with long-term stability.
This poses a very interesting question: which ingredients are necessary to obtain _unconditional stability_?
Unconditional stability here means obtaining trained networks that are stable for arbitrarily long rollouts. Are inductive biases or special training methodologies necessary, or is it simply a matter of training enough different initializations? Our setup provides a very good starting point to shed light on this topic.

The "success stories" from earlier chapters, some with fairly simple setups, indicate that unconditional stability is “nothing special” for neural network based predictors. I.e., it does not require special loss functions or tricks beyond a properly chosen set of hyperparamters for training. As errors will accumulate over time, we can expect that network size and the total number of update steps in training are important. Interestingly, it seems that the neural network architecture doesn’t really matter: we can obtain stable rollouts with pretty much “any” architecture once it’s sufficiently large.
The "success stories" from earlier chapters, some with fairly simple setups, indicate that unconditional stability is “nothing special” for neural network based predictors. I.e., it does not require special loss functions or tricks beyond a proper learning setup (suitable hyperparameters, sufficiently large model plus enough data).
As errors will accumulate over time, we can expect that network size and the total number of update steps in training are important. Interestingly, it seems that the neural network architecture doesn’t really matter: we can obtain stable rollouts with pretty much “any” architecture once it’s sufficiently large.

## Neural Network Architectures
Note that we'll focus on time steps with a **fixed length** in the following. The "unconditional stability" refers to being stable over an arbitrary number of iterative steps. The following networks could potentially be trained for variable time step sizes as well, but we will focus on the "dimension" of stability of multiple, iterative network calls below.

As shown in the previuos chapter, diffusion models perform extremely well. This can be attribute to the underlying task of working with noisy distributions (e.g. for denoising or flow matching). Likewise, the network architecture has only a minor influence: the network simply needs to be large enough to provide a converging iteration. For supervised or unrolled training, we can leverage a variety of discrete and continuous neural operators. CNNs, Unets, FNOs and Transformers are popular approaches here.
Interestingly, FNOs, due to their architecture _project_ the solution onto a subspace of the frequencies in the discretization. This inherently removes high frequencies that primarily drive isntabilities. As such, they're less strongly influenced by unrolling [(details can be found, e.g., here)](https://tum-pbs.github.io/apebench-paper/).
## Main Considerations for an Evaluation

As shown in the previous chapter, diffusion models perform extremely well. This can be attributed to the underlying task of working with pure noise as input (e.g., for denoising or flow matching tasks). Likewise, the network architecture has only a minor influence: the network simply needs to be large enough to provide a converging iteration. For supervised or unrolled training, we can leverage a variety of discrete and continuous neural operators. CNNs, Unets, FNOs and Transformers are popular approaches here.
Interestingly, FNOs, due to their architecture _project_ the solution onto a subspace of the frequencies in the discretization. This inherently removes high frequencies that primarily drive instabilities. As such, they're influenced by unrolling to a lesser extent [(details can be found, e.g., here)](https://tum-pbs.github.io/apebench-paper/).
Operators that better preserve small-scale details, such as convolutions, can strongly benefit from unrolling. This will be a focus of the following ablations.
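The frequency truncation mentioned above can be illustrated directly with a Fourier transform (an illustrative sketch only, not an actual FNO layer implementation):

```python
import torch

def lowpass_project(u, modes=8):
    # keep only the lowest `modes` frequencies, similar in spirit to an FNO spectral layer
    u_hat = torch.fft.rfft(u, dim=-1)
    u_hat[..., modes:] = 0                    # discard the high frequencies
    return torch.fft.irfft(u_hat, n=u.shape[-1], dim=-1)

u = torch.randn(4, 128)                       # a small batch of 1D states
u_smooth = lowpass_project(u)                 # high-frequency content that drives instabilities is gone
```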

Interestingly, it turns out that the batch size and the length of the unrolling horizon play a crucial but conflicting role: small batches are preferable, but in the worst case under-utilize the hardware and require long training runs. Unrolling on the other hand significantly stabilizes the rollout, but leads to increased resource usage due to the longer computational graph for each NN update. Thus, our experiements show that a “sweet spot” along the Pareto-front of batch size vs unrolling horizon can be obtained by aiming for as-long-as-possible rollouts at training time in combination with a batch size that sufficiently utilizes the available GPU memory.
Interestingly, it turns out that the batch size and the length of the unrolling horizon play a crucial but conflicting role: small batches are preferable, but in the worst case under-utilize the hardware and require long training runs. Unrolling on the other hand significantly stabilizes the rollout, but leads to increased resource usage due to the longer computational graph for each NN update. Thus, our experiments show that a “sweet spot” along the Pareto-front of batch size vs unrolling horizon can be obtained by aiming for as-long-as-possible rollouts at training time in combination with a batch size that sufficiently utilizes the available GPU memory.

Learning Task: To analyze the temporal stability of autoregressive networks on long rollouts, two flow prediction tasks from the [ACDM benchmark](https://github.com/tum-pbs/autoreg-pde-diffusion) are considered: an easier incompressible cylinder flow (denoted by _Inc_), and a complex transonic wake flow (denoted as _Tra_) at Reynolds number 10 000. For Inc, the networks are trained on flows with Reynolds number 200 – 900 and required to extrapolate to Reynolds numbers of 960, 980, and 1000 during inference (_Inc-high_). For Tra, the training data consists of flows with Mach numbers between 0.53 and 0.9, and networks are tested on the Mach numbers 0.50, 0.51, and 0.52 (denoted by _Tra-ext_). This Mach number is tough as
Learning Task: To analyze the temporal stability of autoregressive networks on long rollouts, two flow prediction tasks from the [ACDM benchmark](https://github.com/tum-pbs/autoreg-pde-diffusion) are considered: an easier incompressible cylinder flow (denoted by _Inc_), and a complex transonic wake flow (denoted by _Tra_) at Reynolds number 10 000. For Inc, the networks are trained on flows with Reynolds number 200 – 900 and required to extrapolate to Reynolds numbers of 960, 980, and 1000 during inference (_Inc-high_). For Tra, the training data consists of flows with Mach numbers between 0.53 and 0.9, and networks are tested on the Mach numbers 0.50, 0.51, and 0.52 (denoted by _Tra-ext_). This Mach number range is tough as it contains a substantial amount of shocks that interact with the flow.
For each sequence in both data sets, three training runs of each architecture are unrolled over 200,000 steps. This unrolling length is no proof that these networks yield infinitely long stable rollouts, but it indicates an extremely small probability for blow-ups.
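In practice, checking such a 200,000-step rollout for stability only requires watching for divergence; a minimal sketch (with a hypothetical `model` and initial `state`) could look like this:

```python
import torch

@torch.no_grad()
def is_stable(model, state, steps=200_000, threshold=1e3):
    # roll out the model autoregressively and flag NaNs or blow-ups
    for _ in range(steps):
        state = model(state)
        if not torch.isfinite(state).all() or state.abs().max() > threshold:
            return False
    return True
```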

## Architecture Comparison
## Comparing Architectures

As a first comparison, we'll train three networks with an identical U-Net backbone that use different stabilization techniques. This comparison shows that it is possible to successfully achieve the goal of "unconditional stability" in different ways (a minimal sketch of the first two techniques follows after the list):
- Unrolled training (_U-Net-ut_) where gradients are backpropagated through multiple time steps during training.
- Networks trained on a single prediction step with added training noise (_U-Net-tn_). This technique is known to improve stability by reducing data shift, as the added noise emulates errors that accumulate during inference.
- Autoregressive conditional diffusion models (ACDM). A denoising diffusion model is conditioned on the previous time step and iteratively refines noise to create a prediction for the next step, as shown in {doc}`probmodels-time`.
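The first two techniques can be sketched as follows (hypothetical `model`, `opt` and data tensors; shapes and hyperparameters are only illustrative):

```python
import torch

def unrolled_step(model, opt, x0, targets, horizon=4):
    # U-Net-ut: backpropagate through `horizon` consecutive predictions
    state, loss = x0, 0.0
    for t in range(horizon):
        state = model(state)
        loss = loss + ((state - targets[t]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def training_noise_step(model, opt, x_t, x_next, sigma=0.01):
    # U-Net-tn: single-step training, with the input perturbed to emulate
    # the accumulated errors encountered during inference rollouts
    x_in = x_t + sigma * torch.randn_like(x_t)
    loss = ((model(x_in) - x_next) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```
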
```{figure} resources/probmodels-uncond01.png
---
@@ -54,8 +54,8 @@ For the test sequences from Tra-ext, one from the three trained U-Net-tn network
## Stability Criteria

Focusing on the U-Net networks with unrolled training, we will next train multiple models (3 each time) and measure the percentage of stable runs they achieve. This provides more thorough statistics compared to the single, qualitative examples above.
We'll investigate the first key criterium rollout length, to show how it influences fully stable rollouts over extremely long horizons.
Figure 2 lists the percentage of stable runs for a range of ablation networks on the Tra-ext data set with rollouts over 200 000 time steps. Results on the indiviual Mach numbers, as well as an average (top row) are shown.
We'll investigate the first key criterion, the rollout length, to show how it influences fully stable rollouts over extremely long horizons.
Figure 2 lists the percentage of stable runs for a range of ablation networks on the Tra-ext data set with rollouts over 200 000 time steps. Results on the individual Mach numbers, as well as an average (top row), are shown.

```{figure} resources/probmodels-uncond03-ma.png
---
@@ -76,7 +76,7 @@ Three factors that did not substantially impact rollout stability in experiments
## Batch Size vs Rollout

Interestingly, the batch size turns out to be an important factor:
it can substantially impact the stability of autoregressive networks. This is similar to the image domain, where smaller batches are know to improve generalization (this is the motivation for using mini-batching instead of gradients over the full data set). The impact of the batch size on the stability and training time is shown in the figure below, for both investigated data sets. Networks that only come close to the ideal rollout lenght at a large batch size, can be stabilized with smaller batches. However, this effect does not completely remove the need for unrolled training, as networks without unrolling were unstable across all tested batch sizes. For the Inc case, the U-Net width was reduced by a factor of 8 across layers (in comparison to above), to artifically increase the difficulty of this task. Otherwise all parameter configurations would already be stable and show the effect of varying the batchsize.
it can substantially impact the stability of autoregressive networks. This is similar to the image domain, where smaller batches are known to improve generalization (this is the motivation for using mini-batching instead of gradients over the full data set). The impact of the batch size on the stability and training time is shown in the figure below, for both investigated data sets. Networks that only come close to the ideal rollout length at a large batch size can be stabilized with smaller batches. However, this effect does not completely remove the need for unrolled training, as networks without unrolling were unstable across all tested batch sizes. For the Inc case, the U-Net width was reduced by a factor of 8 across layers (in comparison to above), to artificially increase the difficulty of this task. Otherwise all parameter configurations would already be stable and show the effect of varying the batch size.

```{figure} resources/probmodels-uncond04a.png
---
@@ -91,7 +91,7 @@ Percentage of stable runs and training time for different combinations of rollou
height: 210px
name: probmodels-uncond04b
---
Percentage of stable runs and training time for rollout length and batch size for the Inc-high data set. Grey again indicates out-of-memory (mem) or overly high computations (-).
Percentage of stable runs and training time for rollout length and batch size for the Inc-high data set. Grey again indicates out-of-memory (mem) or overly high computational demands (-).
```

This shows that increasing the batch size is more expensive in terms of training time on both data sets, due to less memory-efficient computations. Using longer rollouts during training does not necessarily induce longer training times, as we compensate for longer rollouts with a smaller number of updates per epoch. E.g., we use either 250 batches with a rollout of 4, or 125 batches with a rollout of 8. Thus the number of simulation states that each network sees over the course of training remains constant. However, we did in practice observe additional computational costs for training the larger U-Net network on Tra-ext. This leads to the "central" question in these ablations: which combination of rollout length and batch size is most efficient?
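As an aside, the constant data budget mentioned above can be made explicit with a small sketch: the number of batches per epoch is chosen such that the product of rollout length and batch count stays fixed.

```python
states_per_epoch = 1000                 # e.g. 250 batches x rollout 4, as used above
for rollout in (2, 4, 8):
    batches = states_per_epoch // rollout
    print(rollout, batches, rollout * batches)   # 500/250/125 batches, product stays at 1000
```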
@@ -104,7 +104,7 @@ name: probmodels-uncond05
Training time for different combinations of rollout length and batch size on the Tra-ext data set (left) and the Inc-high data set (right). Only configurations that lead to highly stable networks (stable run percentage >= 89%) are shown.
```

This figure answers this question by showing the central tradeoff between rollout length and batch size (only stable versions are included here).
The figure above answers this question by showing the central tradeoff between rollout length and batch size (only stable versions are included here).
To achieve _unconditionally stable_ networks and neural operators, it is consistently beneficial to choose configurations where large rollout lengths are paired with a batch size that is big enough to sufficiently utilize the available GPU memory. This means improved stability is achieved more efficiently with longer training rollouts rather than smaller batches, as indicated by the green dots with the lowest training times.

## Summary