
Discussion
=======================

The training via differentiable physics as described so far allows us
to integrate full numerical simulations and the training of deep neural networks
interacting with these simulations. While we've only hinted at what could be
achieved via DP approaches, it is nonetheless a good time to discuss some
additional properties and summarize the pros and cons.


![Divider](resources/divider4.jpg)


## Time steps and iterations

When using DP approaches for learning applications, there is a large amount of flexibility
w.r.t. the combination of DP and NN building blocks.

Just as a reminder, this is the previously shown _overview_ figure to illustrate the combination 
of NNs and DP operators. Here, these operators look like a loss term: they typically don't have weights,
and only provide a gradient that influences the optimization of the NN weights:

```{figure} resources/diffphys-shortened.jpg
---
height: 220px
name: diffphys-short
---
The DP approach as described in the previous chapters. A network produces an input to a PDE solver $\mathcal P$, which provides a gradient for training during the backpropagation step.
```

This setup can be seen as the network receiving information about how its output influences the outcome of the PDE solver. I.e., the gradient provides information on how to produce an NN output that minimizes the loss. E.g., in line with the previously described _physical losses_ (from {doc}`physicalloss`), this could mean upholding a conservation law.
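
As a minimal sketch of this gradient flow, the following toy example chains a one-weight "network" with a scalar solver and a hand-written solver derivative. Everything in it is an assumption made for illustration: the names `network`, `solver` and `solver_grad`, the decay equation, and all constants are hypothetical, and a DP framework would supply the solver derivative automatically instead of requiring it by hand.

```python
# Toy setup: a one-weight "network" feeds a solver, and the solver's
# derivative is chained into the weight update. All names are hypothetical.

LAM, DT, N_STEPS = 0.5, 0.1, 10


def network(x, w):
    """Stand-in for an NN: a single trainable weight."""
    return w * x


def solver(s):
    """Explicit Euler steps of ds/dt = -LAM * s (stand-in for the solver P)."""
    for _ in range(N_STEPS):
        s = s - DT * LAM * s
    return s


def solver_grad(_s):
    """dP/ds written by hand; this solver is linear, so it is a constant."""
    return (1.0 - DT * LAM) ** N_STEPS


def loss_and_grad(w, x, target):
    s0 = network(x, w)               # forward: NN output
    s1 = solver(s0)                  # forward: solver output
    residual = s1 - target
    loss = 0.5 * residual ** 2
    # backward: dL/dw = dL/ds1 * dP/ds0 * ds0/dw
    grad_w = residual * solver_grad(s0) * x
    return loss, grad_w


def train(x=1.0, target=0.3, w=0.0, lr=0.5, iters=200):
    """Plain gradient descent on the single weight."""
    for _ in range(iters):
        _, g = loss_and_grad(w, x, target)
        w -= lr * g
    return w
```

Note that the solver itself has no trainable weights here: it only contributes the factor `solver_grad(s0)` to the chain rule, which is what steers the weight towards an output that the solver maps onto the target.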

**Switching the Order** 

However, with DP, there's no real reason to be limited to this setup. E.g., we could imagine switching the NN and DP components, giving the following structure:

```{figure} resources/diffphys-switched.jpg
---
height: 220px
name: diffphys-switch
---
A PDE solver produces an output which is processed by an NN.
```

In this case the PDE solver essentially represents an _on-the-fly_ data generator. This is not always useful: this setup could be replaced by a pre-computation of the same inputs, as the PDE solver is not influenced by the NN. Hence, we could replace the $\mathcal P$ invocations by a "loading" function. On the other hand, evaluating the PDE solver at training time with a randomized sampling of the parameter domain of interest can lead to an excellent sampling of the data distribution of the input, and hence yield accurate and stable NNs. If done correctly, the solver can alleviate the need to store and load large amounts of data, and instead produce them more quickly at training time, e.g., directly on a GPU.
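
The two options can be contrasted with a small sketch. The scalar `solver` below is a hypothetical stand-in for $\mathcal P$, and the function names, sampling range and batch sizes are assumptions chosen for illustration only.

```python
import random


def solver(amplitude, n_steps=20, dt=0.05, lam=1.0):
    """Hypothetical stand-in for P: explicit Euler decay of an amplitude."""
    s = amplitude
    for _ in range(n_steps):
        s = s - dt * lam * s
    return s


def on_the_fly_batches(batch_size, n_batches, seed=0):
    """Yield (parameter, solver output) pairs, sampled fresh for every batch."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        params = [rng.uniform(0.5, 2.0) for _ in range(batch_size)]
        yield [(p, solver(p)) for p in params]


def precomputed_batches(dataset, batch_size):
    """The "loading" alternative: slice a fixed, stored data set."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]
```

An NN training loop could consume either generator in the same way. The difference is that `on_the_fly_batches` never revisits a stored sample and needs no storage, while `precomputed_batches` avoids re-running the solver.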

**Time Stepping** 

In general, there's no combination of NN layers and DP operators that is _forbidden_ (as long as their dimensions are compatible). One that makes particular sense is to "unroll" the iterations of a time stepping process of a simulator, and let the state of a system be influenced by an NN.

In this case we compute a (potentially very long) sequence of PDE solver steps in the forward pass. In between these solver steps, an NN modifies the state of our system, which is then used to compute the next PDE solver step. During the backpropagation pass, we move backwards through all of these steps to evaluate contributions to the loss function (it can be evaluated in one or more places anywhere in the execution chain), and to backpropagate the gradient information through the DP and NN operators. This unrolling of solver iterations essentially gives feedback to the NN about how its "actions" influence the state of the physical system and the resulting loss. Due to the iterative nature of this process, many errors increase exponentially over the course of iterations, and are extremely difficult to detect in a single evaluation. In these cases it is crucial to provide feedback to the NN at training time on how the errors evolve over the course of the iterations. Note that in this case, a pre-computation of the states is not possible, as the iterations depend on the state of the NN, which is unknown before training. Hence, DP-based training is crucial to evaluate the correct gradient information at training time.

```{figure} resources/diffphys-multistep.jpg
---
height: 180px
name: diffphys-mulitstep
---
Time stepping with interleaved DP and NN operations for $k$ solver iterations.
```

Note that this picture (like the ones before) assumes an _additive_ influence of the NN. Of course, any differentiable operator could be used here to integrate the NN result into the state of the PDE. E.g., multiplicative modifications can be more suitable in certain settings, or in others the NN could modify the parameters of the PDE in addition to or instead of the state space. Likewise, the loss function is problem-dependent and can be computed in different ways.
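
To make the unrolling concrete, here is a minimal sketch of $k$ interleaved solver and NN steps with an additive correction, together with a hand-written reverse pass. It is a toy under stated assumptions rather than a framework-based implementation: the state is a single scalar, the "NN" is one weight `w`, the solver deliberately uses a wrong decay rate, the loss is evaluated after every step, and all names and constants are hypothetical.

```python
import math

DT, K = 0.1, 20
LAM_SOLVER, LAM_TRUE = 1.0, 1.5   # the solver's model is deliberately off


def solver_step(s):
    """One explicit Euler step of ds/dt = -LAM_SOLVER * s."""
    return s - DT * LAM_SOLVER * s


def unrolled_loss_and_grad(w, s0=1.0):
    """Forward pass over K interleaved solver/NN steps, then a reverse pass."""
    refs = [s0 * math.exp(-LAM_TRUE * DT * (i + 1)) for i in range(K)]

    # Forward: keep the solver outputs, they are needed in the backward pass.
    solver_outs, states, s = [], [], s0
    for _ in range(K):
        p = solver_step(s)       # DP operator
        s = p + w * p            # additive correction by the one-weight "NN"
        solver_outs.append(p)
        states.append(s)
    loss = sum(0.5 * (st - r) ** 2 for st, r in zip(states, refs))

    # Backward: walk the chain in reverse and accumulate dL/dw.
    grad_w, adj = 0.0, 0.0
    for i in reversed(range(K)):
        adj += states[i] - refs[i]       # loss contribution at this step
        grad_w += adj * solver_outs[i]   # d(state after step i)/dw = p_i
        adj *= (1.0 + w) * (1.0 - DT * LAM_SOLVER)  # back through NN, solver
    return loss, grad_w


def train(w=0.0, lr=0.005, iters=300):
    """Plain gradient descent on the correction weight."""
    for _ in range(iters):
        _, g = unrolled_loss_and_grad(w)
        w -= lr * g
    return w
```

In this toy, `train()` ends up with a slightly negative weight: the solver decays too slowly compared to the reference, and the correction compensates for that in every step. The states seen during training depend on `w`, which is why they cannot be pre-computed.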

DP setups with many time steps can be difficult to train: the gradients need to backpropagate through the full chain of PDE solver evaluations and NN evaluations. Typically, each of them represents a non-linear and complex function. Hence, for larger numbers of steps, the vanishing and exploding gradient problem can make training difficult (see {doc}`diffphys-code-sol` for some practical tips on how to alleviate this).
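
The effect is easy to reproduce in a scalar toy. The sketch below assumes that every step of the chain has the same derivative, which is a strong simplification of a real solver and NN chain; the function name is hypothetical.

```python
def chain_gradient(step_derivative, n_steps):
    """d s_n / d s_0 for a chain in which every step has the same derivative."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= step_derivative   # chain rule: multiply the per-step factors
    return grad
```

With a per-step derivative of 0.9, `chain_gradient(0.9, 100)` shrinks to roughly `2.7e-5`, while a derivative of 1.1 gives roughly `1.4e4` after the same 100 steps. A per-step factor only slightly away from one is thus enough to make the gradient vanish or explode over long rollouts.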

## Alternatives: noise

It is worth mentioning here that other works have proposed perturbing the inputs and 
the iterations at training time with noise {cite}`sanchez2020learning` (somewhat similar to
regularizers like dropout). 
This can help to prevent overfitting to the training states, and hence shares similarities
with the goals of training with DP. 

However, the noise is typically undirected, and hence not as accurate as training with 
the actual evolutions of simulations. Hence, this noise can be a good starting point 
for training that tends to overfit, but if possible, it is preferable to incorporate the
actual solver in the training loop via a DP approach.
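
A minimal sketch of such a noise-based training setup is shown below. It is not the exact scheme of the cited work: the Gaussian noise model, the value of `sigma`, and the function names are assumptions chosen for illustration.

```python
import random


def perturb(state, sigma, rng):
    """Add zero-mean Gaussian noise to every entry of a state (list of floats)."""
    return [v + rng.gauss(0.0, sigma) for v in state]


def noisy_training_pairs(states, sigma=0.01, seed=0):
    """Pair each perturbed state with the clean next state as the target."""
    rng = random.Random(seed)
    for current, nxt in zip(states[:-1], states[1:]):
        yield perturb(current, sigma, rng), nxt
```

The perturbation is drawn independently of the dynamics, which is what "undirected" means here: unlike an unrolled solver, it does not tell the NN which deviations actually arise from its own outputs.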


![Divider](resources/divider5.jpg)

## Summary

To summarize the pros and cons of training NNs via DP:

✅ Pro: 
- Uses physical model and numerical methods for discretization.
- Efficiency of selected methods carries over to training.
- Tight coupling of physical models and NNs possible.

❌ Con: 
- Not compatible with all simulators (need to provide gradients).
- Requires heavier machinery (in terms of framework support) than previously discussed methods.

Especially the last negative point regarding heavy machinery is one that is bound to strongly improve in a fairly short time, but for now it's important to keep in mind that not every simulator is suitable for DP training out of the box. Hence, in this book we'll focus on examples using phiflow, which was designed for interfacing with deep learning frameworks. 

Next, we can target some more complex scenarios to showcase what can be achieved with differentiable physics.
This will also illustrate how the right selection of a numerical method for a DP operator yields improvements in terms of training accuracy.