
Physical Loss Terms
=======================

The supervised setting of the previous sections can quickly 
yield approximate solutions with a fairly simple training process, but what's
quite sad to see here is that we only use physical models and numerics
as an "external" tool to produce a big pile of data 😢.

We as humans have a lot of knowledge about how to describe physical processes
mathematically. As the following chapters will show, we can improve the
training process by guiding it with our human knowledge of physics.

```{figure} resources/physloss-overview.jpg
---
height: 220px
name: physloss-overview
---
Physical losses typically combine a supervised loss with a combination of derivatives from the neural network.
```

## Using physical models

Given a PDE for $\mathbf{u}(\mathbf{x},t)$ with a time evolution, 
we can typically express it in terms of a function $\mathcal F$ of the derivatives 
of $\mathbf{u}$ via  

$$
  \mathbf{u}_t = \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} ) ,
$$

where the $_{\mathbf{x}}$ subscripts denote spatial derivatives of increasing order with respect to one of the spatial dimensions
(this can of course also include mixed derivatives with respect to different axes).

In this context we can employ DL by approximating the unknown $\mathbf{u}$ itself 
with a NN, denoted by $\tilde{\mathbf{u}}$. If the approximation is accurate, the PDE
naturally should be satisfied, i.e., the residual $R$ should be equal to zero: 

$$
  R = \mathbf{u}_t - \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} ) = 0 .
$$

This nicely integrates with the objective for training a neural network: similar to before,
we can collect sample solutions 
$[x_0,y_0], ..., [x_n,y_n]$ for $\mathbf{u}$ with $\mathbf{u}(\mathbf{x})=y$. 
This is typically important, as most practical PDEs we encounter do not have unique solutions
unless initial and boundary conditions are specified. If we only consider $R$, we might
get solutions with a random offset or other undesirable components. The supervised sample points
therefore help to _pin down_ the solution in certain places.
Now our training objective becomes

$$
\text{arg min}_{\theta} \ \alpha_0 \sum_i (f(x_i ; \theta)-y_i)^2 + \alpha_1 R(x_i) ,
$$ (physloss-training)

where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and 
the residual term, respectively. We could of course add additional residual terms with suitable scaling factors here.
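
As a minimal sketch of how the two terms of this objective interact, the following pure-Python snippet evaluates the combined loss for a one-parameter stand-in model instead of a real NN. Everything specific here is an assumption made for illustration: the diffusion equation $u_t = \nu u_{xx}$ as the PDE, the model family $e^{-\theta t}\sin(x)$, the sample locations, and the values of `NU`, `ALPHA_0` and `ALPHA_1`. The residual is squared before summation, which is a choice of this sketch so that positive and negative residuals cannot cancel.

```python
import math

NU = 0.1                     # assumed diffusion coefficient of u_t = NU * u_xx
ALPHA_0, ALPHA_1 = 1.0, 1.0  # assumed weights of supervised and residual terms


def model(x, t, theta):
    # Hypothetical one-parameter stand-in for the NN f(x; theta).
    return math.exp(-theta * t) * math.sin(x)


def residual(x, t, theta):
    # R = u_t - NU * u_xx, using the analytic derivatives of `model`.
    u = model(x, t, theta)
    u_t = -theta * u
    u_xx = -u
    return u_t - NU * u_xx


# Supervised samples ((x, t), y), generated from the exact solution theta = NU.
points = [(0.3, 0.0), (1.0, 0.5), (2.0, 1.0), (2.5, 2.0)]
samples = [((x, t), model(x, t, NU)) for x, t in points]


def loss(theta):
    supervised = sum((model(x, t, theta) - y) ** 2 for (x, t), y in samples)
    physics = sum(residual(x, t, theta) ** 2 for (x, t), _ in samples)
    return ALPHA_0 * supervised + ALPHA_1 * physics


for theta in (0.05, 0.1, 0.2):
    print(f"theta={theta}: loss={loss(theta):.3e}")
```

Only $\theta = \nu$ matches both the samples and the PDE, so the loss vanishes there and grows for the other two values. Note that the sample at $t=0$ carries no supervised signal about $\theta$ at all, and only the residual term distinguishes the candidates there.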

Note that, similar to the data samples used for supervised training, we have no guarantees that the
residual terms $R$ will actually reach zero during training. The non-linear optimization of the training process
will minimize the supervised and residual terms as much as possible, but in the worst case, large non-zero residual 
contributions can remain. We'll look at this in more detail in the upcoming code example. For now, it's important 
to remember that physical constraints included in this way only represent _soft constraints_, without any guarantee
that they are actually satisfied.

## Neural network derivatives

In order to compute the residuals at training time, it would be possible to store 
the unknowns of $\mathbf{u}$ on a computational mesh, e.g., a grid, and discretize the equations of
$R$ there. This has a fairly long "tradition" in DL, and was proposed by Tompson et al. {cite}`tompson2017` early on.
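
To illustrate what discretizing $R$ on a grid can look like, here is a small sketch with central finite differences. It is not the method of the cited work. The diffusion equation $u_t = \nu u_{xx}$, the grid resolution, and all names are assumptions chosen for brevity. The grid is filled with a known exact solution, so the remaining residual only reflects the discretization error.

```python
import math

NU = 0.1           # assumed diffusion coefficient
NX, NT = 65, 201   # assumed number of grid nodes in space and time
DX = math.pi / (NX - 1)
DT = 1.0 / (NT - 1)

# Unknowns stored on a space-time grid: grid[n][i] approximates u(x_i, t_n).
# Here it is filled with the exact solution exp(-NU * t) * sin(x).
grid = [
    [math.exp(-NU * n * DT) * math.sin(i * DX) for i in range(NX)]
    for n in range(NT)
]


def grid_residual(u):
    """Discretized R = u_t - NU * u_xx on all interior grid nodes."""
    res = []
    for n in range(1, len(u) - 1):
        row = []
        for i in range(1, len(u[n]) - 1):
            u_t = (u[n + 1][i] - u[n - 1][i]) / (2 * DT)
            u_xx = (u[n][i + 1] - 2 * u[n][i] + u[n][i - 1]) / DX**2
            row.append(u_t - NU * u_xx)
        res.append(row)
    return res


max_res = max(abs(r) for row in grid_residual(grid) for r in row)
print(f"largest residual on the grid: {max_res:.2e}")
```

In a training setup, the grid values would come from the network output rather than from an analytic expression, and the residual entries would enter the loss.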

A popular variant of employing physical soft-constraints {cite}`raissi2018hiddenphys`
instead uses fully connected NNs to represent $\mathbf{u}$. This has some interesting pros and cons that we'll outline below, and it is also the variant we will focus on in the following code examples and comparisons.

The central idea here is that the aforementioned general function $f$ that we're after in our learning problems
can also be used to obtain a representation of a physical field, e.g., a field $\mathbf{u}$ that satisfies $R=0$. This means $\mathbf{u}(\mathbf{x})$ will 
be turned into $\mathbf{u}(\mathbf{x}, \theta)$ where we choose the NN parameters $\theta$ such that a desired $\mathbf{u}$ is 
represented as precisely as possible.

One nice side effect of this viewpoint is that NN representations inherently support the calculation of derivatives. 
The derivative $\partial f / \partial \theta$ was a key building block for learning via gradient descent, as explained 
in {doc}`overview`. Now, we can use the same tools to compute spatial derivatives such as $\partial \mathbf{u} / \partial x$.
Note that above for $R$ we've written this derivative in the shortened notation as $\mathbf{u}_{x}$.
For functions over time this of course also works for $\partial \mathbf{u} / \partial t$, i.e., $\mathbf{u}_{t}$ in the notation above.

Thus, for some generic $R$, made up of $\mathbf{u}_t$ and $\mathbf{u}_{x}$ terms, we can rely on the backpropagation algorithm
of DL frameworks to compute these derivatives once we have a NN that represents $\mathbf{u}$. Essentially, this gives us a 
function (the NN) that receives space and time coordinates to produce a solution for $\mathbf{u}$. Hence, the input is typically
quite low-dimensional, e.g., 3+1 values for a 3D case over time, and the network often produces a scalar value or a spatial vector.
Due to the lack of explicit spatial sampling points, an MLP, i.e., a fully-connected NN, is the architecture of choice here.

To pick a simple example: for Burgers' equation in 1D,
$\frac{\partial u}{\partial{t}} + u \nabla u = \nu \nabla \cdot \nabla u $, we can directly
formulate a loss term $R = \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2}$ that should be minimized as much as possible at training time. For each of the terms, e.g. $\frac{\partial u}{\partial x}$,
we can simply query the DL framework that realizes $u$ to obtain the corresponding derivative. 
For higher order derivatives, such as $\frac{\partial^2 u}{\partial x^2}$, we can simply query the derivative function of the framework multiple times. In the following section, we'll give a specific example of how that works in TensorFlow.
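
Before turning to a real framework, the idea of querying a derivative function repeatedly can be sketched in plain Python. The snippet below uses a tiny forward-mode automatic differentiation based on dual numbers as a stand-in for the framework's gradient function. This is an assumption of the sketch, since DL frameworks typically rely on backpropagation. The `Dual` class, the `derivative` helper, the fixed weights of the small tanh network, and the value of `NU` are all hypothetical and only serve to make the example self-contained.

```python
import math


class Dual:
    """Number val + eps * e with e**2 = 0; components may again be Dual."""

    def __init__(self, val, eps):
        self.val, self.eps = val, eps

    def __add__(self, other):
        o = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val + o.val, self.eps + o.eps)

    __radd__ = __add__

    def __neg__(self):
        return Dual(-self.val, -self.eps)

    def __sub__(self, other):
        return self + (-other)

    def __rsub__(self, other):
        return (-self) + other

    def __mul__(self, other):
        o = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)

    __rmul__ = __mul__


def tanh(z):
    # tanh with derivative rule d tanh(z) = (1 - tanh(z)^2) dz
    if isinstance(z, Dual):
        t = tanh(z.val)
        return Dual(t, z.eps * (1.0 - t * t))
    return math.tanh(z)


def derivative(f):
    """Return a function that evaluates df/dz for a scalar function f(z)."""
    def df(z):
        out = f(Dual(z, 1.0))
        return out.eps if isinstance(out, Dual) else 0.0
    return df


# Hypothetical fixed weights of a tiny MLP u(x, t) with one hidden tanh layer.
W_X, W_T = [0.5, -1.2, 0.8], [0.3, 0.7, -0.4]
BIAS, W_OUT = [0.1, -0.2, 0.05], [1.0, -0.5, 0.25]
NU = 0.01  # assumed viscosity


def u(x, t):
    return sum(w * tanh(wx * x + wt * t + b)
               for w, wx, wt, b in zip(W_OUT, W_X, W_T, BIAS))


def burgers_residual(x, t):
    u_t = derivative(lambda s: u(x, s))(t)
    u_x_fn = derivative(lambda s: u(s, t))   # first query: du/dx
    u_xx = derivative(u_x_fn)(x)             # second query: d/dx of du/dx
    return u_t + u(x, t) * u_x_fn(x) - NU * u_xx


print(burgers_residual(0.3, 0.2))
```

Since the weights are arbitrary, the printed residual is not expected to be close to zero. Training would adjust the weights to reduce it, together with the supervised term.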


## Summary so far

The approach above gives us a method to include physical equations in deep learning as a soft constraint: the residual loss.
Typically, this setup is suitable for _inverse problems_, where we have certain measurements or observations
for which we want to find a PDE solution. Because of the high cost of the reconstruction (to be 
demonstrated in the following), the solution manifold shouldn't be overly complex. E.g., it is not possible 
to capture a wide range of solutions, such as with the previous supervised airfoil example, with such a physical residual loss.