of higher and higher order (this can of course also include mixed derivatives with respect to different axes). $\mathbf{u}_t$ denotes the change of $\mathbf{u}$ over time, i.e., its time derivative.
In this context, we can approximate the unknown $\mathbf{u}$ itself with a neural network. If the approximation, which we call $\tilde{\mathbf{u}}$, is accurate, the PDE should be satisfied naturally. In other words, the residual $R$ should be equal to zero:

$$
  R = \mathbf{u}_t - \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, \dots ) = 0 .
$$

This residual can be combined with a conventional supervised loss over a set of training samples $(\mathbf{x}_i, y^*_i)$, weighting the two contributions with factors $\alpha_{0,1}$:

$$
\text{arg min}_{\theta} \ \sum_i \alpha_0 \big( f(\mathbf{x}_i ; \theta) - y^*_i \big)^2 + \alpha_1 R(\mathbf{x}_i) .
$$ (physloss-training)
It is instructive to note what the two different terms in equation {eq}`physloss-training` mean: The first term is a conventional, supervised L2-loss. If we were to optimize only this loss, our network would learn to approximate the training samples well, but might average multiple modes in the solutions, and do poorly in regions in between the sample points.
If we, instead, were to optimize only the second term (the physical residual), our neural network might be able to locally satisfy the PDE, but could still produce solutions that are far away from our training data. This can happen due to "null spaces" in the solutions, i.e., different solutions that all satisfy the residuals.
Therefore, we optimize both objectives simultaneously such that, in the best case, the network learns to approximate the specific solutions of the training data while still capturing knowledge about the underlying PDE.
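To make this concrete, the following is a minimal sketch of such a combined objective in a DL framework. The network `f`, the sample points `x` with targets `y_target`, and the residual function `pde_residual` are placeholders for whatever problem is actually being solved, and the residual is penalized quadratically here.

```python
import tensorflow as tf

def combined_loss(f, x, y_target, pde_residual, alpha_0=1.0, alpha_1=1.0):
    """Hypothetical combined objective: supervised L2 term plus physical residual."""
    u = f(x)                                                         # u = f(x; theta)
    loss_supervised = tf.reduce_mean(tf.square(u - y_target))        # first term of {eq}`physloss-training`
    loss_residual   = tf.reduce_mean(tf.square(pde_residual(u, x)))  # second term, squared residual
    return alpha_0 * loss_supervised + alpha_1 * loss_residual
```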
The previous overview did not really make clear how an NN produces $\mathbf{u}$.
We can distinguish two different approaches here:
via a chosen explicit representation of the target function (v1 in the following), or via fully-connected neural networks that represent the solution (v2).
E.g., for v1 we could set up a _spatial_ grid (or graph, or a set of sample points), while in the second case no explicit representation exists, and the NN instead receives the _spatial coordinate_ to produce the solution at a query position.
We'll outline these two variants in more detail in the following.
---
## Variant 1: Residual derivatives for explicit representations
For variant 1, we choose the discretization and set up a computational mesh that covers our target domain. Without loss of generality, let's assume this is a Cartesian grid that samples the space with positions $\mathbf{p}$. Now, an NN is trained to produce the solution on the grid: $\mathbf{u}(\mathbf{p}) = f(\mathbf{x} ; \theta)$. For a regular grid, a CNN would be a good choice for $f$, while for triangle meshes we could use a graph-network, or a network with point-convolutions for particles.
```{figure} resources/physloss-overview-v1.jpg
---
height: 220px
name: physloss-overview-v1
---
Variant 1: the solution is represented by a chosen computational mesh, and produced by an NN. The residual is discretized there, and can be combined with supervised terms.
```
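As an illustration of this setup, here is a minimal sketch of such a network $f$ for a regular grid: a small fully convolutional model that maps a gridded input field to an output field at the same resolution. The channel counts and layer sizes are arbitrary choices for illustration.

```python
import tensorflow as tf

def make_grid_network(in_channels=2, out_channels=1):
    # fully convolutional f(x; theta): grid in, grid out, for any resolution H x W
    inp = tf.keras.Input(shape=(None, None, in_channels))
    h = tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu')(inp)
    h = tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu')(h)
    out = tf.keras.layers.Conv2D(out_channels, 5, padding='same')(h)
    return tf.keras.Model(inp, out)
```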
Now, we can discretize the residual
$R$ on our computational mesh and compute derivatives with our method of choice. The only caveat: to incorporate the residual
into training, we have to formulate the evaluation such that a deep learning framework can backpropagate through the
calculations. As our network $f()$ produces the solution $\mathbf{u}$, and the residual depends on it ($R(\mathbf{u})$), we at least need $\partial R / \partial \mathbf u$, such that the gradient can be backpropagated for the weights $\theta$. Luckily, if we formulate $R$ in terms of operations of a DL framework, this will be taken care of by the backpropagation functionality of the framework.
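As a small illustration, a finite-difference derivative assembled from regular framework operations is automatically differentiable, so a residual built from such operators can be backpropagated without extra work. The sketch below assumes a periodic 1D grid with spacing `dx` and uses `tf.roll` for the neighbor access.

```python
import tensorflow as tf

def ddx(u, dx):
    # central finite difference on a periodic 1D grid, built from differentiable TF ops
    return (tf.roll(u, shift=-1, axis=-1) - tf.roll(u, shift=1, axis=-1)) / (2.0 * dx)

u = tf.Variable(tf.random.normal([1, 64]))            # stand-in for a network output
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(ddx(u, dx=0.1)))  # e.g., a squared residual term
grad = tape.gradient(loss, u)   # well-defined, and can be chained into d loss / d theta
```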
This variant has a fairly long "tradition" in DL, and was, e.g., proposed by Tompson et al. {cite}`tompson2017` early on to learn
divergence-free motions. To give a specific example: if our goal is to learn velocities $\mathbf u(t)$ which are divergence-free, i.e. $\nabla \cdot \mathbf u=0$, we can employ this training approach to train an NN without having to pre-compute divergence-free velocity fields as training data. For brevity, we will drop the spatial index ($\mathbf p$) here, and focus on $t$, which we can likewise simplify: divergence-freeness has to hold at all times, and hence we can consider a single step from $t=0$ with $\Delta t=1$, i.e., a normalized step from a divergent $\mathbf u(0)$ to a divergence-free $\mathbf u(1)$. For a normal solver, we'd have to compute a pressure
$p = \nabla^{-2} \, \nabla \cdot \mathbf{u}(0)$, i.e. the solution of the Poisson problem $\nabla^2 p = \nabla \cdot \mathbf{u}(0)$, such that $\mathbf{u}(1) = \mathbf{u}(0) - \nabla p$. This is the famous fundamental
theorem of vector calculus, or
[Helmholtz decomposition](https://en.wikipedia.org/wiki/Helmholtz_decomposition), splitting a vector field into a _solenoidal_ (divergence-free) and irrotational part (the pressure gradient).
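To see why this projection removes the divergence, we can insert the definition of $\mathbf{u}(1)$ and use the Poisson equation for $p$:

$$
\nabla \cdot \mathbf{u}(1) = \nabla \cdot \mathbf{u}(0) - \nabla \cdot \nabla p = \nabla \cdot \mathbf{u}(0) - \nabla^2 p = 0 .
$$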
To learn this decomposition, we can approximate $p$ with a CNN on our computational mesh: $p = f(\mathbf{u}(0) ; \theta)$. The learning objective becomes minimizing the divergence of $\mathbf u(1)$:

$$
\text{arg min}_{\theta} \ \big| \nabla \cdot \big( \mathbf{u}(0) - \nabla f(\mathbf{u}(0);\theta) \big) \big|^2 .
$$
To implement this residual, all we need to do is provide the divergence operator $(\nabla \cdot)$ of $\mathbf u$ on our computational mesh. This is typically easy to do via
a convolutional layer in the DL framework that contains the finite difference weights for the divergence.
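A sketch of what this could look like, assuming a collocated 2D grid with velocity channels $(u_x, u_y)$, central differences, and grid spacing `dx` (all of these are illustrative choices):

```python
import numpy as np
import tensorflow as tf

def make_divergence_op(dx=1.0):
    # non-trainable convolution kernel holding central finite-difference weights
    k = np.zeros([3, 3, 2, 1], dtype=np.float32)   # [height, width, in=2, out=1]
    k[1, 0, 0, 0], k[1, 2, 0, 0] = -1.0, 1.0       # d/dx acting on the u_x channel
    k[0, 1, 1, 0], k[2, 1, 1, 0] = -1.0, 1.0       # d/dy acting on the u_y channel
    kernel = tf.constant(k / (2.0 * dx))
    def divergence(vel):                           # vel: [batch, H, W, 2]
        return tf.nn.conv2d(vel, kernel, strides=1, padding='SAME')
    return divergence
```

The residual loss is then, e.g., `tf.reduce_mean(tf.square(divergence(u1)))` for the corrected field `u1`.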
Nicely enough, in this case we don't even need additional supervised samples, and can typically train purely with this residual formulation. Also, in contrast to variant 2 below, we can directly handle fairly large spaces of solutions here (we're not restricted to learning single solutions).
An example implementation can be found in this [code repository](https://github.com/tum-pbs/CG-Solver-in-the-Loop).
Overall, this variant 1 has a lot in common with _differentiable physics_ training (it's basically a subset). As we'll discuss differentiable physics in a lot more detail later on, we'll focus on the physics-informed variant 2 in the remainder of this chapter.
## Variant 2: Derivatives from a neural network representation
The second variant of employing physical residuals as soft constraints
instead uses fully connected NNs to represent $\mathbf{u}$. This _physics-informed_ approach was popularized by Raissi et al. {cite}`raissi2019pinn`, and has some interesting pros and cons that we'll outline in the following. We will target this physics-informed version (variant 2) in the following code examples and discussions.
Such a fully-connected network can also be used to obtain a representation of a physical field, e.g., a field $\mathbf{u}$ that satisfies $R=0$. This means $\mathbf{u}(\mathbf{x})$ will
be turned into $\mathbf{u}(\mathbf{x}, \theta)$, where we choose the NN parameters $\theta$ such that a desired $\mathbf{u}$ is represented as accurately as possible.
```{figure} resources/physloss-overview-v2.jpg
---
height: 220px
name: physloss-overview-v2
---
Variant 2: the solution is produced by a fully-connected network, typically requiring a supervised loss with a combination of derivatives from the neural network for the residual. These NN derivatives have their own share of advantages and disadvantages.
```
For Burgers' equation, for instance, we can formulate a loss term $R = \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2}$ that should be minimized as much as possible at training time. For each of the terms, e.g. $\frac{\partial u}{\partial x}$, we can simply query the DL framework that realizes $u$ to obtain the corresponding derivative.
For higher order derivatives, such as $\frac{\partial^2 u}{\partial x^2}$, we can simply query the derivative function of the framework multiple times. In the following section, we'll give a specific example of how that works in tensorflow.
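As a preview, here is a minimal sketch of this pattern with nested gradient tapes; `model` stands for any network mapping $(t, x)$ to $u$, and `nu` is the viscosity.

```python
import tensorflow as tf

def burgers_residual(model, t, x, nu=0.01):
    # t, x: [N, 1] tensors of sample coordinates
    with tf.GradientTape() as tape2:
        tape2.watch(x)
        with tf.GradientTape(persistent=True) as tape1:
            tape1.watch([t, x])
            u = model(tf.concat([t, x], axis=-1))
        u_t = tape1.gradient(u, t)
        u_x = tape1.gradient(u, x)      # first spatial derivative
    u_xx = tape2.gradient(u_x, x)       # second derivative: differentiate u_x once more
    return u_t + u * u_x - nu * u_xx    # residual R for Burgers' equation
```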