extended advection example, comments Maxi

This commit is contained in:
NT 2021-07-15 14:49:04 +02:00
parent 7614288ac5
commit 776b9ca2e3

View File

@ -34,7 +34,7 @@ provide directions in the form of gradients to steer the learning process.
With the DP direction we build on existing numerical solvers. I.e.,
the approach is strongly relying on the algorithms developed in the larger field
of computational methods for a vast range of physical effects in our world.
To start with we need a continuous formulation as model for the physical effect that we'd like
To start with, we need a continuous formulation as model for the physical effect that we'd like
to simulate -- if this is missing we're in trouble. But luckily, we can
tap into existing collections of model equations and established methods
for discretizing continuous models.
@ -45,9 +45,10 @@ with model parameters $\nu$ (e.g., diffusion, viscosity, or conductivity constan
The component of $\mathbf{u}$ will be denoted by a numbered subscript, i.e.,
$\mathbf{u} = (u_1,u_2,\dots,u_d)^T$.
%and a corresponding discrete version that describes the evolution of this quantity over time: $\mathbf{u}_t = \mathcal P(\mathbf{x}, \mathbf{u}, t)$.
Typically, we are interested in the temporal evolution of such a system,
and discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
that we can re-arrange to compute a future state after a time step $\Delta t$ via sequence of
Typically, we are interested in the temporal evolution of such a system.
Discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
that we can re-arrange to compute a future state after a time step $\Delta t$.
The state at $t+\Delta t$ is computed via sequence of
operations $\mathcal P_1, \mathcal P_2 \dots \mathcal P_m$ such that
$\mathbf{u}(t+\Delta t) = \mathcal P_1 \circ \mathcal P_2 \circ \dots \mathcal P_m ( \mathbf{u}(t),\nu )$,
where $\circ$ denotes function decomposition, i.e. $f(g(x)) = f \circ g(x)$.
@ -151,9 +152,9 @@ E.g., as this process is very similar to adjoint method optimizations, we can re
that were developed in this field, or leverage established numerical methods. E.g.,
we could leverage the $O(n)$ runtime of multigrid solvers for matrix inversion.
The flip-side of this approach is, that it requires some understanding of the problem at hand,
The flip-side of this approach is that it requires some understanding of the problem at hand,
and of the numerical methods. Also, a given solver might not provide gradient calculations out of the box.
Thus, we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
Thus, if we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
idea to directly go for learning via a DP approach. However, if we don't really understand our model, we probably
should go back to studying it a bit more anyway...
@ -199,14 +200,16 @@ $$
d(t+\Delta t) = \mathcal P ( ~ d(t), \mathbf{u}, t+\Delta t)
$$
As a simple example of an optimization and learning task, let's consider the problem of
finding a motion $\mathbf{u}$ such that starting with a given initial state $d^{~0}$ at $t^0$,
the time evolved scalar density at time $t^e$ has a certain shape or configuration $d^{\text{target}}$.
Informally, we'd like to find a motion that deforms $d^{~0}$ into a target state.
As a simple example of an inverse problem and learning task, let's consider the problem of
finding a unknown motion $\mathbf{u}$:
this motion should transform a given initial scalar density state $d^{~0}$ at time $t^0$
into state that's evolved by $\mathcal P$ to a later "end" time $t^e$
with a certain shape or configuration $d^{\text{target}}$.
Informally, we'd like to find a motion that deforms $d^{~0}$ through the PDE model into a target state.
The simplest way to express this goal is via an $L^2$ loss between the two states. So we want
to minimize the loss function $L=|d(t^e) - d^{\text{target}}|^2$.
Note that as described here this is a pure optimization task, there's no NN involved,
Note that as described here this inverse problem is a pure optimization task: there's no NN involved,
and our goal is to obtain $\mathbf{u}$. We do not want to apply this motion to other, unseen _test data_,
as would be custom in a real learning task.
@ -214,7 +217,7 @@ The final state of our marker density $d(t^e)$ is fully determined by the evolut
of $\mathcal P$ via $\mathbf{u}$, which gives the following minimization problem:
$$
\text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e - t^0 ) - d^{\text{target}}|^2
\text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2
$$
We'd now like to find the minimizer for this objective by
@ -229,38 +232,58 @@ $\Delta \mathbf{u} = \partial L / \partial \mathbf{u}$,
which can be decomposed into
$\Delta \mathbf{u} =
\frac{ \partial d }{ \partial \mathbf{u}}
\frac{ \partial L }{ \partial d}
$.
And as the evolution of $d$ is given by our discretized physical model $P$,
what we're acutally looking for is the Jacobian
$\partial \mathcal P / \partial \mathbf{u}$ to
compute
$\Delta \mathbf{u} =
\frac{ \partial \mathcal P }{ \partial \mathbf{u}}
\frac{ \partial L }{ \partial d}$.
We luckily don't need $\partial \mathcal P / \partial \mathbf{u}$ as a full
matrix, but instead only mulitplied by the vector obtained from the derivative of our scalar
loss function $L$.
\frac{ \partial L }{ \partial d} $.
The $\frac{ \partial L }{ \partial d}$ component is typically simple enough: we'll get
$$
\frac{ \partial L }{ \partial d}
= \partial | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2 / \partial d
= 2 (d(t^e)-d^{\text{target}}).
$$
If $d$ is represented as a vector, e.g., for one entry per cell of a mesh,
$\frac{ \partial L }{ \partial d}$ will likewise be a column vector of equivalent size.
This is thanks to the fact that $L$ is always a scalar loss function, and hence the Jacobian
matrix will have a dimension of 1 along the $L$ dimension.
Intuitively, this vector will simply contain the differences between $d$ at the end time
in comparison to the target densities $d^{\text{target}}$.
The evolution of $d$ itself is given by our discretized physical model $\mathcal P$,
and we use $\mathcal P$ and $d$ interchangeably.
Hence, the more interesting component is the Jacobian
$\partial d / \partial \mathbf{u} = \partial \mathcal P / \partial \mathbf{u}$ to
compute the full $\Delta \mathbf{u} =
\frac{ \partial d }{ \partial \mathbf{u}}
\frac{ \partial L }{ \partial d}$.
We luckily don't need $\partial d / \partial \mathbf{u}$ as a full
matrix, but instead only multiplied by $\frac{ \partial L }{ \partial d}$.
So what is the actual Jacobian for $d$? To compute it we first need
to finalize our PDE model $\mathcal P$, such that we get an expression which we can derive.
In the next section we'll choose a specific advection scheme and a discretization
so that we can be more specific.
%the vector obtained from the derivative of our scalar loss function $L$.
%the $L^2$ loss $L= |d(t^e) - d^{\text{target}}|^2$, thus
So what are the actual Jacobians here?
The one for $L$ is simple enough, we simply get a column vector with entries of the form
$2(d(t^e)_i - d^{\text{target}})_i$ for one component $i$.
$\partial \mathcal P / \partial \mathbf{u}$ is more interesting:
here we'll get derivatives of the chosen advection operator w.r.t. each component of the
velocities.
%So what are the actual Jacobians here? The one for $L$ is simple enough, we simply get a column vector with entries of the form $2(d_i(t^e) - d^{\text{target}}_i)$ for one component $i$.
%$\partial \mathcal P / \partial \mathbf{u}$ is more interesting: here we'll get derivatives of the chosen advection operator w.r.t. each component of the velocities.
%...to obtain an explicit update of the form $d(t+\Delta t) = A d(t)$, where the matrix $A$ represents the discretized advection step of size $\Delta t$ for $\mathbf{u}$. ... we'll get a matrix that essentially encodes linear interpolation coefficients for positions $\mathbf{x} + \Delta t \mathbf{u}$. For a grid of size $d_x \times d_y$ we'd have a
### Introducing a specific advection scheme
E.g., for a simple [first order upwinding scheme](https://en.wikipedia.org/wiki/Upwind_scheme)
on a Cartesian grid in 1D, with marker density and velocity $d_i$ and $u_i$ for cell $i$
(superscripts for time $t$ are omitted for brevity),
we get
In the following we'll make use of a simple [first order upwinding scheme](https://en.wikipedia.org/wiki/Upwind_scheme)
on a Cartesian grid in 1D, with marker density and velocity $d_i$ and $u_i$ for cell $i$.
We omit the $(t)$ for quantities at time $t$ for brevity, i.e., $d_i(t)$ is written as $d_i$ below.
From above, we'll use our _physical model_ that updates the marker density
$d_i(t+\Delta t) = \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t)$, which
gives the following:
$$ \begin{aligned}
& d_i^{~t+\Delta t} = d_i - u_i^+ (d_{i+1} - d_{i}) + u_i^- (d_{i} - d_{i-1}) \text{ with } \\
& d_i(t+\Delta t) = d_i - u_i^+ (d_{i+1} - d_{i}) + u_i^- (d_{i} - d_{i-1}) \text{ with } \\
& u_i^+ = \text{max}(u_i \Delta t / \Delta x,0) \\
& u_i^- = \text{min}(u_i \Delta t / \Delta x,0)
\end{aligned} $$
@ -274,19 +297,97 @@ name: advection-upwind
```
Thus, for a positive $u_i$ we have
$d_i^{~t+\Delta t} = (1 + \frac{u_i \Delta t }{ \Delta x}) d_i - \frac{u_i \Delta t }{ \Delta x} d_{i+1}$
and hence
$\partial \mathcal P / \partial u_i$ for cell $i$ would be $1 + \frac{u_i \Delta t }{ \Delta x}$.
For the full gradient we'd need to add
the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities.
In practice this step is similar to evaluating a transposed matrix multiplication.
$$
\mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = (1 + \frac{u_i \Delta t }{ \Delta x}) d_i - \frac{u_i \Delta t }{ \Delta x} d_{i+1}
$$ (eq:advection)
and hence $\partial \mathcal P / \partial u_i$ gives
$\frac{\Delta t }{ \Delta x} d_i - \frac{\Delta t }{ \Delta x} d_{i+1}$. Intuitively,
the change of the velocity $u_i$ depends on the spatial derivatives of the densities.
Due to the first order upwinding, we only include two neighbors (higher order methods would depend on
additional entries of $d$)
% velocity derivative , Just for completeness: another derivative we could compute here is
In practice this step is equivalent to evaluating a transposed matrix multiplication.
If we rewrite the calculation above as
$d^{~t+\Delta t} = A \mathbf{u}$, then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
$ \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = A \mathbf{u}$,
then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
However, in many practical cases, a matrix free implementation of this multiplication might
be preferable to actually constructing $A$.
## A (slightly) more complex example
% density derivative
Another derivative that we can consider for the advection scheme is that w.r.t. the previous
density state, i.e. $d_i(t)$, which is $d_i$ in the shortened notation.
$\partial \mathcal P / \partial d_i$ for cell $i$ from above gives $1 + \frac{u_i \Delta t }{ \Delta x}$. However, for the full gradient we'd need to add the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities. This derivative will come into play in the next section.
### Time evolution
So far we've only dealt with a single update step of
$d$ from time $t$ to $t+\Delta t$, but we could of course have an arbitrary number of such
steps. After all, above we stated the goal to advance the initial marker state $d(t^0)$ to
the target state at time $t^e$, which could encompass a long interval of time.
In the expression above for $d_i(t+\Delta t)$, each of the $d_i(t)$ in turn depend
on the velocity and density states at time $t-\Delta t$, i.e., $d_i(t-\Delta t)$. Thus we have to trace back
the influence of our loss $L$ all the way back to how $\mathbf{u}$ influences the initial marker
state. This can involve a large number of evaluations of our advection scheme via $\mathcal P$.
This sounds challenging at first:
e.g., one could try to insert equation {eq}`eq:advection` at time $t-\Delta t$
into equation {eq}`eq:advection` at $t$ and repeat this process recursively until
we have a single expression relating $d^{~0}$ to the targets. However, thanks
to the linear nature of the Jacobians, we can treat each advection step, i.e.,
each invocation of our PDE $\mathcal P$ as a seperate, modular
operation. And each of these invocations follows the procedure described
in the previous section.
Hence, given the machinery above, the backtrace is fairly simple to realize:
for each of the advection steps
in $\mathcal P$ we can compute a Jacobian product with the _incoming_ vector of derivatives
from the loss $L$ or a previous advection step. We repeat this until we have traced the chain from the
loss with $d^{\text{target}}$ all the way back to $d^{~0}$.
Theoretically, the velocity $\mathbf{u}$ could be a function of time, in which
case we'd get a gradient $\Delta \mathbf{u}(t)$ for every time step $t$. To simplify things
below, let's we assume we have field that is constant in time, i.e., we're
reusing the same velocities $\mathbf{u}$ for every advection via $\mathcal P$. Now, each time step
will give us a contribution to $\Delta \mathbf{u}$ which we can accumulate for all steps.
$$ \begin{aligned}
\Delta \mathbf{u} =&
\frac{ \partial d(t^e) }{ \partial \mathbf{u} }
\frac{ \partial L }{ \partial d(t^e) }
+
\frac{ \partial d(t^e - \Delta t) }{ \partial \mathbf{u}}
\frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) }
\frac{ \partial L }{ \partial d(t^e)}
+ \\
&
\cdots + \\
&
\Big( \frac{ \partial d(t^0) }{ \partial \mathbf{u}} \cdots
\frac{ \partial d(t^e - \Delta t) }{ \partial d(t^e - 2 \Delta t) }
\frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) }
\frac{ \partial L }{ \partial d(t^e)} \Big)
\end{aligned} $$
Here the last term above contains the full backtrace of the marker density to time $t^0$.
The terms of this sum look unwieldy
at first, but they contain a lot of similar Jacobians, and in practice can be computed efficiently
by backtracing through the sequence of computational steps in the forward evaluation of our PDE.
This structure also makes clear that the process is very similar to the regular training
process of an NN: the evaluations of these Jacobian vector products is exactly what
a deep learning framework does for training an NN (we just have weights $\theta$ instead
of a velocity field there). And hence all we need to do in practice is to provide a custom
function the Jacobian vector product for $\mathcal P$.
---
## Implicit gradient calculations
As a slightly more complex example let's consider Poisson's equation $\nabla^2 a = b$, where
$a$ is the quantity of interest, and $b$ is given.
@ -305,8 +406,8 @@ If we now introduce an NN that modifies $\mathbf{u}$ in a solver, we inevitably
backpropagate through the Poisson solve. I.e., we need a gradient for $\mathbf{u}^{n}$, which in this
notation takes the form $\partial \mathbf{u}^{n} / \partial \mathbf{u}$.
In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left( (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients simple to compute. The main difficulty lies in obtaining the
matrix inverse $(\nabla^2)^{-1}$ from the Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear).
In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left( (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients are simple to compute. The main difficulty lies in obtaining the
matrix inverse $(\nabla^2)^{-1}$ from Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear).
In practice, the matrix vector product for $(\nabla^2)^{-1} b$ with $b=\nabla \cdot \mathbf{u}$ is not explicitly computed via matrix operations, but approximated with a (potentially matrix-free) iterative solver. E.g., conjugate gradient (CG) methods are a very popular choice here. Thus, we could treat this iterative solver as a function $S$,
with $p = S(\nabla \cdot \mathbf{u})$. Note that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $S$ into a sequence of simpler operations $S(x) = S_n( S_{n-1}(...S_{1}(x)))$, and backpropagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and can be very slow.