extended advection example, comments Maxi

2021-07-15 14:49:04 +02:00
parent 7614288ac5
commit 776b9ca2e3
1 changed files with 146 additions and 45 deletions
--- a/diffphys.md
+++ b/diffphys.md
@@ -34,7 +34,7 @@ provide directions in the form of gradients to steer the learning process.
 With the DP direction we build on existing numerical solvers. I.e., 
 the approach is strongly relying on the algorithms developed in the larger field 
 of computational methods for a vast range of physical effects in our world.
-To start with we need a continuous formulation as model for the physical effect that we'd like 
+To start with, we need a continuous formulation as model for the physical effect that we'd like 
 to simulate -- if this is missing we're in trouble. But luckily, we can 
 tap into existing collections of model equations and established methods
 for discretizing continuous models.
@@ -45,9 +45,10 @@ with model parameters $\nu$ (e.g., diffusion, viscosity, or conductivity constan
 The component of $\mathbf{u}$ will be denoted by a numbered subscript, i.e.,
 $\mathbf{u} = (u_1,u_2,\dots,u_d)^T$.
 %and a corresponding discrete version that describes the evolution of this quantity over time: $\mathbf{u}_t = \mathcal P(\mathbf{x}, \mathbf{u}, t)$.
-Typically, we are interested in the temporal evolution of such a system, 
-and discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
-that we can re-arrange to compute a future state after a time step $\Delta t$ via sequence of
+Typically, we are interested in the temporal evolution of such a system.
+Discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
+that we can re-arrange to compute a future state after a time step $\Delta t$. 
+The state at $t+\Delta t$ is computed via sequence of
 operations $\mathcal P_1, \mathcal P_2 \dots \mathcal P_m$ such that
 $\mathbf{u}(t+\Delta t) = \mathcal P_1 \circ \mathcal P_2 \circ \dots \mathcal P_m ( \mathbf{u}(t),\nu )$,
 where $\circ$ denotes function decomposition, i.e. $f(g(x)) = f \circ g(x)$.
@@ -151,9 +152,9 @@ E.g., as this process is very similar to adjoint method optimizations, we can re
 that were developed in this field, or leverage established numerical methods. E.g., 
 we could leverage the $O(n)$ runtime of multigrid solvers for matrix inversion.

-The flip-side of this approach is, that it requires some understanding of the problem at hand, 
+The flip-side of this approach is that it requires some understanding of the problem at hand, 
 and of the numerical methods. Also, a given solver might not provide gradient calculations out of the box.
-Thus, we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
+Thus, if we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
 idea to directly go for learning via a DP approach. However, if we don't really understand our model, we probably
 should go back to studying it a bit more anyway...

@@ -199,14 +200,16 @@ $$
    d(t+\Delta t) = \mathcal P ( ~ d(t), \mathbf{u}, t+\Delta t) 
 $$

-As a simple example of an optimization and learning task, let's consider the problem of
-finding a motion $\mathbf{u}$ such that starting with a given initial state $d^{~0}$ at $t^0$,
-the time evolved scalar density at time $t^e$ has a certain shape or configuration $d^{\text{target}}$.
-Informally, we'd like to find a motion that deforms $d^{~0}$ into a target state.
+As a simple example of an inverse problem and learning task, let's consider the problem of
+finding a unknown motion $\mathbf{u}$: 
+this motion should transform a given initial scalar density state $d^{~0}$ at time $t^0$ 
+into state that's evolved by $\mathcal P$ to a later "end" time $t^e$ 
+with a certain shape or configuration $d^{\text{target}}$.
+Informally, we'd like to find a motion that deforms $d^{~0}$ through the PDE model into a target state.
 The simplest way to express this goal is via an $L^2$ loss between the two states. So we want
 to minimize the loss function $L=|d(t^e) - d^{\text{target}}|^2$. 

-Note that as described here this is a pure optimization task, there's no NN involved,
+Note that as described here this inverse problem is a pure optimization task: there's no NN involved,
 and our goal is to obtain $\mathbf{u}$. We do not want to apply this motion to other, unseen _test data_,
 as would be custom in a real learning task.

@@ -214,7 +217,7 @@ The final state of our marker density $d(t^e)$ is fully determined by the evolut
 of $\mathcal P$ via $\mathbf{u}$, which gives the following minimization problem:

 $$
-    \text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e - t^0 ) - d^{\text{target}}|^2
+    \text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2
 $$

 We'd now like to find the minimizer for this objective by
@@ -229,38 +232,58 @@ $\Delta \mathbf{u} = \partial L / \partial \mathbf{u}$,
 which can be decomposed into 
 $\Delta \mathbf{u} = 
 \frac{ \partial d }{ \partial \mathbf{u}}
-\frac{ \partial L }{ \partial d}
-$.
-And as the evolution of $d$ is given by our discretized physical model $P$,
-what we're acutally looking for is the Jacobian 
-$\partial \mathcal P / \partial \mathbf{u}$ to
-compute 
-$\Delta \mathbf{u} = 
- \frac{ \partial \mathcal P }{ \partial \mathbf{u}}
- \frac{ \partial L }{ \partial d}$. 
-We luckily don't need $\partial \mathcal P / \partial \mathbf{u}$ as a full
-matrix, but instead only mulitplied by the vector obtained from the derivative of our scalar 
-loss function $L$.
+\frac{ \partial L }{ \partial d} $.

+The $\frac{ \partial L }{ \partial d}$ component is typically simple enough: we'll get 
+
+$$ 
+\frac{ \partial L }{ \partial d} 
+    = \partial | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2 / \partial d 
+    = 2 (d(t^e)-d^{\text{target}}).
+$$
+
+If $d$ is represented as a vector, e.g., for one entry per cell of a mesh, 
+$\frac{ \partial L }{ \partial d}$ will likewise be a column vector of equivalent size.
+This is thanks to the fact that $L$ is always a scalar loss function, and hence the Jacobian
+matrix will have a dimension of 1 along the $L$ dimension.
+Intuitively, this vector will simply contain the differences between $d$ at the end time
+in comparison to the target densities $d^{\text{target}}$.
+
+The evolution of $d$ itself is given by our discretized physical model $\mathcal P$,
+and we use $\mathcal P$ and $d$ interchangeably.
+Hence, the more interesting component is the Jacobian 
+$\partial d / \partial \mathbf{u} = \partial \mathcal P / \partial \mathbf{u}$ to
+compute the full $\Delta \mathbf{u} = 
+ \frac{ \partial d }{ \partial \mathbf{u}}
+ \frac{ \partial L }{ \partial d}$. 
+We luckily don't need $\partial d / \partial \mathbf{u}$ as a full
+matrix, but instead only multiplied by $\frac{ \partial L }{ \partial d}$.
+
+So what is the actual Jacobian for $d$? To compute it we first need
+to finalize our PDE model $\mathcal P$, such that we get an expression which we can derive.
+In the next section we'll choose a specific advection scheme and a discretization
+so that we can be more specific.
+
+%the vector obtained from the derivative of our scalar loss function $L$.
 %the $L^2$ loss $L= |d(t^e) - d^{\text{target}}|^2$, thus

-So what are the actual Jacobians here?
-The one for $L$ is simple enough, we simply get a column vector with entries of the form
-$2(d(t^e)_i - d^{\text{target}})_i$ for one component $i$.
-$\partial \mathcal P / \partial \mathbf{u}$ is more interesting:
-here we'll get derivatives of the chosen advection operator w.r.t. each component of the 
-velocities.
+%So what are the actual Jacobians here? The one for $L$ is simple enough, we simply get a column vector with entries of the form $2(d_i(t^e) - d^{\text{target}}_i)$ for one component $i$.
+
+%$\partial \mathcal P / \partial \mathbf{u}$ is more interesting: here we'll get derivatives of the chosen advection operator w.r.t. each component of the velocities.

 %...to obtain an explicit update of the form $d(t+\Delta t) = A d(t)$, where the matrix $A$ represents the discretized advection step of size $\Delta t$ for $\mathbf{u}$. ... we'll get a matrix that essentially encodes linear interpolation coefficients for positions $\mathbf{x} + \Delta t \mathbf{u}$. For a grid of size $d_x \times d_y$ we'd have a 

+### Introducing a specific advection scheme

-E.g., for a simple [first order upwinding scheme](https://en.wikipedia.org/wiki/Upwind_scheme) 
-on a Cartesian grid in 1D, with marker density and velocity $d_i$ and $u_i$ for cell $i$
-(superscripts for time $t$ are omitted for brevity), 
-we get 
+In the following we'll make use of a simple [first order upwinding scheme](https://en.wikipedia.org/wiki/Upwind_scheme) 
+on a Cartesian grid in 1D, with marker density and velocity $d_i$ and $u_i$ for cell $i$.
+We omit the $(t)$ for quantities at time $t$ for brevity, i.e., $d_i(t)$ is written as $d_i$ below.
+From above, we'll use our _physical model_ that updates the marker density 
+$d_i(t+\Delta t) = \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t)$, which
+gives the following:

 $$ \begin{aligned}
-    & d_i^{~t+\Delta t} = d_i - u_i^+ (d_{i+1} - d_{i}) +  u_i^- (d_{i} - d_{i-1}) \text{ with }  \\
+    & d_i(t+\Delta t) = d_i - u_i^+ (d_{i+1} - d_{i}) +  u_i^- (d_{i} - d_{i-1}) \text{ with }  \\
    & u_i^+ = \text{max}(u_i \Delta t / \Delta x,0) \\
    & u_i^- = \text{min}(u_i \Delta t / \Delta x,0)
 \end{aligned} $$
@@ -274,19 +297,97 @@ name: advection-upwind
 ```

 Thus, for a positive $u_i$ we have 
-$d_i^{~t+\Delta t} = (1 + \frac{u_i \Delta t }{ \Delta x}) d_i - \frac{u_i \Delta t }{ \Delta x} d_{i+1}$
-and hence 
-$\partial \mathcal P / \partial u_i$ for cell $i$ would be $1 + \frac{u_i \Delta t }{ \Delta x}$.
-For the full gradient we'd need to add 
-the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities.

-In practice this step is similar to evaluating a transposed matrix multiplication.
+$$
+    \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = (1 + \frac{u_i \Delta t }{ \Delta x}) d_i - \frac{u_i \Delta t }{ \Delta x} d_{i+1}
+$$ (eq:advection)
+
+and hence $\partial \mathcal P / \partial u_i$ gives 
+$\frac{\Delta t }{ \Delta x} d_i - \frac{\Delta t }{ \Delta x} d_{i+1}$. Intuitively, 
+the change of the velocity $u_i$ depends on the spatial derivatives of the densities. 
+Due to the first order upwinding, we only include two neighbors (higher order methods would depend on
+additional entries of $d$)
+
+% velocity derivative , Just for completeness: another derivative we could compute here is
+
+In practice this step is equivalent to evaluating a transposed matrix multiplication.
 If we rewrite the calculation above as 
-$d^{~t+\Delta t} = A \mathbf{u}$, then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
+$ \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = A \mathbf{u}$, 
+then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
 However, in many practical cases, a matrix free implementation of this multiplication might 
 be preferable to actually constructing $A$.

-## A (slightly) more complex example
+% density derivative
+Another derivative that we can consider for the advection scheme is that w.r.t. the previous
+density state, i.e. $d_i(t)$, which is $d_i$ in the shortened notation.
+$\partial \mathcal P / \partial d_i$ for cell $i$ from above gives $1 + \frac{u_i \Delta t }{ \Delta x}$. However, for the full gradient we'd need to add the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities. This derivative will come into play in the next section.
+
+### Time evolution
+
+So far we've only dealt with a single update step of
+$d$ from time $t$ to $t+\Delta t$, but we could of course have an arbitrary number of such 
+steps. After all, above we stated the goal to advance the initial marker state $d(t^0)$ to
+the target state at time $t^e$, which could encompass a long interval of time.
+
+In the expression above for $d_i(t+\Delta t)$, each of the $d_i(t)$ in turn depend
+on the velocity and density states at time $t-\Delta t$, i.e., $d_i(t-\Delta t)$. Thus we have to trace back
+the influence of our loss $L$ all the way back to how $\mathbf{u}$ influences the initial marker
+state. This can involve a large number of evaluations of our advection scheme via $\mathcal P$.
+
+This sounds challenging at first:
+e.g., one could try to insert equation {eq}`eq:advection` at time $t-\Delta t$
+into equation {eq}`eq:advection` at $t$ and repeat this process recursively until
+we have a single expression relating $d^{~0}$ to the targets. However, thanks
+to the linear nature of the Jacobians, we can treat each advection step, i.e.,
+each invocation of our PDE $\mathcal P$ as a seperate, modular
+operation. And each of these invocations follows the procedure described 
+in the previous section.
+
+Hence, given the machinery above, the backtrace is fairly simple to realize: 
+for each of the advection steps
+in $\mathcal P$ we can compute a Jacobian product with the _incoming_ vector of derivatives
+from the loss $L$ or a previous advection step. We repeat this until we have traced the chain from the
+loss with $d^{\text{target}}$ all the way back to $d^{~0}$. 
+Theoretically, the velocity $\mathbf{u}$ could be a function of time, in which
+case we'd get a gradient $\Delta \mathbf{u}(t)$ for every time step $t$. To simplify things
+below, let's we assume we have field that is constant in time, i.e., we're
+reusing the same velocities $\mathbf{u}$ for every advection via $\mathcal P$. Now, each time step
+will give us a contribution to $\Delta \mathbf{u}$ which we can accumulate for all steps.
+
+$$ \begin{aligned}
+    \Delta \mathbf{u} =& 
+        \frac{ \partial d(t^e) }{ \partial \mathbf{u} }
+        \frac{ \partial L }{ \partial d(t^e) }
+        +
+        \frac{ \partial d(t^e - \Delta t) }{ \partial \mathbf{u}}
+        \frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) }
+        \frac{ \partial L }{ \partial d(t^e)}
+        + \\
+    & 
+        \cdots + \\
+    & 
+       \Big( \frac{ \partial d(t^0) }{ \partial \mathbf{u}} \cdots 
+        \frac{ \partial d(t^e - \Delta t) }{ \partial d(t^e - 2 \Delta t) }
+        \frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) }
+        \frac{ \partial L }{ \partial d(t^e)} \Big)
+\end{aligned} $$
+
+Here the last term above contains the full backtrace of the marker density to time $t^0$. 
+The terms of this sum look unwieldy 
+at first, but they contain a lot of similar Jacobians, and in practice can be computed efficiently
+by backtracing through the sequence of computational steps in the forward evaluation of our PDE.
+
+This structure also makes clear that the process is very similar to the regular training
+process of an NN: the evaluations of these Jacobian vector products is exactly what
+a deep learning framework does for training an NN (we just have weights $\theta$ instead
+of a velocity field there). And hence all we need to do in practice is to provide a custom 
+function the Jacobian vector product for $\mathcal P$.
+
+
+---
+
+
+## Implicit gradient calculations

 As a slightly more complex example let's consider Poisson's equation $\nabla^2 a = b$, where
 $a$ is the quantity of interest, and $b$ is given. 
@@ -305,8 +406,8 @@ If we now introduce an NN that modifies $\mathbf{u}$ in a solver, we inevitably
 backpropagate through the Poisson solve. I.e., we need a gradient for $\mathbf{u}^{n}$, which in this
 notation takes the form $\partial \mathbf{u}^{n} / \partial \mathbf{u}$.

-In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left(  (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients simple to compute. The main difficulty lies in obtaining the
-matrix inverse $(\nabla^2)^{-1}$ from the Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear). 
+In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left(  (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients are simple to compute. The main difficulty lies in obtaining the
+matrix inverse $(\nabla^2)^{-1}$ from Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear). 

 In practice, the matrix vector product for $(\nabla^2)^{-1} b$ with $b=\nabla \cdot \mathbf{u}$ is not explicitly computed via matrix operations, but approximated with a (potentially matrix-free) iterative solver. E.g., conjugate gradient (CG) methods are a very popular choice here. Thus, we could treat this iterative solver as a function $S$,
 with $p = S(\nabla \cdot \mathbf{u})$. Note that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $S$ into a sequence of simpler operations $S(x) = S_n( S_{n-1}(...S_{1}(x)))$, and backpropagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and can be very slow.