extended advection example, comments Maxi

2021-07-15 14:49:04 +02:00
parent 7614288ac5
commit 776b9ca2e3
1 changed files with 146 additions and 45 deletions
--- a/diffphys.md
+++ b/diffphys.md
@@ -34,7 +34,7 @@ provide directions in the form of gradients to steer the learning process.
 With the DP direction we build on existing numerical solvers. I.e., 
 the approach is strongly relying on the algorithms developed in the larger field 
 of computational methods for a vast range of physical effects in our world.
-To start with we need a continuous formulation as model for the physical effect that we'd like 
+To start with, we need a continuous formulation as a model for the physical effect that we'd like 
 to simulate -- if this is missing we're in trouble. But luckily, we can 
 tap into existing collections of model equations and established methods
 for discretizing continuous models.
@@ -45,9 +45,10 @@ with model parameters $\nu$ (e.g., diffusion, viscosity, or conductivity constan
 The components of $\mathbf{u}$ will be denoted by numbered subscripts, i.e.,
 $\mathbf{u} = (u_1,u_2,\dots,u_d)^T$.
 %and a corresponding discrete version that describes the evolution of this quantity over time: $\mathbf{u}_t = \mathcal P(\mathbf{x}, \mathbf{u}, t)$.
-Typically, we are interested in the temporal evolution of such a system, 
+Typically, we are interested in the temporal evolution of such a system.
-and discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
+Discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
-that we can re-arrange to compute a future state after a time step $\Delta t$ via sequence of
+that we can re-arrange to compute a future state after a time step $\Delta t$. 
 The state at $t+\Delta t$ is computed via a sequence of
 operations $\mathcal P_1, \mathcal P_2 \dots \mathcal P_m$ such that
 $\mathbf{u}(t+\Delta t) = \mathcal P_1 \circ \mathcal P_2 \circ \dots \mathcal P_m ( \mathbf{u}(t),\nu )$,
 where $\circ$ denotes function composition, i.e. $f(g(x)) = f \circ g(x)$.
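As a minimal sketch of this composition, the following Python snippet chains operators so that the rightmost one is applied first. The helper `compose` and the two toy operators `decay` and `source` are hypothetical and only serve as an illustration; they are not part of any solver discussed here.

```python
from functools import reduce

def compose(*ops):
    """Return P_1 ∘ P_2 ∘ ... ∘ P_m; the rightmost operator is applied first."""
    def step(u, nu):
        return reduce(lambda state, op: op(state, nu), reversed(ops), u)
    return step

# Hypothetical toy operators acting on a scalar state u with parameter nu.
def decay(u, nu):
    return u * (1.0 - nu)

def source(u, nu):
    return u + 1.0

advance = compose(decay, source)      # corresponds to decay(source(u))
u_next = advance(1.0, 0.5)
u_swapped = compose(source, decay)(1.0, 0.5)  # source(decay(u)): order matters
```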
@@ -151,9 +152,9 @@ E.g., as this process is very similar to adjoint method optimizations, we can re
 that were developed in this field, or leverage established numerical methods. E.g., 
 we could leverage the $O(n)$ runtime of multigrid solvers for matrix inversion.
-The flip-side of this approach is, that it requires some understanding of the problem at hand, 
+The flip-side of this approach is that it requires some understanding of the problem at hand, 
 and of the numerical methods. Also, a given solver might not provide gradient calculations out of the box.
-Thus, we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
+Thus, if we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
 idea to directly go for learning via a DP approach. However, if we don't really understand our model, we probably
 should go back to studying it a bit more anyway...
@@ -199,14 +200,16 @@ $$
    d(t+\Delta t) = \mathcal P ( ~ d(t), \mathbf{u}, t+\Delta t) 
 $$
-As a simple example of an optimization and learning task, let's consider the problem of
+As a simple example of an inverse problem and learning task, let's consider the problem of
-finding a motion $\mathbf{u}$ such that starting with a given initial state $d^{~0}$ at $t^0$,
+finding an unknown motion $\mathbf{u}$: 
-the time evolved scalar density at time $t^e$ has a certain shape or configuration $d^{\text{target}}$.
+this motion should transform a given initial scalar density state $d^{~0}$ at time $t^0$ 
-Informally, we'd like to find a motion that deforms $d^{~0}$ into a target state.
+into a state that's evolved by $\mathcal P$ to a later "end" time $t^e$ 
 with a certain shape or configuration $d^{\text{target}}$.
 Informally, we'd like to find a motion that deforms $d^{~0}$ through the PDE model into a target state.
 The simplest way to express this goal is via an $L^2$ loss between the two states. So we want
 to minimize the loss function $L=|d(t^e) - d^{\text{target}}|^2$. 
-Note that as described here this is a pure optimization task, there's no NN involved,
+Note that as described here this inverse problem is a pure optimization task: there's no NN involved,
 and our goal is to obtain $\mathbf{u}$. We do not want to apply this motion to other, unseen _test data_,
 as would be customary in a real learning task.
@@ -214,7 +217,7 @@ The final state of our marker density $d(t^e)$ is fully determined by the evolut
 of $\mathcal P$ via $\mathbf{u}$, which gives the following minimization problem:
 $$
-    \text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e - t^0 ) - d^{\text{target}}|^2
+    \text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2
 $$
 We'd now like to find the minimizer for this objective by
@@ -229,38 +232,58 @@ $\Delta \mathbf{u} = \partial L / \partial \mathbf{u}$,
 which can be decomposed into 
 $\Delta \mathbf{u} = 
 \frac{ \partial d }{ \partial \mathbf{u}}
-\frac{ \partial L }{ \partial d}
+\frac{ \partial L }{ \partial d} $.
 $.
 And as the evolution of $d$ is given by our discretized physical model $\mathcal P$,
 what we're actually looking for is the Jacobian 
 $\partial \mathcal P / \partial \mathbf{u}$ to
 compute 
 $\Delta \mathbf{u} = 
 \frac{ \partial \mathcal P }{ \partial \mathbf{u}}
 \frac{ \partial L }{ \partial d}$. 
 We luckily don't need $\partial \mathcal P / \partial \mathbf{u}$ as a full
 matrix, but instead only multiplied by the vector obtained from the derivative of our scalar 
 loss function $L$.
 The $\frac{ \partial L }{ \partial d}$ component is typically simple enough: we'll get 
 $$ 
 \frac{ \partial L }{ \partial d} 
    = \partial | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2 / \partial d 
    = 2 (d(t^e)-d^{\text{target}}).
 $$
 If $d$ is represented as a vector, e.g., for one entry per cell of a mesh, 
 $\frac{ \partial L }{ \partial d}$ will likewise be a column vector of equivalent size.
 This is thanks to the fact that $L$ is always a scalar loss function, and hence the Jacobian
 matrix will have a dimension of 1 along the $L$ dimension.
 Intuitively, this vector will simply contain the differences between $d$ at the end time
 in comparison to the target densities $d^{\text{target}}$.
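A small NumPy sketch of this loss derivative, assuming the density is stored as a 1D array with one entry per cell (the function names `l2_loss` and `l2_loss_grad` and the sample values are made up for illustration):

```python
import numpy as np

def l2_loss(d_end, d_target):
    """Scalar loss L = |d(t^e) - d_target|^2."""
    return np.sum((d_end - d_target) ** 2)

def l2_loss_grad(d_end, d_target):
    """dL/dd = 2 (d(t^e) - d_target), one entry per cell."""
    return 2.0 * (d_end - d_target)

# Illustrative values only.
d_end = np.array([0.0, 1.0, 0.5, 0.0])
d_target = np.array([0.0, 0.5, 1.0, 0.0])

loss = l2_loss(d_end, d_target)
grad = l2_loss_grad(d_end, d_target)  # same shape as d_end
```

The gradient has the same shape as the density itself, which is what allows us to pass it on as the incoming vector for the Jacobian product of the physical model.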
 The evolution of $d$ itself is given by our discretized physical model $\mathcal P$,
 and we use $\mathcal P$ and $d$ interchangeably.
 Hence, the more interesting component is the Jacobian 
 $\partial d / \partial \mathbf{u} = \partial \mathcal P / \partial \mathbf{u}$ to
 compute the full $\Delta \mathbf{u} = 
 \frac{ \partial d }{ \partial \mathbf{u}}
 \frac{ \partial L }{ \partial d}$. 
 We luckily don't need $\partial d / \partial \mathbf{u}$ as a full
 matrix, but instead only multiplied by $\frac{ \partial L }{ \partial d}$.
 So what is the actual Jacobian for $d$? To compute it we first need
 to finalize our PDE model $\mathcal P$, such that we get an expression which we can differentiate.
 In the next section we'll choose a specific advection scheme and a discretization
 so that we can be more specific.
 %the vector obtained from the derivative of our scalar loss function $L$.
 %the $L^2$ loss $L= |d(t^e) - d^{\text{target}}|^2$, thus
-So what are the actual Jacobians here?
+%So what are the actual Jacobians here? The one for $L$ is simple enough, we simply get a column vector with entries of the form $2(d_i(t^e) - d^{\text{target}}_i)$ for one component $i$.
-The one for $L$ is simple enough, we simply get a column vector with entries of the form
+
-$2(d(t^e)_i - d^{\text{target}})_i$ for one component $i$.
+%$\partial \mathcal P / \partial \mathbf{u}$ is more interesting: here we'll get derivatives of the chosen advection operator w.r.t. each component of the velocities.
 $\partial \mathcal P / \partial \mathbf{u}$ is more interesting:
 here we'll get derivatives of the chosen advection operator w.r.t. each component of the 
 velocities.
 %...to obtain an explicit update of the form $d(t+\Delta t) = A d(t)$, where the matrix $A$ represents the discretized advection step of size $\Delta t$ for $\mathbf{u}$. ... we'll get a matrix that essentially encodes linear interpolation coefficients for positions $\mathbf{x} + \Delta t \mathbf{u}$. For a grid of size $d_x \times d_y$ we'd have a 
 ### Introducing a specific advection scheme
-E.g., for a simple [first order upwinding scheme](https://en.wikipedia.org/wiki/Upwind_scheme) 
+In the following we'll make use of a simple [first order upwinding scheme](https://en.wikipedia.org/wiki/Upwind_scheme) 
-on a Cartesian grid in 1D, with marker density and velocity $d_i$ and $u_i$ for cell $i$
+on a Cartesian grid in 1D, with marker density and velocity $d_i$ and $u_i$ for cell $i$.
-(superscripts for time $t$ are omitted for brevity), 
+We omit the $(t)$ for quantities at time $t$ for brevity, i.e., $d_i(t)$ is written as $d_i$ below.
-we get 
+From above, we'll use our _physical model_ that updates the marker density 
 $d_i(t+\Delta t) = \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t)$, which
 gives the following:
 $$ \begin{aligned}
-    & d_i^{~t+\Delta t} = d_i - u_i^+ (d_{i+1} - d_{i}) +  u_i^- (d_{i} - d_{i-1}) \text{ with }  \\
+    & d_i(t+\Delta t) = d_i - u_i^+ (d_{i+1} - d_{i}) +  u_i^- (d_{i} - d_{i-1}) \text{ with }  \\
    & u_i^+ = \text{max}(u_i \Delta t / \Delta x,0) \\
    & u_i^- = \text{min}(u_i \Delta t / \Delta x,0)
 \end{aligned} $$
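The update can be sketched in a few lines of NumPy. This follows the formula exactly as written above; the periodic boundary treatment via `np.roll` and the function name `advect_upwind` are assumptions made for this illustration, as the text does not specify boundary conditions.

```python
import numpy as np

def advect_upwind(d, u, dt, dx):
    """One step d(t) -> d(t + dt) of the scheme above (periodic boundaries assumed)."""
    c = u * dt / dx
    u_plus = np.maximum(c, 0.0)   # u_i^+
    u_minus = np.minimum(c, 0.0)  # u_i^-
    d_next = np.roll(d, -1)       # d_{i+1}
    d_prev = np.roll(d, 1)        # d_{i-1}
    return d - u_plus * (d_next - d) + u_minus * (d - d_prev)

# Illustrative values only.
d = np.array([0.0, 1.0, 0.0, 0.0])
u = np.full(4, 0.5)
d_new = advect_upwind(d, u, dt=1.0, dx=1.0)
```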
@@ -274,19 +297,97 @@ name: advection-upwind
 ```
 Thus, for a positive $u_i$ we have 
 $d_i^{~t+\Delta t} = (1 + \frac{u_i \Delta t }{ \Delta x}) d_i - \frac{u_i \Delta t }{ \Delta x} d_{i+1}$
 and hence 
 $\partial \mathcal P / \partial u_i$ for cell $i$ would be $1 + \frac{u_i \Delta t }{ \Delta x}$.
 For the full gradient we'd need to add 
 the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities.
-In practice this step is similar to evaluating a transposed matrix multiplication.
+$$
    \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = (1 + \frac{u_i \Delta t }{ \Delta x}) d_i - \frac{u_i \Delta t }{ \Delta x} d_{i+1}
 $$ (eq:advection)
 and hence $\partial \mathcal P / \partial u_i$ gives 
 $\frac{\Delta t }{ \Delta x} d_i - \frac{\Delta t }{ \Delta x} d_{i+1}$. Intuitively, 
 the change of the velocity $u_i$ depends on the spatial derivatives of the densities. 
 Due to the first order upwinding, we only include two neighbors (higher order methods would depend on
 additional entries of $d$).
 % velocity derivative , Just for completeness: another derivative we could compute here is
 In practice this step is equivalent to evaluating a transposed matrix multiplication.
 If we rewrite the calculation above as 
-$d^{~t+\Delta t} = A \mathbf{u}$, then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
+$ \mathcal P ( d_i(t), \mathbf{u}(t), t + \Delta t) = A \mathbf{u}$, 
 then $\partial \mathcal P / \partial \mathbf{u} = A^T$.
 However, in many practical cases, a matrix free implementation of this multiplication might 
 be preferable to actually constructing $A$.
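A matrix-free version of this product could look as follows. This is only a sketch: it assumes periodic boundaries, it also covers the negative-velocity branch of the scheme (which gives $\frac{\Delta t}{\Delta x}(d_i - d_{i-1})$), and it uses a zero derivative at exactly $u_i = 0$, where the max/min are not differentiable. The names `advect_vjp_u` and `grad_out` are hypothetical.

```python
import numpy as np

def advect_upwind(d, u, dt, dx):
    c = u * dt / dx
    u_plus, u_minus = np.maximum(c, 0.0), np.minimum(c, 0.0)
    return d - u_plus * (np.roll(d, -1) - d) + u_minus * (d - np.roll(d, 1))

def advect_vjp_u(d, u, dt, dx, grad_out):
    """Compute (dP/du)^T grad_out without building the matrix.

    Each output cell i only depends on u_i, so the Jacobian is diagonal.
    """
    d_next, d_prev = np.roll(d, -1), np.roll(d, 1)
    diag = np.where(u > 0, d - d_next, 0.0) + np.where(u < 0, d - d_prev, 0.0)
    return (dt / dx) * diag * grad_out

# Illustrative values only; grad_out plays the role of dL/dd.
d = np.array([0.2, 1.0, 0.4, 0.1, 0.7])
u = np.array([0.5, -0.3, 0.8, -0.6, 0.2])
dt, dx = 0.1, 1.0
grad_out = np.array([1.0, -2.0, 0.5, 0.3, -1.0])
grad_u = advect_vjp_u(d, u, dt, dx, grad_out)
```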
-## A (slightly) more complex example
+% density derivative
 Another derivative that we can consider for the advection scheme is that w.r.t. the previous
 density state, i.e. $d_i(t)$, which is $d_i$ in the shortened notation.
 $\partial \mathcal P / \partial d_i$ for cell $i$ from above gives $1 + \frac{u_i \Delta t }{ \Delta x}$. However, for the full gradient we'd need to add the potential contributions from cells $i+1$ and $i-1$, depending on the sign of their velocities. This derivative will come into play in the next section.
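The neighbor contributions can be made explicit in a short sketch. Collecting terms in the scheme gives $(1 + \frac{u_i \Delta t}{\Delta x}) d_i - u_i^+ d_{i+1} - u_i^- d_{i-1}$ for cell $i$, so cell $j$ receives contributions from itself, from cell $j-1$ (if $u_{j-1} > 0$), and from cell $j+1$ (if $u_{j+1} < 0$). As before, periodic boundaries and the names `advect_vjp_d` and `grad_out` are assumptions of this illustration.

```python
import numpy as np

def advect_upwind(d, u, dt, dx):
    c = u * dt / dx
    u_plus, u_minus = np.maximum(c, 0.0), np.minimum(c, 0.0)
    return d - u_plus * (np.roll(d, -1) - d) + u_minus * (d - np.roll(d, 1))

def advect_vjp_d(u, dt, dx, grad_out):
    """Compute (dP/dd)^T grad_out without building the matrix."""
    c = u * dt / dx
    u_plus, u_minus = np.maximum(c, 0.0), np.minimum(c, 0.0)
    own = (1.0 + c) * grad_out                      # diagonal entries
    from_left = -np.roll(u_plus * grad_out, 1)      # cell j-1 reads d_j if u_{j-1} > 0
    from_right = -np.roll(u_minus * grad_out, -1)   # cell j+1 reads d_j if u_{j+1} < 0
    return own + from_left + from_right

# Illustrative values only.
d = np.array([0.2, 1.0, 0.4, 0.1])
u = np.array([0.5, -0.5, 0.25, -0.25])
dt, dx = 0.1, 1.0
grad_out = np.array([1.0, 2.0, -1.0, 0.5])
grad_d = advect_vjp_d(u, dt, dx, grad_out)
```

Since the scheme is linear in $d$, this product does not depend on the density values themselves, only on the velocities.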
 ### Time evolution
 So far we've only dealt with a single update step of
 $d$ from time $t$ to $t+\Delta t$, but we could of course have an arbitrary number of such 
 steps. After all, above we stated the goal to advance the initial marker state $d(t^0)$ to
 the target state at time $t^e$, which could encompass a long interval of time.
 In the expression above for $d_i(t+\Delta t)$, each of the $d_i(t)$ in turn depends
 on the velocity and density states at time $t-\Delta t$, i.e., $d_i(t-\Delta t)$. Thus we have to trace back
 the influence of our loss $L$ all the way back to how $\mathbf{u}$ influences the initial marker
 state. This can involve a large number of evaluations of our advection scheme via $\mathcal P$.
 This sounds challenging at first:
 e.g., one could try to insert equation {eq}`eq:advection` at time $t-\Delta t$
 into equation {eq}`eq:advection` at $t$ and repeat this process recursively until
 we have a single expression relating $d^{~0}$ to the targets. However, thanks
 to the linear nature of the Jacobians, we can treat each advection step, i.e.,
 each invocation of our PDE $\mathcal P$, as a separate, modular
 operation. And each of these invocations follows the procedure described 
 in the previous section.
 Hence, given the machinery above, the backtrace is fairly simple to realize: 
 for each of the advection steps
 in $\mathcal P$ we can compute a Jacobian product with the _incoming_ vector of derivatives
 from the loss $L$ or a previous advection step. We repeat this until we have traced the chain from the
 loss with $d^{\text{target}}$ all the way back to $d^{~0}$. 
 Theoretically, the velocity $\mathbf{u}$ could be a function of time, in which
 case we'd get a gradient $\Delta \mathbf{u}(t)$ for every time step $t$. To simplify things
 below, let's assume we have a field that is constant in time, i.e., we're
 reusing the same velocities $\mathbf{u}$ for every advection via $\mathcal P$. Now, each time step
 will give us a contribution to $\Delta \mathbf{u}$ which we can accumulate for all steps.
 $$ \begin{aligned}
    \Delta \mathbf{u} =& 
        \frac{ \partial d(t^e) }{ \partial \mathbf{u} }
        \frac{ \partial L }{ \partial d(t^e) }
        +
        \frac{ \partial d(t^e - \Delta t) }{ \partial \mathbf{u}}
        \frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) }
        \frac{ \partial L }{ \partial d(t^e)}
        + \\
    & 
        \cdots + \\
    & 
       \Big( \frac{ \partial d(t^0) }{ \partial \mathbf{u}} \cdots 
        \frac{ \partial d(t^e - \Delta t) }{ \partial d(t^e - 2 \Delta t) }
        \frac{ \partial d(t^e) }{ \partial d(t^e - \Delta t) }
        \frac{ \partial L }{ \partial d(t^e)} \Big)
 \end{aligned} $$
 Here the last term above contains the full backtrace of the marker density to time $t^0$. 
 The terms of this sum look unwieldy 
 at first, but they contain a lot of similar Jacobians, and in practice can be computed efficiently
 by backtracing through the sequence of computational steps in the forward evaluation of our PDE.
 This structure also makes clear that the process is very similar to the regular training
 process of an NN: the evaluation of these Jacobian vector products is exactly what
 a deep learning framework does for training an NN (we just have weights $\theta$ instead
 of a velocity field there). And hence all we need to do in practice is to provide a custom 
 function for the Jacobian vector product of $\mathcal P$.
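The whole backtrace for a time-constant velocity can be sketched as follows. This is a hedged illustration rather than a reference implementation: it assumes periodic boundaries, stores all intermediate densities from the forward pass, and uses hypothetical names (`step`, `vjp_u`, `vjp_d`, `loss_and_grad`) and made-up sample values.

```python
import numpy as np

def step(d, u, dt, dx):
    """One upwind advection step (periodic boundaries assumed)."""
    c = u * dt / dx
    u_plus, u_minus = np.maximum(c, 0.0), np.minimum(c, 0.0)
    return d - u_plus * (np.roll(d, -1) - d) + u_minus * (d - np.roll(d, 1))

def vjp_u(d, u, dt, dx, g):
    """(dP/du)^T g: contribution of one step to the velocity gradient."""
    diag = np.where(u > 0, d - np.roll(d, -1), 0.0) + np.where(u < 0, d - np.roll(d, 1), 0.0)
    return (dt / dx) * diag * g

def vjp_d(u, dt, dx, g):
    """(dP/dd)^T g: passes the incoming gradient back to the previous state."""
    c = u * dt / dx
    u_plus, u_minus = np.maximum(c, 0.0), np.minimum(c, 0.0)
    return (1.0 + c) * g - np.roll(u_plus * g, 1) - np.roll(u_minus * g, -1)

def loss_and_grad(d0, u, d_target, n_steps, dt, dx):
    # Forward pass: keep every intermediate density for the backward pass.
    states = [d0]
    for _ in range(n_steps):
        states.append(step(states[-1], u, dt, dx))
    loss = np.sum((states[-1] - d_target) ** 2)

    # Backward pass: start from dL/dd(t^e) and walk back to t^0.
    g = 2.0 * (states[-1] - d_target)
    grad_u = np.zeros_like(u)
    for d in reversed(states[:-1]):
        grad_u += vjp_u(d, u, dt, dx, g)  # accumulate this step's contribution
        g = vjp_d(u, dt, dx, g)           # gradient w.r.t. the previous density
    return loss, grad_u

# Illustrative setup only.
d0 = np.array([0.0, 0.2, 1.0, 0.2, 0.0, 0.0])
d_target = np.array([0.0, 0.0, 0.2, 1.0, 0.2, 0.0])
u = np.array([0.5, -0.4, 0.6, -0.3, 0.7, -0.2])
dt, dx = 0.1, 1.0
loss, grad_u = loss_and_grad(d0, u, d_target, n_steps=5, dt=dt, dx=dx)
```

Each loop iteration of the backward pass corresponds to one additional term of the sum above: the vector `g` carries the product of the density Jacobians accumulated so far, and only Jacobian vector products are ever evaluated.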
 ---
 ## Implicit gradient calculations
 As a slightly more complex example let's consider Poisson's equation $\nabla^2 a = b$, where
 $a$ is the quantity of interest, and $b$ is given. 
@@ -305,8 +406,8 @@ If we now introduce an NN that modifies $\mathbf{u}$ in a solver, we inevitably
 backpropagate through the Poisson solve. I.e., we need a gradient for $\mathbf{u}^{n}$, which in this
 notation takes the form $\partial \mathbf{u}^{n} / \partial \mathbf{u}$.
-In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left(  (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients simple to compute. The main difficulty lies in obtaining the
+In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left(  (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients are simple to compute. The main difficulty lies in obtaining the
-matrix inverse $(\nabla^2)^{-1}$ from the Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear). 
+matrix inverse $(\nabla^2)^{-1}$ from Poisson's equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear). 
 In practice, the matrix vector product for $(\nabla^2)^{-1} b$ with $b=\nabla \cdot \mathbf{u}$ is not explicitly computed via matrix operations, but approximated with a (potentially matrix-free) iterative solver. E.g., conjugate gradient (CG) methods are a very popular choice here. Thus, we could treat this iterative solver as a function $S$,
 with $p = S(\nabla \cdot \mathbf{u})$. Note that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $S$ into a sequence of simpler operations $S(x) = S_n( S_{n-1}(...S_{1}(x)))$, and backpropagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and can be very slow.