Starting diffphys chapter

This commit is contained in:
NT 2021-01-15 16:13:41 +08:00
parent 0063c71c05
commit cbf20ff8fe
6 changed files with 273 additions and 13 deletions

View File

@ -6,7 +6,7 @@
"id": "o4JZ84moBKMr"
},
"source": [
"# Differentiable Fluid Simulations with Φ<sub>Flow</sub>\n",
"# Differentiable Fluid Simulations\n",
"\n",
"... now a more complex example with fluid simulations (Navier-Stokes) ...\n"
]

View File

@ -1,6 +1,243 @@
Differentiable Simulations
Differentiable Physics
=======================
... are much more powerful ...
%... are much more powerful ...
As a next step towards a tighter and more generic combination of deep learning
methods and physical simulations we will target incorporating _differentiable physical
simulations_ into the learning process. In the following, we'll shorten
that to "differentiable physics" (DP).
The central goal of these methods is to use existing numerical solvers, and equip
them with functionality to compute gradients with respect to their inputs.
Once this is realized for all operators of a simulation, we can leverage
the autodiff functionality of DL frameworks with back-propagation to let gradient
information flow from a simulator into an NN and vice versa. This has numerous
advantages such as improved learning feedback and generalization, as we'll outline below.
In contrast to physics-informed loss functions, it also enables handling more complex
solution manifolds instead of single inverse problems.
## Differentiable Operators
With this direction we build on existing numerical solvers, i.e.,
the approach relies strongly on the algorithms developed in the larger field
of computational methods for a vast range of physical effects in our world.
To start with, we need a continuous formulation as a model for the physical effect that we'd like
to simulate -- if this is missing we're in trouble. But luckily, we can resort to
a huge library of established model equations, and ideally also to an established
method for the discretization of the equation.
Let's assume we have a continuous formulation $\mathcal P^*(\mathbf{x}, \nu)$ of the physical quantity of
interest $\mathbf{u}(\mathbf{x}, t): \mathbb R^d \times \mathbb R^+ \rightarrow \mathbb R^d$,
with a model parameter $\nu$ (e.g., a diffusion or viscosity constant).
The components of $\mathbf{u}$ will be denoted by a numbered subscript, i.e.,
$\mathbf{u} = (u_1,u_2,\dots,u_d)^T$.
%and a corresponding discrete version that describes the evolution of this quantity over time: $\mathbf{u}_t = \mathcal P(\mathbf{x}, \mathbf{u}, t)$.
Typically, we are interested in the temporal evolution of such a system,
and discretization yields a formulation $\mathcal P(\mathbf{x}, \nu)$
that we can re-arrange to compute a future state after a time step $\Delta t$ via a sequence of
operations $\mathcal P_1, \mathcal P_2 \dots \mathcal P_m$ such that
$\mathbf{u}(t+\Delta t) = \mathcal P_1 \circ \mathcal P_2 \circ \dots \circ \mathcal P_m ( \mathbf{u}(t),\nu )$,
where $\circ$ denotes function composition, i.e. $f \circ g (x) = f(g(x))$.
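To make this more concrete, such a step could look as follows in Python -- a minimal sketch, where `advect` and `diffuse` are hypothetical stand-ins for $\mathcal P_1$ and $\mathcal P_2$ on a 1D periodic grid:
```python
import numpy as np

def diffuse(u, nu, dt, dx):
    # P_2: explicit diffusion step with second-order central differences
    lap = (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2
    return u + dt * nu * lap

def advect(u, vel, dt, dx):
    # P_1: first-order upwind advection (assumes vel >= 0)
    return u - dt * vel * (u - np.roll(u, 1)) / dx

def step(u, vel, nu, dt, dx):
    # u(t + dt) = P_1 ( P_2 ( u(t), nu ) )
    return advect(diffuse(u, nu, dt, dx), vel, dt, dx)
```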
```{note}
In order to integrate this solver into a DL process, we need to ensure that every operator
$\mathcal P_i$ provides a gradient w.r.t. its inputs, i.e. in the example above
$\partial \mathcal P_i / \partial \mathbf{u}$.
```
Note that we typically don't need derivatives
for all parameters of $\mathcal P$, e.g. we omit $\nu$ in the following, assuming that this is a
given model parameter with which the NN should not interact. Naturally, it can vary,
but $\nu$ will not be the output of an NN representation. If this is the case, we can omit
providing $\partial \mathcal P_i / \partial \nu$ in our solver.
## Jacobians
As $\mathbf{u}$ is typically a vector-valued function, $\partial \mathcal P_i / \partial \mathbf{u}$ denotes
a Jacobian matrix $J$ rather than a single value:
$$ \begin{aligned}
\frac{ \partial \mathcal P_i }{ \partial \mathbf{u} } =
\begin{bmatrix}
\partial \mathcal P_{i,1} / \partial u_{1}
& \ \cdots \ &
\partial \mathcal P_{i,1} / \partial u_{d}
\\
\vdots & \ & \
\\
\partial \mathcal P_{i,d} / \partial u_{1}
& \ \cdots \ &
\partial \mathcal P_{i,d} / \partial u_{d}
\end{bmatrix}
\end{aligned} $$
where, as above, $d$ denotes the number of components in $\mathbf{u}$. As $\mathcal P$ maps one value of
$\mathbf{u}$ to another, the Jacobian is square here. Of course this isn't necessarily the case
for general model equations, but non-square Jacobian matrices would not cause any problems for differentiable
simulations.
In practice, we can rely on the _reverse mode_ differentiation that all modern DL
frameworks provide, and focus on computing a matrix-vector product of the Jacobian transpose
with a vector $\mathbf{a}$, i.e. the expression:
$
( \frac{\partial \mathcal P_i }{ \partial \mathbf{u} } )^T \mathbf{a}
$.
If we were to construct and store all full Jacobian matrices that we encounter during training,
this would cause huge memory overheads and unnecessarily slow down training.
Instead, for backpropagation, we can provide faster operations that compute products
with the Jacobian transpose, because we always have a scalar loss function at the end of the chain.
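Modern frameworks expose exactly this operation. As an illustration, here's a minimal sketch with JAX's `jax.vjp` (the operator `p_i` is a made-up stand-in for a solver operation):
```python
import jax
import jax.numpy as jnp

def p_i(u):
    # stand-in for a differentiable solver operator P_i
    return jnp.sin(u) * u

u = jnp.ones(1000)
a = jnp.ones(1000)           # the vector coming from the loss side
_, vjp_fn = jax.vjp(p_i, u)  # records the forward evaluation at u
(jt_a,) = vjp_fn(a)          # evaluates (dP_i/du)^T a without forming the matrix
```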
[TODO check transpose of Jacobians in equations]
Given the formulation above, we need to resolve the derivatives
of the chain of function compositions of the $\mathcal P_i$ at some current state $\mathbf{u}^n$ via the chain rule.
E.g., for two of them
$
\frac{ \partial (\mathcal P_1 \circ \mathcal P_2) }{ \partial \mathbf{u} }|_{\mathbf{u}^n}
=
\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} }|_{\mathcal P_2(\mathbf{u}^n)}
\
\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} }|_{\mathbf{u}^n}
$,
which is just the vector-valued version of the "classic" chain rule
$f(g(x))' = f'(g(x)) g'(x)$, and directly extends to larger numbers of composed functions, i.e. $i>2$.
Here, the derivatives for $\mathcal P_1$ and $\mathcal P_2$ are still Jacobian matrices, but knowing that
at the "end" of the chain we have our scalar loss (cf. {doc}`overview`), the derivative of the loss itself is a
matrix with only a single row, i.e. effectively a vector. During reverse mode, we start with this vector, and compute
the multiplications with the Jacobians, $\frac{ \partial \mathcal P_1 }{ \partial \mathbf{u} }$ and then
$\frac{ \partial \mathcal P_2 }{ \partial \mathbf{u} }$ above, one by one.
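The benefit of this ordering is easy to see with plain matrices: starting from the loss-gradient vector only ever requires vector-matrix products, whereas composing the Jacobians first would need a full matrix-matrix product. A small numpy illustration with random stand-in Jacobians:
```python
import numpy as np

d = 1000
J1 = np.random.rand(d, d)   # Jacobian of P_1
J2 = np.random.rand(d, d)   # Jacobian of P_2
g = np.random.rand(d)       # gradient of the scalar loss, effectively a single row

grad = (g @ J1) @ J2        # reverse mode: two O(d^2) vector-matrix products
grad_naive = g @ (J1 @ J2)  # same result via an O(d^3) matrix-matrix product
assert np.allclose(grad, grad_naive)
```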
For the details of forward and reverse mode differentiation, please check out external materials such
as this [nice survey by Baydin et al.](https://arxiv.org/pdf/1502.05767.pdf).
## Learning via DP Operators
Thus, long story short: once the operators of our simulator support computations of the Jacobian-vector
products, we can integrate them into DL pipelines just like we would include a regular fully-connected layer
or a ReLU activation.
At this point, the following (very valid) question often comes up: "_Most physics solvers can be broken down into a
sequence of vector and matrix operations. All state-of-the-art DL frameworks support these, so why don't we just
use these operators to realize our physics solver?_"
It's true that this would theoretically be possible. The problem here is that each of the vector and matrix
operations in TensorFlow and PyTorch is computed individually, and internally needs to store the current
state of the forward evaluation for backpropagation (the "$g(x)$" above). For a typical
simulation, however, we're not overly interested in every single intermediate result our solver produces.
Typically, we're more concerned with significant updates such as the step from $\mathbf{u}(t)$ to $\mathbf{u}(t+\Delta t)$.
%provide discretized simulator of physical phenomenon as differentiable operator.
Thus, in practice it is a very good idea to break down the solving process into a sequence
of meaningful but _monolithic_ operators. This not only saves a lot of work by preventing the calculation
of unnecessary intermediate results, it also allows us to choose the best possible numerical methods
to compute the updates (and derivatives) for these operators.
%in practice break down into larger, monolithic components
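In code, such a monolithic operator typically takes the form of a custom op with a hand-written backward pass. Below is a minimal sketch with PyTorch's `autograd.Function`; the "solver step" is deliberately trivial (a periodic shift), so that its transposed Jacobian is simply the inverse shift:
```python
import torch

class ShiftAdvect(torch.autograd.Function):
    """Monolithic solver step: advect a field by one cell on a periodic grid."""

    @staticmethod
    def forward(ctx, u):
        # a linear operation: no intermediate results need to be stored
        return torch.roll(u, shifts=1, dims=0)

    @staticmethod
    def backward(ctx, grad_out):
        # multiply by the transposed Jacobian: for a pure shift, shift back
        return torch.roll(grad_out, shifts=-1, dims=0)

u = torch.rand(64, requires_grad=True)
loss = ShiftAdvect.apply(u).square().sum()
loss.backward()  # invokes our hand-written backward pass
```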
As this process is very similar to adjoint method optimizations, we can re-use many of the techniques
that were developed in this field, or leverage established numerical methods. E.g.,
we could make use of the $O(n)$ complexity of multigrid solvers for matrix inversion.
The flipside of this approach is that it requires some understanding of the problem at hand,
and of the numerical methods. Also, a given solver might not provide gradient calculations out of the box.
Thus, if we want to employ DL for model equations that we don't have a proper grasp of, it might not be a good
idea to directly go for learning via a DP approach. However, if we don't really understand our model, we probably
should go back to studying it a bit more anyway...
Also, in practice we can be _greedy_ with the derivative operators, and only
provide those which are relevant for the learning task. E.g., if our network
never produces the parameter $\nu$ in the example above, and it doesn't appear in our
loss formulation, we will never encounter a $\partial/\partial \nu$ derivative
in our backpropagation step.
---
## A practical example
As a simple example let's consider the advection of a passive scalar density $d(\mathbf{x},t)$ in
a velocity field $\mathbf{u}$ as physical model $\mathcal P^*$:
$$
\frac{\partial d}{\partial{t}} + \mathbf{u} \cdot \nabla d = 0
$$
Instead of using this formulation as a residual equation right away (as in {doc}`physicalloss`),
we can discretize it with our favorite mesh and discretization scheme,
to obtain a formulation that updates the state of our system over time. This is a standard
procedure for a _forward_ solve.
Note that to simplify things, we assume that $\mathbf{u}$ is only a function of space,
i.e. constant over time. We'll bring back the time evolution of $\mathbf{u}$ later on.
%
[TODO, write out simple finite diff approx?]
%
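For instance, a simple first-order upwind discretization in 1D could serve here (a sketch; $d_i^n$ denotes the density in cell $i$ at time step $n$, with grid spacing $\Delta x$ and assuming $u_i > 0$):
$$
d_i^{n+1} = d_i^n - \Delta t \ u_i \frac{ d_i^n - d_{i-1}^n }{ \Delta x }
$$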
Let's denote this re-formulation as $\mathcal P$ that maps a state of $d(t)$ into a
new state at an evolved time, i.e.:
$$
d(t+\Delta t) = \mathcal P ( d(t), \mathbf{u}, t+\Delta t)
$$
As a simple optimization and learning task, let's consider the problem of
finding an unknown motion $\mathbf{u}$ such that starting with a given initial state $d^{~0}$ at $t=t^0$,
the time evolved scalar density at time $t=t^e$ has a certain shape or configuration $d^{\text{target}}$.
Informally, we'd like to find a motion that deforms $d^{~0}$ into a target state.
The simplest way to express this goal is via an $L^2$ loss between the two states, hence we want
to minimize $|d(t^e) - d^{\text{target}}|^2$.
Note that, as described here, this is a pure optimization task: there's no NN involved,
and our goal is to obtain $\mathbf{u}$. We do not want to apply this motion to other, unseen _test data_,
as would be customary in a real learning task.
The final state of our marker density $d(t^e)$ is fully determined by the evolution
of $\mathcal P$ via $\mathbf{u}$, which gives the following minimization as overall goal:
$$
\text{arg min}_{~\mathbf{u}} | \mathcal P ( d^{~0}, \mathbf{u}, t^e) - d^{\text{target}}|^2
$$
We'd now like to find the minimizer for this objective by
gradient descent (as the most straight-forward starting point), where the
gradient is determined by the differentiable physics approach described above.
As the velocity field $\mathbf{u}$ contains our degrees of freedom,
what we're looking for is the Jacobian
$\partial \mathcal P / \partial \mathbf{u}$, which we luckily don't need as a full
matrix, but instead only multiplied by the vector obtained from the derivative of the
$L^2$ loss $L= |d(t^e) - d^{\text{target}}|^2$, i.e.
$\partial L / \partial d = 2 ( d(t^e) - d^{\text{target}} )$.
The forward solve with $\mathcal P$ gives us $d(t^e)$; backpropagation then multiplies this vector
with the Jacobian to obtain $\partial L / \partial \mathbf{u}$, which we use to update $\mathbf{u}$.
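Putting everything together, the optimization is a plain gradient descent loop. The following sketch uses JAX with a toy 1D upwind `advect_solver` standing in for $\mathcal P$ (purely illustrative; the upwind step assumes non-negative velocities):
```python
import jax
import jax.numpy as jnp

def advect_solver(d, u, dt=0.1, dx=1.0, steps=20):
    # toy differentiable P: repeated 1D upwind advection, periodic domain
    for _ in range(steps):
        d = d - dt * u * (d - jnp.roll(d, 1)) / dx
    return d

def loss(u, d0, d_target):
    return jnp.sum((advect_solver(d0, u) - d_target) ** 2)

d0 = jnp.exp(-jnp.linspace(-3.0, 3.0, 64) ** 2)  # initial blob d^0
d_target = jnp.roll(d0, 10)                      # target configuration d^target
u = jnp.zeros(64)                                # unknown velocities, our DOFs

grad_fn = jax.grad(loss)                         # dL/du via backpropagation
for _ in range(100):
    u = u - 0.5 * grad_fn(u, d0, d_target)       # gradient descent update
```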
Writing out the discretization for our simple advection setup, we obtain an explicit update of the form
$d(t+\Delta t) = A d(t)$,
where the matrix $A$ represents the discretized advection step of size $\Delta t$ for $\mathbf{u}$.
E.g., for a simple first order upwinding scheme on a Cartesian grid we'll get a matrix that essentially encodes
linear interpolation coefficients for positions $\mathbf{x} + \Delta t \mathbf{u}$. For a grid of
size $d_x \times d_y$ we'd have a sparse matrix with $d_x \cdot d_y$ rows and columns, with only a few
non-zero interpolation weights per row.
To summarize the important aspects of this example, and of more complex solvers:
- In the simplest case, the advection step of the passive scalar turns into a matrix $A$, and backpropagation only needs multiplications by its transpose $A^T$. This can likewise be realized in a matrix-free manner.
- More complex operators involve a matrix inversion, e.g., a Poisson solve for a pressure field. Here we don't want to backpropagate through all the steps of an iterative solver such as CG (although this is available in phiflow); rather, we can re-use the linear solver itself to compute the multiplication by the inverse matrix in the backward pass.
- Note that time can also be "virtual" here, e.g., when solving for a steady state.
- The only assumption is some iterative procedure of this kind, not just a single explicit step -- then things simplify.
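The linear-solve case can be made concrete: if a step computes $x = A^{-1} b$, the backward pass needs $(A^{-1})^T$ applied to the incoming gradient, i.e., one additional solve instead of differentiating through every CG iteration. A sketch with `scipy` (assuming a symmetric positive definite $A$):
```python
import numpy as np
from scipy.sparse.linalg import cg

def solve(A, b):
    # forward pass: x = A^{-1} b via conjugate gradients
    x, _ = cg(A, b)
    return x

def solve_backward(A, grad_x):
    # backward pass: dL/db = A^{-T} (dL/dx) -- just one more solve,
    # re-using the same solver instead of unrolling its iterations
    grad_b, _ = cg(A.T, grad_x)
    return grad_b

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # small SPD example matrix
x = solve(A, np.array([1.0, 2.0]))
grad_b = solve_backward(A, np.ones(2))  # pull a gradient back through the solve
```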
## Summary of Differentiable Physics
This gives us a method to include physical equations into DL learning in the form of differentiable
operators, rather than as a soft constraint in the loss. Typically, this setup is suitable for _inverse_ problems,
where we have certain measurements or observations for which we wish to find a solution of a model PDE.
Because of the high expense of the reconstruction (to be demonstrated in the following), the solution manifold
typically shouldn't be overly complex. E.g., it is difficult to capture a wide range of solutions,
such as the previous supervised airfoil example, in this way.
```{figure} resources/placeholder.png
---
height: 220px
name: dp-training
---
TODO, visual overview of DP training
```

View File

@ -144,9 +144,20 @@ Very brief intro, basic equations... approximate $f^*(x)=y$ with NN $f(x;\theta)
learn via GD, $\partial f / \partial \theta$
general goal, minimize E for e(x,y) ... cf. eq. 8.1 from DLbook
$$
\text{arg min}_{\theta} \ \mathbb{E}_{(x,y)} \big[ \, e( f(x ; \theta), y ) \, \big]
$$
introduce scalar loss, always(!) scalar...
Read chapters 6 to 9 of the [Deep Learning book](https://www.deeplearningbook.org),
especially about [MLPs](https://www.deeplearningbook.org/contents/mlp.html) and
"Conv-Nets", i.e. [CNNs](https://www.deeplearningbook.org/contents/convnets.html).
**Note:** The classic distinction between _classification_ and _regression_ problems is not so important here,
as we only deal with _regression_ problems in the following.
maximum likelihood estimation

View File

@ -16,6 +16,17 @@ And while the setup is relatively simple, it is generally difficult to control.
has flexibility to refine the solution by itself, but at the same time, tricks are necessary
when it doesn't pick the right regions of the solution.
## Is it "Machine Learning"?
TODO, discuss - more akin to classical optimization:
we test for space/time positions at training time, and are interested in the
solution there afterwards.
Hence, there is no real generalization, and no test data with a different distribution.
This is more similar to an inverse problem that solves for a single state, e.g., via BFGS or Newton's method.
## Summary
In general, a fundamental drawback of this approach is that it does not combine well with traditional
numerical techniques. E.g., the learned representation is not suitable to be refined with
a classical iterative solver such as the conjugate gradient method. This means many

View File

@ -9,24 +9,25 @@ as an "external" tool to produce a big pile of data 😢.
## Using Physical Models
We can improve this setting by trying to bring the model equations (or parts thereof)
into the training process. E.g., given a PDE for $\mathbf{u}(x,t)$ with a time evolution,
into the training process. E.g., given a PDE for $\mathbf{u}(\mathbf{x},t)$ with a time evolution,
we can typically express it in terms of a function $\mathcal F$ of the derivatives
of $\mathbf{u}$ via
$
\mathbf{u}_t = \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{x..x})
\mathbf{u}_t = \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} )
$,
where the $_{x}$ subscripts denote spatial derivatives of higher order.
where the $_{\mathbf{x}}$ subscripts denote spatial derivatives with respect to one of the spatial dimensions
of higher and higher order (this can of course also include derivatives with respect to different axes).
In this context we can employ DL by approxmating the unknown $\mathbf{u}$ itself
In this context we can employ DL by approximating the unknown $\mathbf{u}$ itself
with a NN, denoted by $\tilde{\mathbf{u}}$. If the approximation is accurate, the PDE
naturally should be satisfied, i.e., the residual $R$ should be equal to zero:
$
R = \mathbf{u}_t - \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{x..x}) = 0
R = \mathbf{u}_t - \mathcal F ( \mathbf{u}_{x}, \mathbf{u}_{xx}, ... \mathbf{u}_{xx...x} ) = 0
$
This nicely integrates with the objective for training a neural network: similar to before
we can collect sample solutions
$[x_0,y_0], ...[x_n,y_n]$ for $\mathbf{u}$ with $\mathbf{u}(x)=y$.
$[x_0,y_0], ...[x_n,y_n]$ for $\mathbf{u}$ with $\mathbf{u}(\mathbf{x})=y$.
This is typically important, as most practical PDEs we encounter do not have unique solutions
unless initial and boundary conditions are specified. Hence, if we only consider $R$ we might
get solutions with random offset or other undesirable components. Thus the supervised sample points
@ -35,7 +36,7 @@ Now our training objective becomes
$\text{arg min}_{\theta} \ \alpha_0 \sum_i (f(x_i ; \theta)-y_i)^2 + \alpha_1 R(x_i) $,
where $\alpha_{0,1}$ denote hyper parameters that scale the contribution of the supervised term and
where $\alpha_{0,1}$ denote hyperparameters that scale the contribution of the supervised term and
the residual term, respectively. We could of course add additional residual terms with suitable scaling factors here.
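As a sketch of how this objective can look in code, assuming Burgers' equation $u_t + u u_x - \nu u_{xx} = 0$ as a concrete stand-in for the generic $\mathcal F$, and a hypothetical network `u_net(x, t)` (inputs must be created with `requires_grad=True`):
```python
import torch

def d(y, x):
    # derivative dy/dx, keeping the graph for higher-order derivatives
    return torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]

def residual(u_net, x, t, nu=0.01):
    # R = u_t + u u_x - nu u_xx for Burgers' equation
    u = u_net(x, t)
    u_x = d(u, x)
    return d(u, t) + u * u_x - nu * d(u_x, x)

def loss(u_net, x, t, y, a0=1.0, a1=1.0):
    # alpha_0 * supervised term + alpha_1 * squared residual term
    supervised = ((u_net(x, t) - y) ** 2).mean()
    return a0 * supervised + a1 * (residual(u_net, x, t) ** 2).mean()
```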
Note that, similar to the data samples used for supervised training, we have no guarantees that the
@ -56,8 +57,8 @@ uses fully connected NNs to represent $\mathbf{u}$. This has some interesting pr
Due to the popularity of this version, we'll also focus on it in the following code examples and comparisons.
The central idea here is that the aforementioned general function $f$ that we're after in our learning problems
can be seen as a representation of a physical field we're after. Thus, the $\mathbf{u}(x)$ will
be turned into $\mathbf{u}(x, \theta)$ where we choose $\theta$ such that the solution to $\mathbf{u}$ is
can be seen as a representation of a physical field. Thus, the $\mathbf{u}(\mathbf{x})$ will
be turned into $\mathbf{u}(\mathbf{x}, \theta)$ where we choose $\theta$ such that the solution to $\mathbf{u}$ is
represented as precisely as possible.
One nice side effect of this viewpoint is that NN representations inherently support the calculation of derivatives.

View File

@ -22,7 +22,7 @@ by adjusting weights $\theta$ of our representation with $f$ such that
$\text{arg min}_{\theta} \sum_i (f(x_i ; \theta)-y_i)^2$.
This will give us $\theta$ such that $f(x;\theta) \approx y$ as accurately as possible given
our choice of $f$ and the hyper parameters for training. Note that above we've assumed
our choice of $f$ and the hyperparameters for training. Note that above we've assumed
the simplest case of an $L^2$ loss. A more general version would use an error metric $e(x,y)$
to be minimized via $\text{arg min}_{\theta} \sum_i e( f(x_i ; \theta) , y_i )$. The choice
of a suitable metric is a topic we will get back to later on.