unified backprop spelling
commit 5bf263a60a
parent 656457ff0f

@@ -574,7 +574,7 @@
 "source": [
 "Additionally, we can define several global variables to control the training and the simulation.\n",
 "\n",
-"The most important and interesting one is `msteps`. It defines the number of simulation steps that are unrolled at each training iteration. This directly influences the runtime of each training step, as we first have to simulate all steps forward, and then back-propagate the gradient through all `msteps` simulation steps interleaved with the NN evaluations. However, this is where we'll receive important feedback, in terms of gradients, on how the inferred corrections actually influence a running simulation. Hence, larger `msteps` are typically better.\n",
+"The most important and interesting one is `msteps`. It defines the number of simulation steps that are unrolled at each training iteration. This directly influences the runtime of each training step, as we first have to simulate all steps forward, and then backpropagate the gradient through all `msteps` simulation steps interleaved with the NN evaluations. However, this is where we'll receive important feedback, in terms of gradients, on how the inferred corrections actually influence a running simulation. Hence, larger `msteps` are typically better.\n",
 "\n",
 "In addition we define the `source` and `reference` simulations below (note, the reference is just a placeholder for data; only the `source` simulation is actually executed). We also define the actual NN `network`. All three are initialized with the size given in the data set (`dataset.resolution`)."
 ]
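To make the unrolled training step concrete, here is a minimal, self-contained sketch of the pattern described above. It uses PyTorch purely for brevity, and `simulation_step`, the tiny `network`, and the loss are hypothetical placeholders rather than the notebook's actual phiflow/TF code:

```python
import torch

network = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh(), torch.nn.Linear(64, 64))
optimizer = torch.optim.Adam(network.parameters(), lr=1e-4)
msteps = 4  # number of solver steps unrolled per training iteration

def simulation_step(u):
    # placeholder for one differentiable solver step
    return 0.9 * u

def training_step(u_source, u_reference):
    u = u_source
    for _ in range(msteps):
        u = simulation_step(u)   # advance the source simulation
        u = u + network(u)       # learned correction, kept in the autodiff graph
    loss = torch.nn.functional.mse_loss(u, u_reference)
    optimizer.zero_grad()
    loss.backward()              # backpropagates through all msteps solver/NN steps
    optimizer.step()
    return float(loss)

print(training_step(torch.randn(16, 64), torch.randn(16, 64)))
```

The point is only the structure: the loss sees the state after `msteps` interleaved solver and network calls, so the gradient reaching the network reflects how its corrections play out over a running simulation.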
@@ -111,7 +111,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As the graph encodes the full temporal sequence, and we can compute and propagate derivatives for all intermediates, we can now back-propagate a gradient for this loss to any earlier variable, and hence also to the initial velocity state that we're interested in (and that we already declared as a `variable()` to TF above).\n",
+"As the graph encodes the full temporal sequence, and we can compute and propagate derivatives for all intermediates, we can now backpropagate a gradient for this loss to any earlier variable, and hence also to the initial velocity state that we're interested in (and that we already declared as a `variable()` to TF above).\n",
 "\n",
 "Also, in contrast to the PINN example, we don't need additional terms for enforcing the boundary conditions; these are already present in the simulation, and hence encoded in the execution graph for the `states` sequence above.\n",
 "\n",
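The same mechanism can also be pointed at the initial state itself. A hedged sketch of that setup (again PyTorch for compactness, whereas the notebook declares the initial velocity as a TF `variable()`; the `step` function is a made-up stand-in for one solver step):

```python
import torch

def step(u):
    # made-up placeholder for one differentiable solver step
    return torch.roll(u, 1) * 0.99

u0 = torch.zeros(32, requires_grad=True)              # initial state to optimize
target = torch.sin(torch.linspace(0.0, 6.283, 32))    # desired end state
opt = torch.optim.SGD([u0], lr=0.5)

for _ in range(100):
    u = u0
    for _ in range(20):          # unroll the temporal sequence
        u = step(u)
    loss = ((u - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()              # gradient flows back through all steps to u0
    opt.step()
```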
@@ -37,7 +37,7 @@ The following table summarizes these findings:

 | Method | ✅ Pro | ❌ Con |
 |----------|-------------|------------|
-| **PINN** | - Analytic derivatives via back-propagation | - Expensive evaluation of NN, as well as derivative calculations |
+| **PINN** | - Analytic derivatives via backpropagation | - Expensive evaluation of NN, as well as derivative calculations |
 | | - Simple to implement | - Incompatible with existing numerical methods |
 | | | - No control of discretization |
 | **DiffPhys** | - Leverage existing numerical methods | - More complicated to implement |
diffphys.md
@@ -9,7 +9,7 @@ that to "differentiable physics" (DP).
 The central goal of this method is to use existing numerical solvers, and equip
 them with functionality to compute gradients with respect to their inputs.
 Once this is realized for all operators of a simulation, we can leverage
-the autodiff functionality of DL frameworks with back-propagation to let gradient
+the autodiff functionality of DL frameworks with backpropagation to let gradient
 information flow from a simulator into an NN and vice versa. This has numerous
 advantages such as improved learning feedback and generalization, as we'll outline below.
@@ -95,7 +95,7 @@ $
 $.
 If we'd need to construct and store all full Jacobian matrices that we encounter during training,
 this would cause huge memory overheads and unnecessarily slow down training.
-Instead, for back-propagation, we can provide faster operations that compute products
+Instead, for backpropagation, we can provide faster operations that compute products
 with the Jacobian transpose because we always have a scalar loss function at the end of the chain.

 Given the formulation above, we need to resolve the derivatives
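As an illustration of operations that provide Jacobian-transpose products instead of full Jacobians, here is a small sketch with a custom operator; the linear map `MatVec` and all names are hypothetical, chosen only to show the mechanism:

```python
import torch

class MatVec(torch.autograd.Function):
    """y = A x, with a hand-written backward that never builds the full Jacobian."""

    @staticmethod
    def forward(ctx, A, x):
        ctx.save_for_backward(A)
        return A @ x

    @staticmethod
    def backward(ctx, grad_out):
        (A,) = ctx.saved_tensors
        # Jacobian-transpose times the incoming vector, i.e. a single mat-vec product.
        # Returning None for A means we are "greedy" and never compute dL/dA.
        return None, A.t() @ grad_out

A = torch.randn(128, 128)
x = torch.randn(128, requires_grad=True)
loss = MatVec.apply(A, x).sum()
loss.backward()          # backward receives dL/dy, a single vector, not a matrix
print(x.grad.shape)      # torch.Size([128])
```

Because the loss at the end of the chain is scalar, the quantity arriving in `backward` is always a vector of the output's shape, so a matrix-vector product suffices at every step.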
@@ -134,7 +134,7 @@ use these operators to realize our physics solver?_"

 It's true that this would theoretically be possible. The problem here is that each of the vector and matrix
 operations in tensorflow and pytorch is computed individually, and internally needs to store the current
-state of the forward evaluation for back-propagation (the "$g(x)$" above). For a typical
+state of the forward evaluation for backpropagation (the "$g(x)$" above). For a typical
 simulation, however, we're not overly interested in every single intermediate result our solver produces.
 Typically, we're more concerned with significant updates such as the step from $\mathbf{u}(t)$ to $\mathbf{u}(t+\Delta t)$.
@@ -158,7 +158,7 @@ Also, in practice we can be _greedy_ with the derivative operators, and only
 provide those which are relevant for the learning task. E.g., if our network
 never produces the parameter $\nu$ in the example above, and it doesn't appear in our
 loss formulation, we will never encounter a $\partial/\partial \nu$ derivative
-in our back-propagation step.
+in our backpropagation step.

 ---
@@ -287,22 +287,22 @@ velocity field. This step is typically crucial to enforce the hard-constraint $\
 and is often called _Chorin Projection_, or _Helmholtz decomposition_. It is also closely related to the fundamental theorem of vector calculus.

 If we now introduce an NN that modifies $\mathbf{u}$ in an iterative solver, we inevitably have to
-back-propagate through the Poisson solve. I.e., we need a gradient for $\mathbf{u}^{n}$, which in this
+backpropagate through the Poisson solve. I.e., we need a gradient for $\mathbf{u}^{n}$, which in this
 notation takes the form $\partial \mathbf{u}^{n} / \partial \mathbf{u}$.

 In combination, $\mathbf{u}^{n} = \mathbf{u} - \nabla \left( (\nabla^2)^{-1} \nabla \cdot \mathbf{u} \right)$. The outer gradient (from $\nabla p$) and the inner divergence ($\nabla \cdot \mathbf{u}$) are both linear operators, and their gradients are simple to compute. The main difficulty lies in obtaining the
 matrix inverse $(\nabla^2)^{-1}$ from the Poisson equation for pressure (we'll keep it a bit simpler here, but it's often time-dependent, and non-linear).

 In practice, the matrix-vector product for $(\nabla^2)^{-1} b$ with $b=\nabla \cdot \mathbf{u}$ is not explicitly computed via matrix operations, but approximated with a (potentially matrix-free) iterative solver. E.g., conjugate gradient (CG) methods are a very popular choice here. Thus, we could treat this iterative solver as a function $S$,
-with $p = S(\nabla \cdot \mathbf{u})$. Note that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $S$ into a sequence of simpler operations $S(x) = S_n( S_{n-1}(...S_{1}(x)))$, and back-propagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and can be very slow.
+with $p = S(\nabla \cdot \mathbf{u})$. Note that matrix inversion is a non-linear process, despite the matrix itself being linear. As solvers like CG are also based on matrix and vector operations, we could decompose $S$ into a sequence of simpler operations $S(x) = S_n( S_{n-1}(...S_{1}(x)))$, and backpropagate through each of them. This is certainly possible, but not a good idea: it can introduce numerical problems, and can be very slow.
 By default DL frameworks store the internal states for every differentiable operator like the $S_i()$ in this example, and hence we'd organize and keep $n$ intermediate states in memory. These states are completely uninteresting for our original PDE, though. They're just intermediate states of the CG solver.

 If we take a step back and look at $p = (\nabla^2)^{-1} b$, its gradient $\partial p / \partial b$
-is just $((\nabla^2)^{-1})^T$. And in this case, $(\nabla^2)$ is a symmetric matrix, and so $((\nabla^2)^{-1})^T=(\nabla^2)^{-1}$. This is the identical inverse matrix that we encountered in the original equation above, and hence we can re-use our unmodified iterative solver to compute the gradient. We don't need to take it apart and slow it down by storing intermediate states. However, the iterative solver computes the matrix-vector products for $(\nabla^2)^{-1} b$. So what is $b$ during back-propagation? In an optimization setting we'll always have our loss function $L$ at the end of the forward chain. The back-propagation step will then give a gradient for the output, let's assume it is $\partial L/\partial p$ here, which needs to be propagated to the earlier operations of the forward pass. Thus, we can simply invoke our iterative solver during the backward pass to compute $\partial L / \partial b = S(\partial L/\partial p)$. And assuming that we've chosen a good solver as $S$ for the forward pass, we get exactly the same performance and accuracy in the backward pass.
+is just $((\nabla^2)^{-1})^T$. And in this case, $(\nabla^2)$ is a symmetric matrix, and so $((\nabla^2)^{-1})^T=(\nabla^2)^{-1}$. This is the identical inverse matrix that we encountered in the original equation above, and hence we can re-use our unmodified iterative solver to compute the gradient. We don't need to take it apart and slow it down by storing intermediate states. However, the iterative solver computes the matrix-vector products for $(\nabla^2)^{-1} b$. So what is $b$ during backpropagation? In an optimization setting we'll always have our loss function $L$ at the end of the forward chain. The backpropagation step will then give a gradient for the output, let's assume it is $\partial L/\partial p$ here, which needs to be propagated to the earlier operations of the forward pass. Thus, we can simply invoke our iterative solver during the backward pass to compute $\partial L / \partial b = S(\partial L/\partial p)$. And assuming that we've chosen a good solver as $S$ for the forward pass, we get exactly the same performance and accuracy in the backward pass.

 If you're interested in a code example, the [differentiate-pressure example](https://github.com/tum-pbs/PhiFlow/blob/master/demos/differentiate_pressure.py) of phiflow uses exactly this process for an optimization through a pressure projection step: a flow field that is constrained on the right side is optimized for the content on the left, such that it matches the target on the right after a pressure projection step.

-The main take-away here is: it is important _not to blindly back-propagate_ through the forward computation, but to think about which steps of the analytic equations for the forward pass to compute gradients for. In cases like the above, we can often find improved analytic expressions for the gradients, which we can then compute numerically.
+The main take-away here is: it is important _not to blindly backpropagate_ through the forward computation, but to think about which steps of the analytic equations for the forward pass to compute gradients for. In cases like the above, we can often find improved analytic expressions for the gradients, which we can then compute numerically.

 ```{admonition} Implicit Function Theorem & Time
 :class: tip
@@ -311,7 +311,7 @@ The main take-away here is: it is important _not to blindly back-propagate_ thro
 The process above essentially yields an _implicit derivative_. Instead of explicitly deriving all forward steps, we've relied on the [implicit function theorem](https://en.wikipedia.org/wiki/Implicit_function_theorem) to compute the derivative.

 **Time**: we _can_ actually consider the steps of an iterative solver as a virtual "time",
-and back-propagate through them. In line with other DP approaches, this enabled an NN to _interact_ with an iterative solver. An example is to learn initial guesses of CG solvers from {cite}`um2020sol`,
+and backpropagate through these steps. In line with other DP approaches, this enabled an NN to _interact_ with an iterative solver. An example is to learn initial guesses of CG solvers from {cite}`um2020sol`,
 [details can be found here](https://github.com/tum-pbs/CG-Solver-in-the-Loop).
 ```
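To illustrate the re-use of the unmodified iterative solver for the backward pass discussed above, here is a deliberately simplified sketch; `cg_solve`, `PoissonSolve`, and the small SPD matrix standing in for $\nabla^2$ are hypothetical, and this is not phiflow's actual implementation:

```python
import torch

def cg_solve(A, b, iters=50):
    # plain conjugate gradient for A x = b, treated as a black box
    x = torch.zeros_like(b)
    r = b - A @ x
    d = r.clone()
    for _ in range(iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        if torch.sqrt(r_new @ r_new) < 1e-10:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x

class PoissonSolve(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, b):
        ctx.save_for_backward(A)
        return cg_solve(A, b)          # the CG iterations are not recorded by autograd

    @staticmethod
    def backward(ctx, grad_p):
        (A,) = ctx.saved_tensors
        # A is symmetric, so dL/db = A^{-T} dL/dp = A^{-1} dL/dp:
        # the very same solver is invoked again, now on the incoming gradient.
        return None, cg_solve(A, grad_p)

n = 8
M = torch.randn(n, n)
A = M @ M.t() + n * torch.eye(n)       # small SPD stand-in for the Laplacian
b = torch.randn(n, requires_grad=True)
loss = PoissonSolve.apply(A, b).sum()
loss.backward()                        # backward pass = one more CG solve
print(b.grad)
```

No intermediate CG states are stored, and the backward pass inherits whatever accuracy and performance the forward solver has.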
@@ -680,7 +680,7 @@
 "= \\mathbf{z}(x) - \\eta \\frac{\\partial L}{\\partial \\mathbf{z}} (\\frac{\\partial \\mathbf{z}}{\\partial x})^2 + \\mathcal{O}( h^2 )\n",
 "$.\n",
 "\n",
-"And $\\frac{\\partial L}{\\partial \\mathbf{z}} (\\frac{\\partial \\mathbf{z}}{\\partial \\mathbf{x}})^2$ clearly differs from the step $\\frac{\\partial L}{\\partial \\mathbf{z}}$ we would compute during the back-propagation pass in GD for $\\mathbf{z}$.\n",
+"And $\\frac{\\partial L}{\\partial \\mathbf{z}} (\\frac{\\partial \\mathbf{z}}{\\partial \\mathbf{x}})^2$ clearly differs from the step $\\frac{\\partial L}{\\partial \\mathbf{z}}$ we would compute during the backpropagation pass in GD for $\\mathbf{z}$.\n",
 "\n",
 "**Newton's method** does not fare much better: we compute first-order derivatives like for GD, and the second-order derivatives for the Hessian for the full process. But since both are approximations, the actual intermediate states resulting from an update step are unknown until the full chain is evaluated. In the _Consistency in function compositions_ paragraph for Newton's method in {doc}`physgrad` the squared $\\frac{\\partial \\mathbf{z}}{\\partial \\mathbf{x}}$ term for the Hessian already indicated this dependency.\n",
 "\n",
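For reference, the chain-rule steps behind the squared $\partial \mathbf{z}/\partial x$ term above can be spelled out as follows (a plain first-order expansion, consistent with the equation shown in the diff):

```latex
\Delta x = -\eta \frac{\partial L}{\partial x}
         = -\eta \frac{\partial L}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial x},
\qquad
\mathbf{z}(x + \Delta x)
  = \mathbf{z}(x) + \frac{\partial \mathbf{z}}{\partial x} \Delta x + \mathcal{O}(h^2)
  = \mathbf{z}(x) - \eta \frac{\partial L}{\partial \mathbf{z}} \Big( \frac{\partial \mathbf{z}}{\partial x} \Big)^2 + \mathcal{O}(h^2).
```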
@@ -75,7 +75,7 @@ in {doc}`overview`. Here, we can use the same tools to compute spatial derivativ
 Note that above for $R$ we've written this derivative in the shortened notation as $\mathbf{u}_{x}$.
 For functions over time this of course also works for $\partial \mathbf{u} / \partial t$, i.e. $\mathbf{u}_{t}$ in the notation above.

-Thus, for some generic $R$, made up of $\mathbf{u}_t$ and $\mathbf{u}_{x}$ terms, we can rely on the back-propagation algorithm
+Thus, for some generic $R$, made up of $\mathbf{u}_t$ and $\mathbf{u}_{x}$ terms, we can rely on the backpropagation algorithm
 of DL frameworks to compute these derivatives once we have an NN that represents $\mathbf{u}$. Essentially, this gives us a
 function (the NN) that receives space and time coordinates to produce a solution for $\mathbf{u}$. Hence, the input is typically
 quite low-dimensional, e.g., 3+1 values for a 3D case over time, and often produces a scalar value or a spatial vector.
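The derivatives $\mathbf{u}_x$ and $\mathbf{u}_t$ of such an NN can be obtained directly from the framework's autodiff. A minimal sketch (the tiny network, the sample points, and the advection-style residual are illustrative assumptions, not the text's actual setup):

```python
import torch

u_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

x = torch.rand(64, 1, requires_grad=True)   # space coordinates
t = torch.rand(64, 1, requires_grad=True)   # time coordinates
u = u_net(torch.cat([x, t], dim=1))         # u(x, t) represented by the NN

ones = torch.ones_like(u)
u_x = torch.autograd.grad(u, x, grad_outputs=ones, create_graph=True)[0]  # du/dx
u_t = torch.autograd.grad(u, t, grad_outputs=ones, create_graph=True)[0]  # du/dt

c = 1.0                                     # hypothetical advection speed
residual = u_t + c * u_x                    # a residual R built from u_t and u_x terms
loss = (residual ** 2).mean()               # physics part of a PINN loss
loss.backward()                             # trains u_net to reduce the residual
```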