From d5f11a58c906a51cafd132148890d7f312da77a2 Mon Sep 17 00:00:00 2001
From: bobarna
Date: Sun, 11 Sep 2022 10:25:40 +0200
Subject: [PATCH] Fix minor typos

---
 diffphys.md | 6 +++---
 notation.md | 2 +-
 physgrad.md | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/diffphys.md b/diffphys.md
index 2f80387..ead5036 100644
--- a/diffphys.md
+++ b/diffphys.md
@@ -227,7 +227,7 @@ Once things are working with GD, we can relatively easily switch to better optim
 an NN into the picture, hence it's always a good starting point.
 To make things easier to read below, we'll omit the transpose of the Jacobians in the following.
 Unfortunately, the Jacobian is defined this way, but we actually never need the un-transposed one.
-Keep in mind that in practice we're dealing with tranposed Jacobians $\big( \frac{ \partial a }{ \partial b} \big)^T$
+Keep in mind that in practice we're dealing with transposed Jacobians $\big( \frac{ \partial a }{ \partial b} \big)^T$
 that are "abbreviated" by $\frac{ \partial a }{ \partial b}$.
 
 As the discretized velocity field $\mathbf{u}$ contains all our degrees of freedom,
@@ -341,7 +341,7 @@ e.g., one could try to insert equation {eq}`eq:advection` at time $t-\Delta t$
 into equation {eq}`eq:advection` at time $t$ and repeat this process recursively until
 we have a single expression relating $d^{~0}$ to the targets. However, thanks
 to the linear nature of the Jacobians, we treat each advection step, i.e.,
-each invocation of our PDE $\mathcal P$ as a seperate, modular
+each invocation of our PDE $\mathcal P$ as a separate, modular
 operation. And each of these invocations follows the procedure described
 in the previous section.
 
@@ -380,7 +380,7 @@ at first, but looking closely, each line simply adds an additional Jacobian for
 This follows from the chain rule, as shown in the two-operator case above.
 So the terms of the sum contain a lot of similar Jacobians, and in practice can be computed efficiently
 by backtracing through the sequence of computational steps that resulted from the forward evaluation of our PDE.
-(Note that, as mentioned above, we've omitted the tranpose of the Jacobians here.)
+(Note that, as mentioned above, we've omitted the transpose of the Jacobians here.)
 
 This structure also makes clear that the process is very similar to the regular training process
 of an NN: the evaluations of these Jacobian vector products from nested function calls
diff --git a/notation.md b/notation.md
index 0ac5a8d..e1766fd 100644
--- a/notation.md
+++ b/notation.md
@@ -22,7 +22,7 @@
 
 ## Summary of the most important abbreviations:
 
-| ABbreviation | Meaning |
+| Abbreviation | Meaning |
 | --- | --- |
 | BNN | Bayesian neural network |
 | CNN | Convolutional neural network |
diff --git a/physgrad.md b/physgrad.md
index a0733d0..af5962a 100644
--- a/physgrad.md
+++ b/physgrad.md
@@ -50,7 +50,7 @@ name: physgrad-scaling
 Loss landscapes in $x$ for different $\alpha$ of the 2D example problem. The green arrows visualize an example update step $- \nabla_x$ (not exactly to scale) for each case.
 ```
 
-However, within this book we're targeting _physical_ learning problems, and hence we have physical functions integrated into the learning process, as discussed at length for differentiable physics approaches. This is fundamentally different! Physical processes pretty much always introduce different scaling behavor for different components: some changes in the physical state are sensitive and produce massive responses, others have barely any effect. In our toy problem we can mimic this by choosing different values for $\alpha$, as shown in the middle and right graphs of the figure above.
+However, within this book we're targeting _physical_ learning problems, and hence we have physical functions integrated into the learning process, as discussed at length for differentiable physics approaches. This is fundamentally different! Physical processes pretty much always introduce different scaling behavior for different components: some changes in the physical state are sensitive and produce massive responses, others have barely any effect. In our toy problem we can mimic this by choosing different values for $\alpha$, as shown in the middle and right graphs of the figure above.
 For larger $\alpha$, the loss landscape away from the minimum steepens along $x_2$. $x_1$ will have an increasingly different scale than $x_2$. As a consequence, the gradients grow along this $x_2$. If we don't want our optimization to blow up, we'll need to choose a smaller learning rate $\eta$, reducing progress along $x_1$. The gradient of course stays perpendicular to the loss. In this example we'll move quickly along $x_2$ until we're close to the x axis, and then only very slowly creep left towards the minimum.
 Even worse, as we'll show below, regular updates actually apply the square of the scaling!
 And in settings with many dimensions, it will be extremely difficult to find a good learning rate.