Fix minor typos
parent 590a930246
commit d5f11a58c9
@@ -227,7 +227,7 @@ Once things are working with GD, we can relatively easily switch to better optim
 an NN into the picture, hence it's always a good starting point.
 To make things easier to read below, we'll omit the transpose of the Jacobians in the following.
 Unfortunately, the Jacobian is defined this way, but we actually never need the un-transposed one.
-Keep in mind that in practice we're dealing with tranposed Jacobians $\big( \frac{ \partial a }{ \partial b} \big)^T$
+Keep in mind that in practice we're dealing with transposed Jacobians $\big( \frac{ \partial a }{ \partial b} \big)^T$
 that are "abbreviated" by $\frac{ \partial a }{ \partial b}$.

 As the discretized velocity field $\mathbf{u}$ contains all our degrees of freedom,
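
As a reminder of the convention this hunk touches (illustration only, not part of the changed file): for a composition $a \mapsto b \mapsto c$, the chain rule for the transposed Jacobians reverses the order of the factors, and the abbreviated notation used in the text simply drops the transposes:

$$
\Big( \frac{\partial c}{\partial a} \Big)^T = \Big( \frac{\partial b}{\partial a} \Big)^T \Big( \frac{\partial c}{\partial b} \Big)^T
\qquad \text{abbreviated as} \qquad
\frac{\partial c}{\partial a} = \frac{\partial b}{\partial a} \, \frac{\partial c}{\partial b} .
$$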
@@ -341,7 +341,7 @@ e.g., one could try to insert equation {eq}`eq:advection` at time $t-\Delta t$
 into equation {eq}`eq:advection` at time $t$ and repeat this process recursively until
 we have a single expression relating $d^{~0}$ to the targets. However, thanks
 to the linear nature of the Jacobians, we treat each advection step, i.e.,
-each invocation of our PDE $\mathcal P$ as a seperate, modular
+each invocation of our PDE $\mathcal P$ as a separate, modular
 operation. And each of these invocations follows the procedure described
 in the previous section.

@@ -380,7 +380,7 @@ at first, but looking closely, each line simply adds an additional Jacobian for
 This follows from the chain rule, as shown in the two-operator case above.
 So the terms of the sum contain a lot of similar Jacobians, and in practice can be computed efficiently
 by backtracing through the sequence of computational steps that resulted from the forward evaluation of our PDE.
-(Note that, as mentioned above, we've omitted the tranpose of the Jacobians here.)
+(Note that, as mentioned above, we've omitted the transpose of the Jacobians here.)

 This structure also makes clear that the process is very similar to the regular training
 process of an NN: the evaluations of these Jacobian vector products from nested function calls
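
The "backtracing" mentioned in this hunk can be illustrated with a minimal NumPy sketch (not taken from the book's code; the per-step matrices, sizes, and the single end-time loss are made up for the sketch): the loss gradient is pulled backwards through the sequence of steps as one vector-Jacobian product per step, so the chained transposed Jacobians never have to be multiplied out as full matrices.

```python
import numpy as np

# Illustration only: three linear "PDE steps" d^{t} = A_t d^{t-1} (mimicking the
# linear advection steps discussed above) followed by an L2 loss against a target.
rng = np.random.default_rng(0)
n, steps = 4, 3
A = [rng.standard_normal((n, n)) * 0.5 for _ in range(steps)]   # per-step Jacobians d d^t / d d^{t-1}
target = rng.standard_normal(n)

def forward(d0):
    states = [d0]
    for A_t in A:                       # forward evaluation, one modular step at a time
        states.append(A_t @ states[-1])
    loss = 0.5 * np.sum((states[-1] - target) ** 2)
    return states, loss

d0 = rng.standard_normal(n)
states, loss = forward(d0)

# Backtracing: start from dL/d d^{last} and apply each transposed per-step Jacobian
# in reverse order -- one vector-Jacobian product per invocation of the PDE.
grad = states[-1] - target              # dL / d d^{last}
for A_t in reversed(A):
    grad = A_t.T @ grad                 # (d d^t / d d^{t-1})^T applied to the incoming vector

# Sanity check against the explicit product of the full Jacobians.
J_full = np.linalg.multi_dot(A[::-1])   # d d^{last} / d d^{0}
assert np.allclose(grad, J_full.T @ (states[-1] - target))
print("dL/dd0 via backtracing:", grad)
```

Each $A_t$ plays the role of $\partial d^{\,t} / \partial d^{\,t-1}$ for one step; reverse-mode autodiff frameworks perform exactly this sequence of transposed-Jacobian applications automatically.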
@@ -22,7 +22,7 @@

 ## Summary of the most important abbreviations:

-| ABbreviation | Meaning |
+| Abbreviation | Meaning |
 | --- | --- |
 | BNN | Bayesian neural network |
 | CNN | Convolutional neural network |
@@ -50,7 +50,7 @@ name: physgrad-scaling
 Loss landscapes in $x$ for different $\alpha$ of the 2D example problem. The green arrows visualize an example update step $- \nabla_x$ (not exactly to scale) for each case.
 ```

-However, within this book we're targeting _physical_ learning problems, and hence we have physical functions integrated into the learning process, as discussed at length for differentiable physics approaches. This is fundamentally different! Physical processes pretty much always introduce different scaling behavor for different components: some changes in the physical state are sensitive and produce massive responses, others have barely any effect. In our toy problem we can mimic this by choosing different values for $\alpha$, as shown in the middle and right graphs of the figure above.
+However, within this book we're targeting _physical_ learning problems, and hence we have physical functions integrated into the learning process, as discussed at length for differentiable physics approaches. This is fundamentally different! Physical processes pretty much always introduce different scaling behavior for different components: some changes in the physical state are sensitive and produce massive responses, others have barely any effect. In our toy problem we can mimic this by choosing different values for $\alpha$, as shown in the middle and right graphs of the figure above.

 For larger $\alpha$, the loss landscape away from the minimum steepens along $x_2$. $x_1$ will have an increasingly different scale than $x_2$. As a consequence, the gradients grow along this $x_2$. If we don't want our optimization to blow up, we'll need to choose a smaller learning rate $\eta$, reducing progress along $x_1$. The gradient of course stays perpendicular to the loss. In this example we'll move quickly along $x_2$ until we're close to the x axis, and then only very slowly creep left towards the minimum. Even worse, as we'll show below, regular updates actually apply the square of the scaling!
 And in settings with many dimensions, it will be extremely difficult to find a good learning rate.
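
The squared scaling mentioned in this hunk can be reproduced with a tiny sketch (a hypothetical stand-in for the 2D toy problem; the exact function is not shown in this diff). If the "physics" scales the second component by $\alpha$ before an L2 loss, the gradient with respect to $x_2$ picks up one factor of $\alpha$ from the residual and another from the Jacobian, i.e. $\alpha^2$ in total:

```python
import numpy as np

# Hypothetical stand-in for the 2D toy problem (the exact function is not part of
# this diff): a "physics" map that scales the second component by alpha, followed
# by an L2 loss towards a target y_star.
def loss_and_grad(x, alpha, y_star=np.zeros(2)):
    y = np.array([x[0], alpha * x[1]])           # physics: y = (x1, alpha * x2)
    diff = y - y_star
    loss = 0.5 * diff @ diff
    grad = np.array([diff[0], alpha * diff[1]])  # chain rule: J^T diff with J = diag(1, alpha)
    return loss, grad

x = np.array([3.0, 3.0])
for alpha in (1.0, 3.0, 10.0):
    _, g = loss_and_grad(x, alpha)
    # For y_star = 0, the x2 component of the gradient is alpha**2 * x2: the
    # scaling enters squared, as stated in the paragraph above.
    print(f"alpha={alpha:5.1f}  grad={g}")
```

A learning rate small enough to keep the $x_2$ update stable therefore has to shrink roughly like $1/\alpha^2$, which stalls progress along $x_1$, matching the behavior described in the hunk.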