The main goal is to put DL into the context of these classical methods. While we'll focus on DL, we will also return to the classical algorithms as a basis for improved learning algorithms later on in this book. Physics simulations tend to exacerbate the difficulties caused by neural networks, which is why the topics below have a particular relevance for physics-based learning tasks.
*Deep-dive Chapter*: This chapter is a deep dive for those interested in the theory of different optimizers. It skips evaluations as well as source code, and instead focuses on theory. The chapter is highly recommended as a basis for the chapters of {doc}`physgrad`. However, it is not "mandatory" for getting started with topics like training via _differentiable physics_. If you'd rather get started quickly with practical aspects, feel free to skip ahead to {doc}`supervised`.
Applying $\partial / \partial x$ once to $L$ yields the Jacobian $J(x)$. As $L$ is scalar, $J$ is a row vector, and the gradient (column vector) $\nabla L$ is given by $J^T$.
Applying $\partial / \partial x$ again gives the Hessian matrix $H(x)$,
and another application of $\partial / \partial x$ gives the third derivative tensor denoted by $K(x)$. We luckily never need to compute $K$ as a full tensor, but it is needed for some of the derivations below. To shorten the notation below, we will drop the argument and simply write $J$ and $H$ whenever they are evaluated at the current position $x$.
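In short:

$$
J(x) = \frac{\partial L}{\partial x} , \qquad
\nabla L = J(x)^T , \qquad
H(x) = \frac{\partial^2 L}{\partial x^2} , \qquad
K(x) = \frac{\partial^3 L}{\partial x^3} .
$$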
Now we can start with arguably the most classic algorithm for optimization: _Newton's method_. It is derived
by approximating the function we're interested in as a parabola. This can be motivated by the fact that pretty much every minimum looks like a parabola close up.
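Concretely, minimizing a local quadratic (parabolic) approximation of $L$ around the current point $x$ yields the Newton step, which is applied with a step size $\lambda$:

$$
L(x+\Delta) \approx L(x) + J \Delta + \frac{1}{2} \Delta^T H \Delta
\qquad \Rightarrow \qquad
\Delta = - H^{-1} J^T , \quad x_{n+1} = x_n + \lambda \, \Delta .
$$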
In the next sequence of steps, we will first employ $(H\Delta)^T \Delta = \|H^{\frac{1}{2}}\Delta\|^2$. This term will be shortened to $\epsilon \equiv \| H^{\frac{1}{2}} \Delta \|^2$ in the second line below. Due to the properties of $H$, this $\epsilon$ "just" represents a small positive factor that will stick around till the end.
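Spelled out, this identity uses the fact that a symmetric, positive-definite $H$ has a symmetric square root $H^{\frac{1}{2}}$:

$$
(H\Delta)^T \Delta = \Delta^T H \Delta = \Delta^T H^{\frac{1}{2}} H^{\frac{1}{2}} \Delta = \| H^{\frac{1}{2}} \Delta \|^2 \ge 0 .
$$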
Afterwards, in lines 3 and 4 below, we can start finding an upper bound
for the change of the loss. We'll first use a Cauchy-Schwarz inequality, and then
make use of a special Lipschitz condition for affine conjugate matrices. For $H$, it takes the form $\| H(x)^{-\frac{1}{2}}\big( H(y)-H(x) \big) (y-x) \| \le \mathcal L \| H(x)^{\frac{1}{2}} (y-x) \|^2$. This requires $H$ to be symmetric and positive-definite, which isn't too unreasonable in practice.
Continuing from above, we get:
$$\begin{aligned}
\cdots
&= - \big( \lambda -\frac{\lambda^2}{2} \big) \| H^{\frac{1}{2}} \Delta \|^2 + \lambda^2 \int_0^1 \text{d}s ~ s \int_0^1 \text{d}t \big[ [ H(x + s t \lambda \Delta) - H ] ~ \Delta \big]^T H^{-\frac{1}{2}} H^{\frac{1}{2}} \Delta \\
&= - \lambda \epsilon +\frac{\lambda^2 \epsilon}{2} + \lambda^2 \int_0^1 \text{d}s ~ s \int_0^1 \text{d}t \big[ H^{-\frac{1}{2}} [ H(x + s t \lambda \Delta) - H ] ~ \Delta \big]^T H^{\frac{1}{2}} \Delta \\
&\le - \lambda \epsilon +\frac{\lambda^2 \epsilon}{2} + \lambda^2 \int_0^1 \text{d}s ~ s \int_0^1 \text{d}t \big\| H^{-\frac{1}{2}} [ H(x + s t \lambda \Delta) - H ] ~ \Delta \big\| ~ \| H^{\frac{1}{2}} \Delta \| \\
&\le - \lambda \epsilon +\frac{\lambda^2 \epsilon}{2} + \mathcal L \lambda^3 \int_0^1 \text{d}s ~ s^2 \int_0^1 \text{d}t ~ t ~ \| H^{\frac{1}{2}} \Delta \|^2 ~ \| H^{\frac{1}{2}} \Delta \| \\
&= - \lambda \epsilon +\frac{\lambda^2 \epsilon}{2} + \mathcal L \lambda^3 \| H^{\frac{1}{2}} \Delta \|^3 \int_0^1 \text{d}s ~ s^2 \int_0^1 \text{d}t ~ t \\
&= - \lambda \epsilon +\frac{\lambda^2 \epsilon}{2} + \mathcal L \lambda^3 \epsilon^{\frac{3}{2}} \int_0^1 \text{d}s ~ s^2 \int_0^1 \text{d}t ~ t \\
&= - \lambda \epsilon +\frac{\lambda^2 \epsilon}{2} + \frac{\mathcal L \lambda^3}{6} \epsilon^{\frac{3}{2}} .
\end{aligned}$$
In line 4, we've applied the special Lipschitz condition and pulled $s,t$ and $\lambda$ out of the integrals as far as possible. The last three lines then simplify the terms, express the occurrences of $H$ in terms of $\epsilon$, and evaluate the integrals. This leaves us with a cubic form in the step size $\lambda$. Most importantly, the first, linear term is negative, and hence will dominate for small $\lambda$. With this we've shown that the change of the loss will be negative, i.e. $L(x+\lambda\Delta) < L(x)$, for sufficiently small $\lambda$.
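To make this explicit, the bound above can be factored as

$$
L(x+\lambda \Delta) - L(x) \le - \lambda \, \epsilon \Big( 1 - \frac{\lambda}{2} - \frac{\mathcal L \lambda^2}{6} \sqrt{\epsilon} \Big) ,
$$

which is negative for every $\lambda > 0$ that is small enough to keep the bracket positive.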
To conclude, we've shown that Newton's method with an adaptive step size provably converges, which is great. However, it requires the Hessian $H$ as a central ingredient. Unfortunately, $H$ is very difficult to obtain in practice. This is a real showstopper, and it motivates the following methods: they keep the basic step of Newton's method, but approximate $H$.
$\frac{ J(x_{n+1})^T - J(x_{n})^T }{\Delta}$ with the assumption that the current Jacobian is zero.
Keep in mind that $\Delta$ is a vector here (see the division by a vector above), so the finite difference gives a matrix of size $N \times N$ that can be added to $H_n$, as sketched below.
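As an illustration, one standard way to realize this finite-difference update is the rank-one correction used in Broyden's method, which satisfies the secant condition $H_{n+1} \Delta = J(x_{n+1})^T - J(x_{n})^T$:

$$
H_{n+1} = H_n + \frac{\big( J(x_{n+1})^T - J(x_{n})^T - H_n \Delta \big) \, \Delta^T}{\Delta^T \Delta} .
$$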
The last line of equation {eq}`gauss-newton-approx` means we are essentially approximating the Hessian with the square of the Jacobian. This is reasonable in many scenarios, and inserting it into our update step above means that
the update is based on an approximate inverse of the Jacobian of $L$.
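Concretely, inserting $H \approx J^T J$ into the Newton step from above gives an update of the form

$$
x_{n+1} = x_n - \lambda \, (J^T J)^{-1} J^T ,
$$

where $(J^T J)^{-1} J^T$ acts as a (pseudo-)inverse of $J$; for a square, invertible Jacobian it reduces exactly to $J^{-1}$.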
This inverse gives an approximately equal step size in all parameters, and
as such provides an interesting building block that we will revisit in later chapters.
In the form above, it still means we have to invert a large matrix,
which typically requires on the order of $O(N^3)$ operations, and the matrix itself may not even be invertible.
---
## Back to Deep Learning
We've shown above that Newton's method lies at the heart of many popular algorithms for non-linear optimization.
Gauss-Newton now finally provides us with a stepping stone back towards Deep Learning algorithms, specifically, to the Adam optimizer.
## Adam
As usual, we start with a Newton step $\Delta = - \lambda \frac{J^T}{H}$, but even the simplest approximation
of $H \approx J^T J$ from Gauss-Newton requires inverting a potentially huge matrix. This is not feasible for the weights of neural networks, and hence a valid question is: how can we simplify this step further? For Adam, the answer is a diagonal approximation. Specifically, Adam uses a step of the form

$$
\Delta = - \lambda \, \frac{J^T}{\sqrt{\text{diag}(J^T J)}} .
$$
This is a very rough approximation of the true Hessian. We're simply using the squared first derivatives here, and in general, of course, $\Big( \frac{\partial f}{\partial x} \Big)^2 \ne \frac{\partial^2 f}{\partial x^2}$.
This only holds for the first-order approximation from Gauss-Newton, i.e., the first term of equation {eq}`gauss-newton-approx`. Now Adam goes a step further, and only keeps the diagonal of $J^T J$. This quantity is readily available in deep learning in the form of the gradient of the weights, and makes the inversion of $H$ trivial. As a result, it at least provides some estimate of the curvature of the individual weights, but neglects their correlations.
This makes the estimates more robust, which is crucial: a normalization with an erroneously small entry of the gradient could otherwise lead to an explosion. Adam additionally adds a small constant when dividing, and the square-root likewise helps to mitigate overshoot.
To summarize: Adam makes use of a first-order update with a diagonal Gauss-Newton approximation of the Hessian for normalization. It additionally employs momentum for stabilization.
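To make this summary concrete, here is a minimal, illustrative numpy sketch of such an update rule; the variable names and default hyperparameters follow the usual Adam conventions and are not taken from this chapter:

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for parameters x with gradient grad.

    m and v are the running first and second moment estimates,
    t is the current step count (starting at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad       # momentum: running average of J^T
    v = beta2 * v + (1.0 - beta2) * grad**2    # running average of the squared entries of J, i.e. diag(J^T J)
    m_hat = m / (1.0 - beta1**t)               # bias corrections for the zero-initialized averages
    v_hat = v / (1.0 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # first-order update with diagonal normalization
    return x, m, v
```

The division by `np.sqrt(v_hat)` is the diagonal normalization discussed above, and `eps` is the small constant that prevents divisions by near-zero entries.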
By choosing $\lambda \le \frac{1}{\mathcal L}$, we can simplify these terms further and obtain an upper bound that depends on the squared gradient: $L(x+\lambda \Delta) \le L(x) - \frac{ \lambda}{2} \| J \|^2 .$
This guarantees a decrease of the loss at every step with a non-zero gradient, and thus ensures convergence.
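For reference, this bound follows from the standard descent lemma for a loss with $\mathcal L$-Lipschitz gradient, applied to the GD step $\Delta = -J^T$:

$$
L(x + \lambda \Delta) \le L(x) - \lambda \| J \|^2 + \frac{\mathcal L \lambda^2}{2} \| J \|^2 \le L(x) - \frac{\lambda}{2} \| J \|^2 \qquad \text{for } \lambda \le \frac{1}{\mathcal L} .
$$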
This result unfortunately does not help us much in practice, as for all common usage of GD in deep learning $\mathcal L$ is not known. It is still good to know that a Lipschitz constant for the gradient would theoretically provide us with convergence guarantees for GD.
With this we conclude our tour of classical optimizers and their relation to deep learning methods. It's worth noting that we've focused on non-stochastic algorithms here for clarity, as the proofs would become more involved for stochastic algorithms.