diff --git a/make-pdf.sh b/make-pdf.sh
index 2c6aeae..a788ca6 100755
--- a/make-pdf.sh
+++ b/make-pdf.sh
@@ -42,7 +42,9 @@ pdflatex book
 pdflatex book
 
 # for convenience, archive results in main dir
-mv book.pdf ../../book-pdflatex.pdf
+mv book.pdf ../../pbfl-book-pdflatex.pdf
 tar czvf ../../pbdl-latex-for-arxiv.tar.gz *
+cd ../..
+ls -l ./pbfl-book-pdflatex.pdf ./pbdl-latex-for-arxiv.tar.gz
diff --git a/overview-optconv.md b/overview-optconv.md
index a13afe2..48fccc1 100644
--- a/overview-optconv.md
+++ b/overview-optconv.md
@@ -1,14 +1,14 @@
 Optimization and Convergence
 ============================
 
-This chapter will give an overview of the derivations of different classic algorithms from optimization.
-In contrast to other works, we'll start with _the_ classic optimization algorithm, Newton's method,
+This chapter will give an overview of the derivations for different optimization algorithms.
+In contrast to other texts, we'll start with _the_ classic optimization algorithm, Newton's method,
 derive several widely used variants from it, before coming back full circle to deep learning (DL) optimizers.
-The main goal is the put DL into the context of these classical methods, which we'll revisit
-for improved learning algorithms later on in this book. Physics simulations exaggerate the difficulties caused by neural networks, which is why the topics below have a particular relevance for physics-based learning tasks.
+The main goal is to put DL into the context of these classical methods. While we'll focus on DL, we will also revisit
+the classical algorithms for improved learning algorithms later on in this book. Physics simulations exaggerate the difficulties caused by neural networks, which is why the topics below have a particular relevance for physics-based learning tasks.
 
 ```{note}
-*Deep-dive Chapter*: This chapter is a deep dive for those interested in the theory of different optimizers. It will skip evaluations as well as source code, and instead focus on derivations. The chapter is highly recommend as a basis for the chapters of {doc}`physgrad`. However, it is not "mandatory" for getting started with topics like training via _differentiable physics_. If you'd rather quickly get started with practical aspects, feel free to skip ahead to {doc}`supervised`.
+*Deep-dive Chapter*: This chapter is a deep dive for those interested in the theory of different optimizers. It will skip evaluations as well as source code, and instead focus on theory. The chapter is highly recommended as a basis for the chapters of {doc}`physgrad`. However, it is not "mandatory" for getting started with topics like training via _differentiable physics_. If you'd rather quickly get started with practical aspects, feel free to skip ahead to {doc}`supervised`.
 ```
 
 
@@ -16,13 +16,13 @@ for improved learning algorithms later on in this book. Physics simulations exag
 This chapter uses a custom notation that was carefully chosen to give a clear and brief representation
 of all methods under consideration.
 
-We have a scalar loss function
-$L(x): \mathbb R^n \rightarrow \mathbb R^1$, the optimum (the minimum of $L$) is at location $x^*$,
+We have a scalar loss function $L(x): \mathbb R^N \rightarrow \mathbb R$, the optimum (the minimum of $L$) is at location $x^*$,
 and $\Delta$ denotes a step in $x$. Different intermediate update steps in $x$ are denoted by a subscript,
 e.g., as $x_n$ or $x_k$.
 In the following, we often need inversions, i.e. a division by a certain quantity.
 For vectors $a$ and $b$,
-we define this as $\frac{a}{b} \equiv b^{-1}a$. With $a$ and $b$ being vectors, the result naturally is a matrix:
+we define $\frac{a}{b} \equiv b^{-1}a$.
+When $a$ and $b$ are vectors, the result is a matrix obtained with one of the two equivalent formulations below:
 $$
 \frac{a}{b} \equiv \frac{a a^T}{a^T b } \equiv \frac{a b^T }{b^T b}
 $$ (vector-division)
@@ -33,10 +33,10 @@
 Applying $\partial / \partial x$ once to $L$ yields the Jacobian $J(x)$. As $L$ is scalar, $J$ is a row vector,
 and the gradient (column vector) $\nabla L$ is given by $J^T$.
 Applying $\partial / \partial x$ again gives the Hessian matrix $H(x)$,
-and the third $\partial / \partial x$ gives the third derivative tensor denoted by $K(x)$. We luckily never need to compute $K$ as a full tensor, but it is needed for some of the derivations below. To shorten the notation below,
+and another application of $\partial / \partial x$ gives the third derivative tensor denoted by $K(x)$. We luckily never need to compute $K$ as a full tensor, but it is needed for some of the derivations below. To shorten the notation below,
 we'll typically drop the $(x)$ when a function or derivative is evaluated at location $x$, e.g., $J$ will denote $J(x)$.
 
-The following image gives an overview of the resulting matrix shapes for some of the commonly used quantities below.
+The following image gives an overview of the resulting matrix shapes for some of the commonly used quantities.
 We don't really need it afterwards, but for this figure $N$ denotes the dimension of $x$, i.e. $x \in \mathbb R^N$.
 
 ![opt conv shapes pic](resources/overview-optconv-shapes.png)
@@ -46,7 +46,7 @@ We don't really need it afterwards, but for this figure $N$ denotes the dimensio
 
 We'll need a few tools for the derivations below, which are summarized here for reference.
 
-Not surprisingly, we'll need some Taylor-series expansions, which, using the notation above can be written as:
+Not surprisingly, we'll need some Taylor-series expansions. With the notation above, the expansion of $L$ reads:
 
 $$L(x+\Delta) = L + J \Delta + \frac{1}{2} H \Delta^2 + \cdots$$
 
@@ -64,30 +64,29 @@ $ u^T v < |u| \cdot |v| $.
 
 ## Newton's method
 
-Now we can get started with arguably one of the most classic algorithms: Newton's method. It is derived
-by approximating the function we're interested in as a parabola. This can be motivated by the fact that pretty much every minimum looks like a parabola if we're close enough.
+Now we can start with arguably the most classic algorithm for optimization: _Newton's method_. It is derived
+by approximating the function we're interested in as a parabola. This can be motivated by the fact that pretty much every minimum looks like a parabola close up.
 
 ![parabola linear](resources/overview-optconv-parab.png)
 
-So we can represent $L$ around an optimum $x^*$ by parabola of the form $L(x) = \frac{1}{2} H(x-x^*)^2 + c$,
+So we can represent $L$ near an optimum $x^*$ by a parabola of the form $L(x) = \frac{1}{2} H(x-x^*)^2 + c$,
 where $c$ denotes a constant offset. At location $x$ we observe $H$ and
-$J^T=H \cdot (x_k-x^*)$. Re-arranging this yields $x^* = x_k - \frac{J^T}{H}$.
-Newton's method by default computes $x^*$ in a single step.
-Thus, the update in $x$ of Newton's method is given by:
+$J^T=H \cdot (x_k-x^*)$. Re-arranging this directly yields an equation to compute the minimum: $x^* = x_k - \frac{J^T}{H}$.
+Newton's method by default computes $x^*$ in a single step, and
+hence the update in $x$ of Newton's method is given by:
 
 $$
 \Delta = - \frac{J^T}{H}
 $$ (opt-newton)
 
 Let's look at the order of convergence of Newton's method.
-Let $x^*$ be an optimum of $L$, with
-$\Delta_n^* = x^* - x_n$, as illustrated below.
+For an optimum $x^*$ of $L$, let
+$\Delta_n^* = x^* - x_n$ denote the step from a current $x_n$ to the optimum, as illustrated below.
 
 ![newton x-* pic](resources/overview-optconv-minimum.png)
 
 Assuming differentiability of $J$,
-we can perform the Lagrange expansion of $J$ in around $x^*$.
-Here $\Delta^*_n = x^* - x_{n+1}$ denotes the a step from the current $x$ to the optimum:
+we can perform the Lagrange expansion of $J$ at $x^*$:
 
 $$\begin{aligned}
 0 = J(x^*) &= J(x_n) + H(x_n) \Delta^*_n + \frac{1}{2} K (\xi_n ){\Delta^*_n}^2
@@ -109,7 +108,7 @@ $$\begin{aligned}
 
 Thus, the distance to the optimum changes by ${\Delta^*_n}^2$, which means
 once we're close enough we have quadratic convergence. This is great, of course,
-but it still depends on the prefactor $\frac{K}{2H}$, and will diverge if its $>1$,
+but it still depends on the pre-factor $\frac{K}{2H}$, and will diverge if it is $>1$.
 Note that this is an exact expression, there's no truncation thanks to the Lagrange expansion.
 And so far we have quadratic convergence, but the convergence to the optimum is not guaranteed.
 For this we have to allow for a variable step size.
@@ -119,27 +118,29 @@
 Thus, as a next step for Newton's method we introduce a variable step size $\lambda$
 which gives the iteration $x_{n+1} = x_n + \lambda \Delta = x_n - \lambda \frac{J^T}{H}$.
+As illustrated in the picture below, this is especially helpful if $L$ is not exactly
+a parabola, and a small $H$ might overshoot in undesirable ways. See the far left in this example:
 
 ![newton lambda step pic](resources/overview-optconv-adap.png)
 
 To make statements about convergence, we need some fundamental assumptions: convexity and
 smoothness of our loss function. Then we'll focus on showing that the loss decreases, and
-we moving along a sequence of smaller sets $\forall x ~ L(x)
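As a concrete illustration of the Newton update $\Delta = - \frac{J^T}{H}$ and the step size $\lambda$ discussed in the chapter edited above, here is a minimal NumPy sketch (not part of the diff itself). The quadratic test loss, the matrix `A`, and the optimum `x_star` are hypothetical choices, picked so that $H$ is constant and the claim that the plain Newton step ($\lambda = 1$) reaches $x^*$ in a single step is easy to verify.

```python
import numpy as np

# Illustrative sketch only: Newton's method with a step size lambda,
# x_{n+1} = x_n + lambda * Delta with Delta = -H^{-1} J^T,
# on a hypothetical quadratic loss L(x) = 1/2 (x - x*)^T A (x - x*).

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # symmetric positive definite -> H = A
x_star = np.array([1.0, -2.0])      # known optimum of the test loss

def L(x):                           # scalar loss
    d = x - x_star
    return 0.5 * d @ A @ d

def grad(x):                        # gradient J^T = A (x - x*)
    return A @ (x - x_star)

def hess(x):                        # Hessian H = A, constant for a quadratic
    return A

x = np.array([5.0, 4.0])            # starting point
lam = 1.0                           # lambda = 1 recovers the plain Newton step

for n in range(3):
    delta = -np.linalg.solve(hess(x), grad(x))   # Delta = -J^T / H
    x = x + lam * delta
    print(f"step {n}: x = {x}, L(x) = {L(x):.3e}")

# With lambda = 1 and an exactly quadratic L, x^* is reached after one step;
# for non-quadratic losses a smaller lambda guards against overshooting.
```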