em dash; sentence case

This commit is contained in:
jverzani
2025-07-27 15:26:00 -04:00
parent c3b221cd29
commit 33c6e62d68
59 changed files with 385 additions and 243 deletions


@@ -1,4 +1,4 @@
# Matrix Calculus
# Matrix calculus
This section illustrates a more general setting for taking derivatives, one that unifies the different expositions taken previously.
@@ -74,7 +74,7 @@ Additionally, many other set of objects form vector spaces. Certain families of
Let's take differentiable functions as an example. These form a vector space as the derivative of a linear combination of differentiable functions is defined through the simplest derivative rule: $[af(x) + bg(x)]' = a[f(x)]' + b[g(x)]'$. If $f$ and $g$ are differentiable, then so is $af(x)+bg(x)$.
A finite vector space is described by a *basis* -- a minimal set of vectors needed to describe the space, after consideration of linear combinations. For some typical vector spaces, this is the set of special vectors with $1$ as one of the entries, and $0$ otherwise.
A finite vector space is described by a *basis*---a minimal set of vectors needed to describe the space, after consideration of linear combinations. For some typical vector spaces, this is the set of special vectors with $1$ as one of the entries, and $0$ otherwise.
A key fact about a basis for a finite vector space is that every vector in the vector space can be expressed *uniquely* as a linear combination of the basis vectors. The set of numbers used in the linear combination, along with an ordering of the basis, means an element in a finite vector space can be associated with a unique coordinate vector.
@@ -88,7 +88,7 @@ Vectors and matrices have properties that are generalizations of the real number
* A vector may be viewed as a matrix. The association chosen here is the common one, through a *column* vector.
* The *transpose* of a matrix comes by permuting the rows and columns. The transpose of a column vector is a row vector, so $v\cdot w = v^T w$, where we use a superscript $T$ for the transpose. The transpose of a product, is the product of the transposes -- reversed: $(AB)^T = B^T A^T$; the tranpose of a transpose is an identity operation: $(A^T)^T = A$; the inverse of a transpose is the tranpose of the inverse: $(A^{-1})^T = (A^T)^{-1}$.
* The *transpose* of a matrix comes by permuting the rows and columns. The transpose of a column vector is a row vector, so $v\cdot w = v^T w$, where we use a superscript $T$ for the transpose. The transpose of a product is the product of the transposes---reversed: $(AB)^T = B^T A^T$; the transpose of a transpose is an identity operation: $(A^T)^T = A$; the inverse of a transpose is the transpose of the inverse: $(A^{-1})^T = (A^T)^{-1}$.
* Matrices for which $A = A^T$ are called symmetric.
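These transpose identities can be checked numerically in `Julia`; a minimal sketch with arbitrarily chosen matrices:

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
v = [1, 2]
w = [3, 4]

# the dot product as a row vector times a column vector: v ⋅ w = vᵀ w
@assert dot(v, w) == v' * w

# transpose of a product is the product of transposes, reversed
@assert (A*B)' == B' * A'

# transposing twice is the identity operation
@assert (A')' == A

# inverse of a transpose equals the transpose of the inverse
@assert inv(A') ≈ inv(A)'
```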
@@ -231,7 +231,7 @@ Various differentiation rules are still available such as the sum, product, and
### Sum and product rules for the derivative
Using the differential notation -- which implicitly ignores higher order terms as they vanish in a limit -- the sum and product rules can be derived.
Using the differential notation---which implicitly ignores higher order terms as they vanish in a limit---the sum and product rules can be derived.
For the sum rule, let $f(x) = g(x) + h(x)$. Then
@@ -377,7 +377,7 @@ Multiplying left to right (the first) is called reverse mode; multiplying right
The reason comes down to the shape of the matrices. To see why, we need to know that matrix multiplication of an $m \times q$ matrix by a $q \times n$ matrix takes on the order of $mqn$ operations.
When $m=1$, the derviative is a product of matrices of size $n\times j$, $j\times k$, and $k \times 1$ yielding a matrix of size $n \times 1$ matching the function dimension.
When $m=1$, the derivative is a product of matrices of size $n\times j$, $j\times k$, and $k \times 1$ yielding a matrix of size $n \times 1$ matching the function dimension.
The operations involved in multiplication from left to right can be quantified. The first multiplication takes $njk$ operations, leaving an $n\times k$ matrix; the next multiplication then takes another $nk\cdot 1$ operations, or $njk + nk$ in total.
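The two orders of multiplication can be tallied directly. A small sketch, where the sizes $n$, $j$, $k$ are illustrative choices:

```julia
# cost of multiplying an (m×q) matrix by a (q×n) matrix: m*q*n operations
matmul_cost(m, q, n) = m * q * n

n, j, k = 100, 50, 10    # chain of sizes (n×j)(j×k)(k×1)

# left to right ("reverse mode" in the text's labeling):
# (n×j)(j×k) costs njk, leaving n×k; then (n×k)(k×1) costs nk
left_to_right = matmul_cost(n, j, k) + matmul_cost(n, k, 1)   # njk + nk = 51000

# right to left ("forward mode"):
# (j×k)(k×1) costs jk, leaving j×1; then (n×j)(j×1) costs nj
right_to_left = matmul_cost(j, k, 1) + matmul_cost(n, j, 1)   # jk + nj = 5500
```

Which direction is cheaper depends on which end of the chain carries the dimension $1$; the counts above simply quantify the $njk + nk$ figure from the text.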
@@ -435,7 +435,7 @@ That is $f'(A)$ is the operator $f'(A)[\delta A] = A \delta A + \delta A A$. (Th
Alternatively, we can identify $A$ through its
components, as a vector in $R^{n^2}$ and then leverage the Jacobian.
One such identification is vectorization -- consecutively stacking the
One such identification is vectorization---consecutively stacking the
column vectors into a single vector. In `Julia` the `vec` function does this
operation:
@@ -444,7 +444,7 @@ operation:
vec(A)
```
The stacking by column follows how `Julia` stores matrices and how `Julia` references a matrices entries by linear index:
The stacking by column follows how `Julia` stores matrices and how `Julia` references entries in a matrix by linear index:
```{julia}
vec(A) == [A[i] for i in eachindex(A)]
@@ -562,7 +562,7 @@ all(l == r for (l, r) ∈ zip(L, R))
----
Now to use this relationship to recognize $df = A dA + dA A$ with the Jacobian computed from $\text{vec}{f(a)}$.
Now to use this relationship to recognize $df = A dA + dA A$ with the Jacobian computed from $\text{vec}(f(A))$.
We have $\text{vec}(A dA + dA A) = \text{vec}(A dA) + \text{vec}(dA A)$, by the linearity of $\text{vec}$. Now inserting an identity matrix, $I$, which is symmetric, in a useful spot we have:
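The step above rests on the identity $\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}(X)$, which `Julia`'s `kron` lets us check numerically; a sketch with arbitrary matrices:

```julia
using LinearAlgebra

A  = [1.0 2.0; 3.0 4.0]
dA = [0.1 0.2; 0.3 0.4]
Id = Matrix{Float64}(I, 2, 2)

# vec(A*dA*I) = (Iᵀ ⊗ A) vec(dA) and vec(I*dA*A) = (Aᵀ ⊗ I) vec(dA)
lhs = vec(A*dA + dA*A)
rhs = (kron(Id, A) + kron(A', Id)) * vec(dA)
@assert lhs ≈ rhs
```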
@@ -683,7 +683,7 @@ det(I + dA) - det(I)
## The adjoint method
The chain rule brings about a series of products. The adjoint method illustrated below, shows how to approach the computation of the series in a direction that minimizes the computational cost, illustrating why reverse mode is preferred to forward mode when a scalar function of several variables is considered.
The chain rule brings about a series of products. The adjoint method, illustrated by @BrightEdelmanJohnson and summarized below, shows how to approach the computation of the series in a direction that minimizes the computational cost, illustrating why reverse mode is preferred to forward mode when a scalar function of several variables is considered.
@BrightEdelmanJohnson consider the derivative of
@@ -778,9 +778,9 @@ Here $v$ can be solved for by taking adjoints (as before). Let $A = \partial h/\
## Second derivatives, Hessian
@CarlssonNikitinTroedssonWendt
We reference a theorem presented by [Carlsson, Nikitin, Troedsson, and Wendt](https://arxiv.org/pdf/2502.03070v1) for exposition with some modification
We reference a theorem presented by @CarlssonNikitinTroedssonWendt for exposition with some modification
::: {.callout-note appearance="minimal"}
Theorem 1. Let $f:X \rightarrow Y$, where $X,Y$ are finite-dimensional *inner product* spaces with elements in $R$. Suppose $f$ is smooth (has a sufficient number of derivatives). Then for each $x$ in $X$ there exists a unique linear operator, $f'(x)$, and a unique *bilinear* *symmetric* operator $f'': X \oplus X \rightarrow Y$ such that
@@ -804,7 +804,7 @@ $$
\begin{align*}
f(x + dx) &= f(x) +
\frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2\\
&+ \frac{1}{2}\left(
&{+} \frac{1}{2}\left(
\frac{\partial^2 f}{\partial x_1^2}dx_1^2 +
\frac{\partial^2 f}{\partial x_1 \partial x_2}dx_1dx_2 +
\frac{\partial^2 f}{\partial x_2^2}dx_2^2
@@ -832,7 +832,7 @@ $$
$H$ being the *Hessian* with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
This formula -- $f(x+dx)-f(x) \approx f'(x)dx + dx^T H dx$ -- is valid for any $n$, showing $n=2$ was just for ease of notation when expressing in the coordinates and not as matrices.
This formula---$f(x+dx)-f(x) \approx f'(x)dx + dx^T H dx$---is valid for any $n$, showing $n=2$ was just for ease of notation when expressing in the coordinates and not as matrices.
By uniqueness, we have under these assumptions that the Hessian is *symmetric* and the expression $dx^T H dx$ is a *bilinear* form, which we can identify as $f''(x)[dx,dx]$.
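The symmetry of the Hessian can be checked against a central-difference approximation; a sketch on an illustrative smooth function (the function and step size are arbitrary choices):

```julia
# central-difference Hessian of a scalar function f: Rⁿ → R
function fd_hessian(f, x; h = 1e-4)
    n = length(x)
    H = zeros(n, n)
    for i in 1:n, j in 1:n
        ei = zeros(n); ei[i] = h
        ej = zeros(n); ej[j] = h
        # mixed central difference for ∂²f/∂xᵢ∂xⱼ
        H[i, j] = (f(x + ei + ej) - f(x + ei - ej) -
                   f(x - ei + ej) + f(x - ei - ej)) / (4h^2)
    end
    H
end

f(x) = x[1]^3 * x[2] + sin(x[1] * x[2])
x = [0.5, 0.7]
H = fd_hessian(f, x)

# the Hessian is symmetric ...
@assert isapprox(H, H'; atol = 1e-6)
# ... and the mixed entry matches the hand-computed ∂²f/∂x₁∂x₂
mixed = 3 * x[1]^2 + cos(x[1]*x[2]) - x[1]*x[2]*sin(x[1]*x[2])
@assert isapprox(H[1, 2], mixed; atol = 1e-5)
```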
@@ -909,24 +909,23 @@ $$
&= \left(
\text{det}(A) + \text{det}(A)\text{tr}(A^{-1}dA')
\right)
\text{tr}((A^{-1} - A^{-1}dA' A^{-1})dA) - \text{det}(A) \text{tr}(A^{-1}dA) \\
\text{tr}((A^{-1} - A^{-1}dA' A^{-1})dA)\\
&\quad{-} \text{det}(A) \text{tr}(A^{-1}dA) \\
&=
\text{det}(A) \text{tr}(A^{-1}dA)\\
&+ \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) \\
&- \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&- \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA' A^{-1}dA)\\
&- \text{det}(A) \text{tr}(A^{-1}dA) \\
\textcolor{blue}{\text{det}(A) \text{tr}(A^{-1}dA)}\\
&\quad{+} \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) \\
&\quad{-} \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&\quad{-} \textcolor{red}{\text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA' A^{-1}dA)}\\
&\quad{-} \textcolor{blue}{\text{det}(A) \text{tr}(A^{-1}dA)} \\
&= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) - \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&+ \text{third order term}
&\quad{+} \textcolor{red}{\text{third order term}}
\end{align*}
$$
So, after dropping the third-order term, we see:
$$
\begin{align*}
f''(A)[dA,dA']
&= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA)\\
&\quad - \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA).
\end{align*}
= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) -
\text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA).
$$
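This bilinear form can be sanity-checked numerically: the mixed partial of $g(s,t) = \det(A + s\,dA + t\,dA')$ at $(0,0)$ should equal $f''(A)[dA,dA']$. A sketch with arbitrary matrices, where `dB` plays the role of $dA'$:

```julia
using LinearAlgebra

A  = [2.0 1.0; 0.5 3.0]
dA = [0.3 0.1; 0.2 0.4]
dB = [0.2 0.5; 0.1 0.3]   # plays the role of dA′

# closed form: f″(A)[dA, dA′] = det(A)(tr(A⁻¹dA′)tr(A⁻¹dA) − tr(A⁻¹dA′A⁻¹dA))
Ainv = inv(A)
closed = det(A) * (tr(Ainv*dB) * tr(Ainv*dA) - tr(Ainv*dB*Ainv*dA))

# central-difference mixed partial of g(s, t) = det(A + s*dA + t*dB) at (0, 0)
g(s, t) = det(A + s*dA + t*dB)
h = 1e-4
fd = (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4h^2)

@assert isapprox(closed, fd; atol = 1e-5)
```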