em dash; sentence case

This commit is contained in:
jverzani
2025-07-27 15:26:00 -04:00
parent c3b221cd29
commit 33c6e62d68
59 changed files with 385 additions and 243 deletions


@@ -1,4 +1,4 @@
# Matrix Calculus
# Matrix calculus
This section illustrates a more general setting for taking derivatives, one that unifies the different expositions taken previously.
@@ -74,7 +74,7 @@ Additionally, many other set of objects form vector spaces. Certain families of
Let's take differentiable functions as an example. These form a vector space as the derivative of a linear combination of differentiable functions is defined through the simplest derivative rule: $[af(x) + bg(x)]' = a[f(x)]' + b[g(x)]'$. If $f$ and $g$ are differentiable, then so is $af(x)+bg(x)$.
A finite vector space is described by a *basis* -- a minimal set of vectors needed to describe the space, after consideration of linear combinations. For some typical vector spaces, this is the set of special vectors with $1$ as one of the entries, and $0$ otherwise.
A finite vector space is described by a *basis*---a minimal set of vectors needed to describe the space, after consideration of linear combinations. For some typical vector spaces, this is the set of special vectors with $1$ as one of the entries, and $0$ otherwise.
A key fact about a basis for a finite vector space is that every vector in the vector space can be expressed *uniquely* as a linear combination of the basis vectors. The set of numbers used in the linear combination, along with an ordering of the basis, means an element in a finite vector space can be associated with a unique coordinate vector.
@@ -88,7 +88,7 @@ Vectors and matrices have properties that are generalizations of the real number
* A vector may be viewed as a matrix. The association chosen here is the common one, through a *column* vector.
* The *transpose* of a matrix comes by permuting the rows and columns. The transpose of a column vector is a row vector, so $v\cdot w = v^T w$, where we use a superscript $T$ for the transpose. The transpose of a product, is the product of the transposes -- reversed: $(AB)^T = B^T A^T$; the tranpose of a transpose is an identity operation: $(A^T)^T = A$; the inverse of a transpose is the tranpose of the inverse: $(A^{-1})^T = (A^T)^{-1}$.
* The *transpose* of a matrix comes by permuting the rows and columns. The transpose of a column vector is a row vector, so $v\cdot w = v^T w$, where we use a superscript $T$ for the transpose. The transpose of a product is the product of the transposes---reversed: $(AB)^T = B^T A^T$; the transpose of a transpose is an identity operation: $(A^T)^T = A$; the inverse of a transpose is the transpose of the inverse: $(A^{-1})^T = (A^T)^{-1}$.
* Matrices for which $A = A^T$ are called symmetric.
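These transpose identities can be checked numerically in `Julia`; a minimal sketch with arbitrarily chosen matrices:

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
v = [1, 2]
w = [3, 4]

# the dot product as a row vector times a column vector: v ⋅ w = vᵀ w
@assert dot(v, w) == v' * w

# transpose of a product is the product of transposes, reversed
@assert (A*B)' == B' * A'

# transposing twice is the identity operation
@assert (A')' == A

# inverse of a transpose equals the transpose of the inverse
@assert inv(A') ≈ inv(A)'
```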
@@ -231,7 +231,7 @@ Various differentiation rules are still available such as the sum, product, and
### Sum and product rules for the derivative
Using the differential notation -- which implicitly ignores higher order terms as they vanish in a limit -- the sum and product rules can be derived.
Using the differential notation---which implicitly ignores higher order terms as they vanish in a limit---the sum and product rules can be derived.
For the sum rule, let $f(x) = g(x) + h(x)$. Then
@@ -377,7 +377,7 @@ Multiplying left to right (the first) is called reverse mode; multiplying right
The reason comes down to the shape of the matrices. To see why, we need to know that matrix multiplication of an $m \times q$ matrix by a $q \times n$ matrix takes on the order of $mqn$ operations.
When $m=1$, the derviative is a product of matrices of size $n\times j$, $j\times k$, and $k \times 1$ yielding a matrix of size $n \times 1$ matching the function dimension.
When $m=1$, the derivative is a product of matrices of size $n\times j$, $j\times k$, and $k \times 1$ yielding a matrix of size $n \times 1$ matching the function dimension.
The operations involved in multiplication from left to right can be quantified. The first multiplication takes $njk$ operations, leaving an $n\times k$ matrix; the next multiplication then takes another $nk\cdot 1$ operations, or $njk + nk$ in total.
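The two orders of multiplication can be tallied directly. A small sketch, where the sizes $n$, $j$, $k$ are illustrative choices:

```julia
# cost of multiplying an (m×q) matrix by a (q×n) matrix: m*q*n operations
matmul_cost(m, q, n) = m * q * n

n, j, k = 100, 50, 10    # chain of sizes (n×j)(j×k)(k×1)

# left to right ("reverse mode" in the text's labeling):
# (n×j)(j×k) costs njk, leaving n×k; then (n×k)(k×1) costs nk
left_to_right = matmul_cost(n, j, k) + matmul_cost(n, k, 1)   # njk + nk = 51000

# right to left ("forward mode"):
# (j×k)(k×1) costs jk, leaving j×1; then (n×j)(j×1) costs nj
right_to_left = matmul_cost(j, k, 1) + matmul_cost(n, j, 1)   # jk + nj = 5500
```

Which direction is cheaper depends on which end of the chain carries the dimension $1$; the counts above simply quantify the $njk + nk$ figure from the text.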
@@ -435,7 +435,7 @@ That is $f'(A)$ is the operator $f'(A)[\delta A] = A \delta A + \delta A A$. (Th
Alternatively, we can identify $A$ through its
components, as a vector in $R^{n^2}$ and then leverage the Jacobian.
One such identification is vectorization -- consecutively stacking the
One such identification is vectorization---consecutively stacking the
column vectors into a single vector. In `Julia` the `vec` function does this
operation:
@@ -444,7 +444,7 @@ operation:
vec(A)
```
The stacking by column follows how `Julia` stores matrices and how `Julia` references a matrices entries by linear index:
The stacking by column follows how `Julia` stores matrices and how `Julia` references entries in a matrix by linear index:
```{julia}
vec(A) == [A[i] for i in eachindex(A)]
@@ -562,7 +562,7 @@ all(l == r for (l, r) ∈ zip(L, R))
----
Now to use this relationship to recognize $df = A dA + dA A$ with the Jacobian computed from $\text{vec}{f(a)}$.
Now to use this relationship to recognize $df = A dA + dA A$ with the Jacobian computed from $\text{vec}(f(A))$.
We have $\text{vec}(A dA + dA A) = \text{vec}(A dA) + \text{vec}(dA A)$, by the linearity of $\text{vec}$. Now inserting an identity matrix, $I$, which is symmetric, in a useful spot we have:
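The step above rests on the identity $\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}(X)$, which `Julia`'s `kron` lets us check numerically; a sketch with arbitrary matrices:

```julia
using LinearAlgebra

A  = [1.0 2.0; 3.0 4.0]
dA = [0.1 0.2; 0.3 0.4]
Id = Matrix{Float64}(I, 2, 2)

# vec(A*dA*I) = (Iᵀ ⊗ A) vec(dA) and vec(I*dA*A) = (Aᵀ ⊗ I) vec(dA)
lhs = vec(A*dA + dA*A)
rhs = (kron(Id, A) + kron(A', Id)) * vec(dA)
@assert lhs ≈ rhs
```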
@@ -683,7 +683,7 @@ det(I + dA) - det(I)
## The adjoint method
The chain rule brings about a series of products. The adjoint method illustrated below, shows how to approach the computation of the series in a direction that minimizes the computational cost, illustrating why reverse mode is preferred to forward mode when a scalar function of several variables is considered.
The chain rule brings about a series of products. The adjoint method, illustrated by @BrightEdelmanJohnson and summarized below, shows how to approach the computation of the series in a direction that minimizes the computational cost, illustrating why reverse mode is preferred to forward mode when a scalar function of several variables is considered.
@BrightEdelmanJohnson consider the derivative of
@@ -778,9 +778,9 @@ Here $v$ can be solved for by taking adjoints (as before). Let $A = \partial h/\
## Second derivatives, Hessian
@CarlssonNikitinTroedssonWendt
We reference a theorem presented by [Carlsson, Nikitin, Troedsson, and Wendt](https://arxiv.org/pdf/2502.03070v1) for exposition with some modification
We reference a theorem presented by @CarlssonNikitinTroedssonWendt for exposition with some modification
::: {.callout-note appearance="minimal"}
Theorem 1. Let $f:X \rightarrow Y$, where $X,Y$ are finite-dimensional *inner product* spaces with elements in $R$. Suppose $f$ is smooth (has a sufficient number of derivatives). Then for each $x$ in $X$ there exists a unique linear operator, $f'(x)$, and a unique *bilinear* *symmetric* operator $f'': X \oplus X \rightarrow Y$ such that
@@ -804,7 +804,7 @@ $$
\begin{align*}
f(x + dx) &= f(x) +
\frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2\\
&+ \frac{1}{2}\left(
&{+} \frac{1}{2}\left(
\frac{\partial^2 f}{\partial x_1^2}dx_1^2 +
\frac{\partial^2 f}{\partial x_1 \partial x_2}dx_1dx_2 +
\frac{\partial^2 f}{\partial x_2^2}dx_2^2
@@ -832,7 +832,7 @@ $$
$H$ being the *Hessian* with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
This formula -- $f(x+dx)-f(x) \approx f'(x)dx + dx^T H dx$ -- is valid for any $n$, showing $n=2$ was just for ease of notation when expressing in the coordinates and not as matrices.
This formula---$f(x+dx)-f(x) \approx f'(x)dx + dx^T H dx$---is valid for any $n$, showing $n=2$ was just for ease of notation when expressing in the coordinates and not as matrices.
By uniqueness, we have under these assumptions that the Hessian is *symmetric* and the expression $dx^T H dx$ is a *bilinear* form, which we can identify as $f''(x)[dx,dx]$.
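The symmetry of the Hessian can be checked against a central-difference approximation; a sketch on an illustrative smooth function (the function and step size are arbitrary choices):

```julia
# central-difference Hessian of a scalar function f: Rⁿ → R
function fd_hessian(f, x; h = 1e-4)
    n = length(x)
    H = zeros(n, n)
    for i in 1:n, j in 1:n
        ei = zeros(n); ei[i] = h
        ej = zeros(n); ej[j] = h
        # mixed central difference for ∂²f/∂xᵢ∂xⱼ
        H[i, j] = (f(x + ei + ej) - f(x + ei - ej) -
                   f(x - ei + ej) + f(x - ei - ej)) / (4h^2)
    end
    H
end

f(x) = x[1]^3 * x[2] + sin(x[1] * x[2])
x = [0.5, 0.7]
H = fd_hessian(f, x)

# the Hessian is symmetric ...
@assert isapprox(H, H'; atol = 1e-6)
# ... and the mixed entry matches the hand-computed ∂²f/∂x₁∂x₂
mixed = 3 * x[1]^2 + cos(x[1]*x[2]) - x[1]*x[2]*sin(x[1]*x[2])
@assert isapprox(H[1, 2], mixed; atol = 1e-5)
```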
@@ -909,24 +909,23 @@ $$
&= \left(
\text{det}(A) + \text{det}(A)\text{tr}(A^{-1}dA')
\right)
\text{tr}((A^{-1} - A^{-1}dA' A^{-1})dA) - \text{det}(A) \text{tr}(A^{-1}dA) \\
\text{tr}((A^{-1} - A^{-1}dA' A^{-1})dA)\\
&\quad{-} \text{det}(A) \text{tr}(A^{-1}dA) \\
&=
\text{det}(A) \text{tr}(A^{-1}dA)\\
&+ \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) \\
&- \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&- \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA' A^{-1}dA)\\
&- \text{det}(A) \text{tr}(A^{-1}dA) \\
\textcolor{blue}{\text{det}(A) \text{tr}(A^{-1}dA)}\\
&\quad{+} \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) \\
&\quad{-} \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&\quad{-} \textcolor{red}{\text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA' A^{-1}dA)}\\
&\quad{-} \textcolor{blue}{\text{det}(A) \text{tr}(A^{-1}dA)} \\
&= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) - \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&+ \text{third order term}
&\quad{+} \textcolor{red}{\text{third order term}}
\end{align*}
$$
So, after dropping the third-order term, we see:
$$
\begin{align*}
f''(A)[dA,dA']
&= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA)\\
&\quad - \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA).
\end{align*}
= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) -
\text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA).
$$
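This bilinear form can be sanity-checked numerically: the mixed partial of $g(s,t) = \det(A + s\,dA + t\,dA')$ at $(0,0)$ should equal $f''(A)[dA,dA']$. A sketch with arbitrary matrices, where `dB` plays the role of $dA'$:

```julia
using LinearAlgebra

A  = [2.0 1.0; 0.5 3.0]
dA = [0.3 0.1; 0.2 0.4]
dB = [0.2 0.5; 0.1 0.3]   # plays the role of dA′

# closed form: f″(A)[dA, dA′] = det(A)(tr(A⁻¹dA′)tr(A⁻¹dA) − tr(A⁻¹dA′A⁻¹dA))
Ainv = inv(A)
closed = det(A) * (tr(Ainv*dB) * tr(Ainv*dA) - tr(Ainv*dB*Ainv*dA))

# central-difference mixed partial of g(s, t) = det(A + s*dA + t*dB) at (0, 0)
g(s, t) = det(A + s*dA + t*dB)
h = 1e-4
fd = (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4h^2)

@assert isapprox(closed, fd; atol = 1e-5)
```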