This commit is contained in:
jverzani
2025-04-16 14:31:16 -04:00
parent d56705e09b
commit 30be930f0f
9 changed files with 2576 additions and 16 deletions


@@ -79,7 +79,7 @@ R(0) &= 0\\
\end{align*}
$$
In `Julia` we define these parameter values, with `N` modeling the total population and `u0` representing the proportions.
```{julia}
@@ -94,7 +94,7 @@ An *estimated* set of values for $k$ and $b$ are $k=1/3$, coming from the averag
Okay, the mathematical modeling is done; now we try to solve for the unknown functions using `DifferentialEquations`.
To warm up, if $b=0$ then $i'(t) = -k \cdot i(t)$ describes the infected. (There is no circulation of people in this case.) This is a single ODE. The solution would be achieved through:
```{julia}
@@ -102,10 +102,12 @@ To warm up, if $b=0$ then $i'(t) = -k \cdot i(t)$ describes the infected. (The
k = 1/3
f(u,p,t) = -k * u # solving u(t) = - k u(t)
uᵢ0= I0/N
time_span = (0.0, 20.0)
prob = ODEProblem(f, uᵢ0, time_span)
sol = solve(prob, Tsit5(); reltol=1e-8, abstol=1e-8)
plot(sol)
```
@@ -120,7 +122,7 @@ $$
\frac{di}{dt} = -k \cdot i(t) = F(i(t), k, t)
$$
where $F$ depends on the current value ($i$), a parameter ($k$), and the time ($t$). We did not utilize $p$ above for the parameter, as it was easy not to, but could have, and will in the following. The time variable $t$ does not appear by itself in our equation, so only `f(u, p, t) = -k * u` was used, `u` the generic name for a solution which in this case was labeled with an $i$.
The problem we set up needs an initial value (the $u0$) and a time span to solve over. Here we want time to model real time, so use floating point values.
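For this warm-up equation the exact solution is $i(t) = i_0 e^{-kt}$, so any numeric approach can be checked against it. A hand-rolled sketch (a crude forward-Euler march, not the `DifferentialEquations` code used in the text; the function name is made up):

```{julia}
# Sketch: forward-Euler solve of u'(t) = -k*u(t), checked against the
# exact solution u(t) = u0*exp(-k*t). Illustrative only, not the text's code.
function euler_decay(u0, k, T; dt = 1e-4)
    u = u0
    for _ in 1:round(Int, T / dt)
        u += dt * (-k * u)   # one Euler step of u' = -k*u
    end
    u
end

k, u0, T = 1/3, 1/100, 20.0
approx = euler_decay(u0, k, T)
exact = u0 * exp(-k * T)
abs(approx - exact)          # tiny: Euler agrees with the closed form
```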
@@ -167,7 +169,7 @@ The `sir!` function has the trailing `!` indicating by convention it *mu
:::
With the update function defined, the problem is set up and a solution is found in the same manner as before:
```{julia}
@@ -193,7 +195,7 @@ p = (k=1/2, b=2) # change b from 1/2 to 2 -- more daily contact
prob = ODEProblem(sir!, u0, time_span, p)
sol = solve(prob, Tsit5())
plot(sol; legend=:right)
```
The graphs are somewhat similar, but the steady state is reached much more quickly and nearly everyone becomes infected.
@@ -252,7 +254,7 @@ end
p
```
(A 3-dimensional graph with `plotly` or `Makie` can have its viewing angle adjusted with the mouse. When looking down on the $x-y$ plane, whose axes encode `b` and `k`, we can see the rapid growth along a line related to $b/k$.)
Smith and Moore point out that $k$ is roughly the reciprocal of the number of days an individual is sick enough to infect others. This can be estimated during a breakout. However, they go on to note that there is no direct way to observe $b$, but there is an indirect way.
@@ -382,7 +384,7 @@ SOL = solve(trajectory_problem, Tsit5(); p = ps, callback=cb)
plot(t -> SOL(t)[1], t -> SOL(t)[2], TSPAN...; legend=false)
```
Finally, we note that the `ModelingToolkit` package provides symbolic-numeric computing. This allows the equations to be set up symbolically, as has been illustrated with `SymPy`, before being passed off to `DifferentialEquations` to solve numerically. The above example with no wind resistance could be translated into the following:
```{julia}


@@ -118,10 +118,10 @@ function solve(prob::Problem, alg::EulerMethod)
end
```
The post has a more elegant means to unpack the parameters from the structures, but for each of the above, the parameters are unpacked using the dot notation for `getproperty`, and then the corresponding algorithm employed. As of version `v1.7` of `Julia`, the syntax `(;g,y0,v0,tspan) = prob` could also have been employed.
The exact answers, `y(t) = y0 + v0*(t - t0) - g*(t - t0)^2/2` and `v(t) = v0 - g*(t - t0)`, follow from well-known physics formulas for constant-acceleration motion. Each answer is wrapped in a `Solution` type so that the answers found can be easily extracted in a uniform manner.
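As a quick sanity check, the two exact formulas are consistent with each other: $v$ is the derivative of $y$. A standalone sketch using a central difference (the parameter values here are arbitrary, not the post's):

```{julia}
# Check that v(t) = y'(t) for the constant-acceleration formulas.
g, y0, v0, t0 = 9.81, 0.0, 10.0, 0.0
y(t) = y0 + v0 * (t - t0) - g * (t - t0)^2 / 2
v(t) = v0 - g * (t - t0)

t, h = 1.2, 1e-6
dy = (y(t + h) - y(t - h)) / (2h)   # central difference, exact for quadratics
abs(dy - v(t))                       # ≈ 0, up to roundoff
```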
For example, plots of each can be obtained through:
@@ -138,7 +138,9 @@ plot!(sol_exact.t, sol_exact.y; label="exact solution", ls=:auto)
title!("On the Earth"; xlabel="t", legend=:bottomleft)
```
Following the post, since the time step `dt = 0.1` is not small enough, the error of the Euler method is readily identified.
Next we change the algorithm parameter, `dt`, to be smaller:
```{julia}
@@ -155,7 +157,7 @@ title!("On the Earth"; xlabel="t", legend=:bottomleft)
It is worth noting that only the first line is modified, and only the method requires modification.
Were the moon to be considered, the gravitational constant would need adjustment. This parameter is a property of the problem, not the solution algorithm, as `dt` is.
Such adjustments are made by passing different values to the `Problem` constructor:
@@ -175,7 +177,9 @@ title!("On the Moon"; xlabel="t", legend=:bottomleft)
The code above also adjusts the time span in addition to the gravitational constant. The algorithm for the exact formula is set to use the `dt` value from the Euler method, for easier comparison. Otherwise, outside of the labels, the patterns are the same. Only those things that need changing are changed; the rest comes from defaults.
The above shows the benefits of using a common interface.
Next, the post illustrates how *other* authors could extend this code, simply by adding a *new* `solve` method. For example, a symplectic method conserves a quantity, so it can track long-term evolution without drift.
```{julia}


@@ -4,12 +4,22 @@ Short cut. Run first command until happy, then run second to publish
```
quarto render
#julia adjust_plotly.jl # <-- no longer needed
# maybe git config --global http.postBuffer 157286400
quarto publish gh-pages --no-render
```
But better to
```
quarto render
# commit changes and push
# fix typos
quarto render
quarto publish gh-pages --no-render
```
To compile the pages through quarto


@@ -1,4 +1,4 @@
version: "0.24"
engine: julia
project:


@@ -839,6 +839,311 @@ Taking $\partial/\partial{a_i}$ gives equations $2a_i\sigma_i^2 + \lambda = 0$,
For the special case of a common variance, $\sigma_i=\sigma$, the above simplifies to $a_i = 1/n$ and the estimator is $\sum X_i/n$, the familiar sample mean, $\bar{X}$.
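The optimality of these weights is easy to spot-check numerically. A minimal sketch (the variances and the perturbation below are arbitrary choices): among weights summing to $1$, the normalized inverse variances minimize $\sum a_i^2\sigma_i^2$.

```{julia}
# Spot check: weights aᵢ ∝ 1/σᵢ² minimize Σ aᵢ²σᵢ² among weights summing to 1.
σ = [1.0, 2.0, 3.0]                       # arbitrary standard deviations
a_opt = (1 ./ σ.^2) ./ sum(1 ./ σ.^2)     # normalized inverse variances
variance(a) = sum(a.^2 .* σ.^2)

# perturb the optimal weights, keeping the sum equal to 1
a_alt = a_opt .+ [0.05, -0.02, -0.03]
variance(a_opt) < variance(a_alt)         # the optimal weights do better
```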
##### Example: The mean value theorem
[Perturbing the Mean Value Theorem: Implicit Functions, the Morse Lemma, and Beyond](https://www.jstor.org/stable/48661587) by Lowry-Duda and Wheeler presents an interesting take on the mean value theorem by asking: if the endpoint $b$ moves continuously, does the value $c$ move continuously?
Fix the left-hand endpoint, $a_0$, and consider:
$$
F(b,c) = \frac{f(b) - f(a_0)}{b-a_0} - f'(c).
$$
Solutions to $F(b,c)=0$ satisfy the mean value theorem for $f$.
Suppose $(b_0,c_0)$ is one such solution.
By using the implicit function theorem, the question of finding a $C(b)$ such that $C$ is continuous near $b_0$ and satisfies $F(b, C(b)) = 0$ for $b$ near $b_0$ can be characterized.
To analyze this question, Lowry-Duda and Wheeler fix a set of points $a_0 = 0$, $b_0=3$ and consider functions $f$ with $f(a_0) = f(b_0) = 0$. Similar to how Rolle's theorem easily proves the mean value theorem, this choice imposes no loss of generality.
Suppose further that $c_0 = 1$, where $c_0$ solves the mean value theorem:
$$
f'(c_0) = \frac{f(b_0) - f(a_0)}{b_0 - a_0}.
$$
Again, this is no loss of generality. By construction $(b_0, c_0)$ is a zero of the just defined $F$.
We are interested in the shape of the level set $F(b,c) = 0$ which reveals other solutions $(b,c)$. For a given $f$, a contour plot, with $b>c$, can reveal this shape.
To find a source of examples for such functions, polynomials are considered, beginning with these constraints:
$$
f(a_0) = 0, f(b_0) = 0, f(c_0) = 1, f'(c_0) = 0
$$
With four conditions, we might guess a cubic polynomial with four unknown coefficients should fit. We use `SymPy` to identify the coefficients.
```{julia}
a₀, b₀, c₀ = 0, 3, 1
@syms x
@syms a[0:3]
p = sum(aᵢ*x^(i-1) for (i,aᵢ) ∈ enumerate(a))
dp = diff(p,x)
p, dp
```
The constraints are specified as follows; `solve` has no issue with this system of equations.
```{julia}
eqs = (p(x=>a₀) ~ 0,
p(x=>b₀) ~ 0,
p(x=>c₀) ~ 1,
dp(x=>c₀) ~ 0)
d = solve(eqs, a)
q = p(d...)
```
We can plot $q$ and emphasize the three points with:
```{julia}
xlims = (-0.5, 3.5)
plot(q; xlims, legend=false)
scatter!([a₀, b₀, c₀], [0,0,1]; marker=(5, 0.25))
```
We now make a plot of the level curve $F(x,y)=0$ using `contour` and the constraint that $b>c$ to graphically identify $C(b)$:
```{julia}
dq = diff(q, x)
λ(b,c) = b > c ? (q(b) - q(a₀)) / (b - a₀) - dq(c) : -Inf
bs = cs = range(0.5,3.5, 100)
plot(; legend=false)
contour!(bs, cs, λ; levels=[0])
plot!(identity; line=(1, 0.25))
scatter!([b₀], [c₀]; marker=(5, 0.25))
```
The curve that passes through the point $(3,1)$ is clearly continuous, and following it, we see continuous changes in $b$ result in continuous changes in $c$.
Following a behind-the-scenes blog post by [Lowry-Duda](https://davidlowryduda.com/choosing-functions-for-mvt-abscissa/) we wrap some of the above into a function that finds a polynomial given a set of conditions on the values of the polynomial or its derivatives at specified points.
```{julia}
function _interpolate(conds; x=x)
np1 = length(conds)
n = np1 - 1
as = [Sym("a$i") for i in 0:n]
p = sum(as[i+1] * x^i for i in 0:n)
# set p⁽ᵏ⁾(xᵢ) = v
eqs = Tuple(diff(p, x, k)(x => xᵢ) ~ v for (xᵢ, k, v) ∈ conds)
soln = solve(eqs, as)
p(soln...)
end
# sets p⁽⁰⁾(a₀) = 0, p⁽⁰⁾(b₀) = 0, p⁽⁰⁾(c₀) = 1, p⁽¹⁾(c₀) = 0
basic_conditions = [(a₀,0,0), (b₀,0,0), (c₀,0,1), (c₀,1,0)]
_interpolate(basic_conditions; x)
```
Before moving on, we note that polynomial interpolation can suffer from the Runge phenomenon, where there can be severe oscillations between the points. To tamp these down, an additional *control* point is added and adjusted to minimize the size of the derivative through the value $\int f'(x)^2 dx$ (the squared $L_2$ norm of the derivative):
```{julia}
function interpolate(conds)
@syms x, D
# set f'(2) = D, then adjust D to minimize L₂ below
new_conds = vcat(conds, [(2, 1, D)])
p = _interpolate(new_conds; x)
# measure size of p with ∫₀⁴f'(x)^2 dx
dp = diff(p, x)
L₂ = integrate(dp^2, (x, 0, 4))
dL₂ = diff(L₂, D)
soln = first(solve(dL₂ ~ 0, D)) # critical point minimizing L₂
p(D => soln)
end
q = interpolate(basic_conditions)
```
We also make a plotting function to show both `q` and the level curve of `F`:
```{julia}
function plot_q_level_curve(q; title="", layout=[1;1])
x = only(free_symbols(q)) # fish out x
dq = diff(q, x)
xlims = ylims = (-0.5, 4.5)
p₁ = plot(; xlims, ylims, title,
legend=false, aspect_ratio=:equal)
plot!(p₁, q; xlims, ylims)
scatter!(p₁, [a₀, b₀, c₀], [0,0,1]; marker=(5, 0.25))
λ(b,c) = b > c ? (q(b) - q(a₀)) / (b - a₀) - dq(c) : -Inf
bs = cs = range(xlims..., 100)
p₂ = plot(; xlims, ylims, legend=false, aspect_ratio=:equal)
contour!(p₂, bs, cs, λ; levels=[0])
plot!(p₂, identity; line=(1, 0.25))
scatter!(p₂, [b₀], [c₀]; marker=(5, 0.25))
plot(p₁, p₂; layout)
end
```
```{julia}
plot_q_level_curve(q; layout=(1,2))
```
As before, this highlights the presence of a continuous function of $b$ yielding $c$.
This is not the only possibility. Another such from their paper (Figure 3) looks like the following where some additional constraints are added ($f''(c_0) = 0, f'''(c_0)=3, f'(b_0)=-3$):
```{julia}
new_conds = [(c₀, 2, 0), (c₀, 3, 3), (b₀, 1, -3)]
q = interpolate(vcat(basic_conditions, new_conds))
plot_q_level_curve(q;layout=(1,2))
```
For this shape, if $b$ increases away from $b_0$, the secant line connecting $(a_0,0)$ and $(b, f(b))$ will have a negative slope, but there are no points near $x=c_0$ where the tangent line has negative slope, so the continuous function is defined only to the left of $b_0$. Mathematically, $f$ is increasing near $c_0$ (as $f'''(c_0) = 3 > 0$) and decreasing at $b_0$ (as $f'(b_0) = -3 < 0$); the signs alone suggest the scenario. The contour plot reveals not one, but two one-sided functions of $b$ giving $c$.
----
Now to characterize all possibilities.
Suppose $F(x,y)$ is differentiable. Then $F(x,y)$ has this approximation (where $F_x$ and $F_y$ are the partial derivatives):
$$
F(x,y) \approx F(x_0,y_0) + F_x(x_0,y_0) (x - x_0) + F_y(x_0,y_0) (y-y_0)
$$
If $(x_0,y_0)$ is a zero of $F$, then the above can be solved for $y$ assuming $F_y$ does not vanish:
$$
y \approx y_0 - \frac{F_x(x_0, y_0)}{F_y(x_0, y_0)} \cdot (x - x_0)
$$
The main tool used in the authors' investigation is the implicit function theorem. The implicit function theorem states there is some function continuously describing $y$, not just approximately, under the above assumption of $F_y$ not vanishing.
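The theorem's content can be illustrated numerically with a generic $F$; the unit circle below is an arbitrary stand-in, not the paper's $F$. Where $F_y \neq 0$ at a zero, a Newton iteration in $y$ alone traces out the continuous function $Y(x)$ with $F(x, Y(x)) = 0$:

```{julia}
# Sketch: an implicitly defined y(x) recovered by Newton's method in y.
F(x, y) = x^2 + y^2 - 1            # zero set: the unit circle
Fy(x, y) = 2y                      # ∂F/∂y, nonzero near (0, 1)

function implicit_y(x; y = 1.0)    # start near the known zero (0, 1)
    for _ in 1:50
        y -= F(x, y) / Fy(x, y)    # Newton step in y only
    end
    y
end

xs = -0.5:0.1:0.5
ys = [implicit_y(x) for x in xs]
maximum(abs, ys .- sqrt.(1 .- xs.^2))   # ≈ 0: matches the explicit branch
```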
Again, with $F(b,c) = (f(b) - f(a_0)) / (b -a_0) - f'(c)$ and assuming $f$ has at least two continuous derivatives, then:
$$
\begin{align*}
F(b_0,c_0) &= 0,\\
F_c(b_0, c_0) &= -f''(c_0).
\end{align*}
$$
Assuming $f''(c_0)$ is *non*-zero, this proves that if $b$ moves continuously, a corresponding solution to the mean value theorem will move continuously as well; that is, there is a continuous function $C(b)$ with $F(b,C(b)) = 0$.
Further, they establish if $f'(b_0) \neq f'(c_0)$ then there is a continuous $B(c)$ near $c_0$ such that $F(B(c),c) = 0$; and that there are no other nearby solutions to $F(b,c)=0$ near $(b_0, c_0)$.
This leaves for consideration the possibilities when $f''(c_0) = 0$ and $f'(b_0) = f'(c_0)$.
One such possibility looks like:
```{julia}
new_conds = [(c₀, 2, 0), (c₀, 3, 3), (b₀, 1, 0), (b₀, 2, 3)]
q = interpolate(vcat(basic_conditions, new_conds))
plot_q_level_curve(q;layout=(1,2))
```
This picture shows more than one possible choice for a continuous function, as the contour plot has this looping intersection point at $(b_0,c_0)$.
To characterize possible behaviors, the authors recall the [Morse lemma](https://en.wikipedia.org/wiki/Morse_theory) applied to functions $f:R^2 \rightarrow R$ with vanishing gradient, but non-vanishing Hessian. This states that after some continuous change of coordinates, $f$ looks like $\pm u^2 \pm v^2$. Only this one-dimensional Morse lemma (and a generalization) is required for this analysis:
> if $g(x)$ is three-times continuously differentiable with $g(x_0) = g'(x_0) = 0$ but $g''(x_0) \neq 0$ then *near* $x_0$ $g(x)$ can be transformed through a continuous change of coordinates to look like $\pm u^2$, where the sign is the sign of the second derivative of $g$.
That is, locally the function can be continuously transformed into a parabola opening up or down depending on the sign of the second derivative. Their proof starts with Taylor's remainder theorem to find a candidate for the change of coordinates and shows with the implicit function theorem this is a viable change.
Setting:
$$
\begin{align*}
g_1(b) &= (f(b) - f(a_0))/(b - a_0) - f'(c_0)\\
g_2(c) &= f'(c) - f'(c_0).
\end{align*}
$$
Then $F(b, c) = g_1(b) - g_2(c)$.
By construction, $g_2(c_0) = 0$ and $g_2^{(k)}(c_0) = f^{(k+1)}(c_0)$.
Adjusting $f$ to have a vanishing second -- but not third -- derivative at $c_0$ means $g_2$ will satisfy the assumptions of the lemma assuming $f$ has at least four continuous derivatives (as all our example polynomials do).
As for $g_1$, we have by construction $g_1(b_0) = 0$. By differentiation we get a pattern, for some constants $c_j = (j+1)(j+2)\cdots k$ with $c_k = 1$:
$$
g_1^{(k)}(b) = k! \cdot \frac{f(a_0) - f(b)}{(a_0-b)^{k+1}} - \sum_{j=1}^k c_j \frac{f^{(j)}(b)}{(a_0 - b)^{k-j+1}}.
$$
Of note, when $f(a_0) = f(b_0) = 0$, if $f^{(k)}(b_0)$ is the first non-vanishing derivative of $f$ at $b_0$, then $g_1^{(k)}(b_0) = f^{(k)}(b_0)/(b_0 - a_0)$ (the two have the same sign).
In particular, if $f(a_0) = f(b_0) = 0$ and $f'(b_0)=0$ and $f''(b_0)$ is non-zero, the lemma applies to $g_1$, again assuming $f$ has at least four continuous derivatives.
Let $\sigma_1 = \text{sign}(f''(b_0))$ and $\sigma_2 = \text{sign}(f'''(c_0))$, then we have $F(b,c) = \sigma_1 u^2 - \sigma_2 v^2$ after some change of variables. The authors conclude:
* If $\sigma_1$ and $\sigma_2$ have different signs, then $F(b,c) = 0$ is like $u^2 = -v^2$, which has only one isolated solution, as the left-hand side and right-hand side have different signs except when both are $0$.
* If $\sigma_1$ and $\sigma_2$ have the same sign, then $F(b,c) = 0$ is like $u^2 = v^2$ which has two solutions $u = \pm v$.
Applied to the problem at hand:
* if $f''(b_0)$ and $f'''(c_0)$ have different signs, then $c_0$ cannot be extended to a continuous function near $b_0$.
* if the two have the same sign, then there are two such functions possible.
```{julia}
conds₁ = [(b₀,1,0), (b₀,2,3), (c₀,2,0), (c₀,3,-3)]
conds₂ = [(b₀,1,0), (b₀,2,3), (c₀,2,0), (c₀,3, 3)]
q₁ = interpolate(vcat(basic_conditions, conds₁))
q₂ = interpolate(vcat(basic_conditions, conds₂))
p₁ = plot_q_level_curve(q₁)
p₂ = plot_q_level_curve(q₂)
plot(p₁, p₂; layout=(1,2))
```
There are more possibilities, as pointed out in the article.
Say a function $h$ has *a zero of order $k$ at $x_0$* if the first $k-1$ derivatives of $h$ are zero at $x_0$, but $h^{(k)}(x_0) \neq 0$. Now suppose $f$ has a zero of order $k$ at $b_0$ and of order $l$ at $c_0$. Then $g_1$ will have a zero of order $k$ at $b_0$ and $g_2$ will have a zero of order $l-1$ at $c_0$. In the above, we had orders $2$ and $3$ respectively.
A generalization of the Morse lemma to a function $h$ with a zero of order $k$ at $x_0$ says that, after a continuous change of coordinates, $h$ looks like $\pm u^k$, where if $k$ is odd either sign is possible, and if $k$ is even the sign is that of $h^{(k)}(x_0)$.
With this, we get the following possibilities for $f$ with a zero of order $k$ at $b_0$ and $l$ at $c_0$:
* If $l$ is even, then there is one continuous solution near $(b_0,c_0)$.
* If $l$ is odd and $k$ is even and $f^{(k)}(b_0)$ and $f^{(l)}(c_0)$ have the *same* sign, then there are two continuous solutions.
* If $l$ is odd and $k$ is even and $f^{(k)}(b_0)$ and $f^{(l)}(c_0)$ have *opposite* signs, then $(b_0, c_0)$ is an isolated solution.
* If $l$ is odd and $k$ is odd, then there are two continuous solutions, but only defined in a one-sided neighborhood of $b_0$ where $f^{(k)}(b_0) f^{(l)}(c_0) (b - b_0) > 0$.
To visualize these four cases, we take $(l=2,k=1)$, $(l=3, k=2)$ (twice) and $(l=3, k=3)$.
```{julia}
condsₑ = [(c₀,2,3), (b₀,1,-3)]
condsₒₑ₊₊ = [(c₀,2,0), (c₀,3, 10), (b₀,1,0), (b₀,2,10)]
condsₒₑ₊₋ = [(c₀,2,0), (c₀,3,-20), (b₀,1,0), (b₀,2,20)]
condsₒₒ = [(c₀,2,0), (c₀,3,-20), (b₀,1,0), (b₀,2, 0), (b₀,3, 20)]
qₑ = interpolate(vcat(basic_conditions, condsₑ))
qₒₑ₊₊ = interpolate(vcat(basic_conditions, condsₒₑ₊₊))
qₒₑ₊₋ = interpolate(vcat(basic_conditions, condsₒₑ₊₋))
qₒₒ = interpolate(vcat(basic_conditions, condsₒₒ))
p₁ = plot_q_level_curve(qₑ; title = "(e,.)")
p₂ = plot_q_level_curve(qₒₑ₊₊; title = "(o,e,same)")
p₃ = plot_q_level_curve(qₒₑ₊₋; title = "(o,e,different)")
p₄ = plot_q_level_curve(qₒₒ; title = "(o,o)")
plot(p₁, p₂, p₃, p₄; layout=(1,4))
```
This handles most cases, but leaves the possibility of a function with infinitely many vanishing derivatives to consider. We steer the interested reader to the article for thoughts on that.
## Questions


@@ -87,3 +87,25 @@ edition = {Julia adaptation},
URL = {https://tobydriscoll.net/fnc-julia/frontmatter.html},
eprint = {https://epubs.siam.org/doi/pdf/10.1137/1.9781611975086}
}
## matrix calculus
@misc{BrightEdelmanJohnson,
title={Matrix Calculus (for Machine Learning and Beyond)},
author={Paige Bright and Alan Edelman and Steven G. Johnson},
year={2025},
eprint={2501.14787},
archivePrefix={arXiv},
primaryClass={math.HO},
url={https://arxiv.org/abs/2501.14787},
}
@misc{CarlssonNikitinTroedssonWendt,
title={The bilinear Hessian for large scale optimization},
author={Marcus Carlsson and Viktor Nikitin and Erik Troedsson and Herwig Wendt},
year={2025},
eprint={2502.03070},
archivePrefix={arXiv},
primaryClass={math.OC},
url={https://arxiv.org/abs/2502.03070},
}


@@ -0,0 +1,2 @@
[deps]
quarto_jll = "b7163347-bfae-5fd9-aba4-19f139889d78"

File diff suppressed because it is too large


@@ -0,0 +1,937 @@
# Matrix Calculus
::: {.callout-note}
## Based on Bright, Edelman, and Johnson's notes
This section samples material from the notes [Matrix Calculus (for Machine Learning and Beyond)](https://arxiv.org/abs/2501.14787) by Paige Bright, Alan Edelman, and Steven G. Johnson. These notes cover material taught in a course at MIT. Support materials for their course in `Julia` are available at [https://github.com/mitmath/matrixcalc/tree/main](https://github.com/mitmath/matrixcalc/tree/main). For more details and examples, please refer to the source.
:::
We have seen several "derivatives" of a function, based on the number of inputs and outputs. The first one was for functions $f: R \rightarrow R$.
Then $f$ has a derivative at $x$ if this limit exists
$$
\lim_{h \rightarrow 0}\frac{f(x + h) - f(x)}{h}.
$$
The derivative of the function at $x$ is this limit, for a given $x$. Common notation is:
$$
f'(x) = \frac{dy}{dx} = \lim_{h \rightarrow 0}\frac{f(x + h) - f(x)}{h}
$$
(when the limit exists).
This limit gets expressed in different ways:
* Linearization: write $f(x+\Delta x) - f(x) \approx f'(x)\Delta x$, where $\Delta x$ is a small displacement from $x$. The reason there isn't equality is the unwritten higher-order terms that vanish in the limit.
* Alternate limits. Another way of writing this is in terms of explicit smaller order terms:
$$
(f(x+h) - f(x)) - f'(x)h = \mathscr{o}(h),
$$
which means if we divide both sides by $h$ and take the limit, we will get $0$ on the right and the relationship on the left.
* Differential notation simply writes this as $dy = f'(x)dx$. More verbosely, we might write
$$
df = f(x+dx) - f(x) = f'(x) dx.
$$
Here $dx$ is a differential, made rigorous by a limit, which hides the higher order terms.
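The limit can be watched numerically; a small sketch with an arbitrary function ($\sin$, not an example from the notes) shows the difference quotient closing in on the derivative as $h$ shrinks:

```{julia}
# The forward difference quotient approaches f'(x) as h → 0.
f(x) = sin(x)
fp(x) = cos(x)                     # the known derivative, for comparison
x = 0.7
errs = [abs((f(x + h) - f(x)) / h - fp(x)) for h in (1e-1, 1e-2, 1e-3)]
issorted(errs; rev = true)         # true: the error shrinks with h
```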
In these notes the limit has been defined, with suitable modification, for functions of vectors (multiple values) with scalar or vector outputs.
For example, when $f: R \rightarrow R^m$ was a vector-valued function the derivative was defined similarly through a limit of $(f(t + \Delta t) - f(t))/{\Delta t}$, where each component needed to have a limit. This can be rewritten through $f(t + dt) - f(t) = f'(t) dt$, again using differentials to avoid the higher order terms.
When $f: R^n \rightarrow R$ is a scalar-valued function of a vector, differentiability was defined by a gradient existing with $f(c+h) - f(c) - \nabla{f}(c) \cdot h$ being $\mathscr{o}(\|h\|)$. In other words $df = f(c + dh) - f(c) = \nabla{f}(c) \cdot dh$. The gradient has the same shape as $c$, a column vector. If we take the row vector (e.g. $f'(c) = \nabla{f}(c)^T$) then again we see $df = f(c+dh) - f(c) = f'(c) dh$, where the last term uses matrix multiplication of a row vector times a column vector.
Finally, when $f:R^n \rightarrow R^m$, the Jacobian was defined and characterized by
$\| f(x + dx) - f(x) - J_f(x)dx \|$ being $\mathscr{o}(\|dx\|)$. Again, we can express this through $df = f(x + dx) - f(x) = f'(x)dx$ where $f'(x) = J_f(x)$.
In writing $df = f(x + dx) - f(x) = f'(x) dx$ generically, some underlying facts are left implicit: $dx$ has the same shape as $x$ (so the two can be added); $f'(x) dx$ may mean ordinary multiplication or matrix multiplication; and there is an underlying concept of distance and size, an absolute value or a norm, that makes the above rigorous.
Further, various differentiation rules apply such as the sum, product, and chain rule.
The @BrightEdelmanJohnson notes cover differentiation of functions in this uniform manner and then extend the form by treating derivatives as *linear operators*.
A [linear operator](https://en.wikipedia.org/wiki/Operator_(mathematics)) is a mathematical object which satisfies
$$
f[\alpha v + \beta w] = \alpha f[v] + \beta f[w],
$$
where $\alpha$ and $\beta$ are scalars, and $v$ and $w$, which are possibly not scalars, come from a *vector space*. Regular multiplication and matrix multiplication are familiar linear operations, but there are many others.
The referenced notes identify $f'(x) dx$ with $f'(x)[dx]$, the latter emphasizing $f'(x)$ acts on $dx$ and the notation is not commutative (e.g., it is not $dx f'(x)$).
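As a concrete sketch (the matrix here is an arbitrary choice), multiplication by a fixed matrix is a linear operator, and the defining identity can be checked directly:

```{julia}
# A fixed matrix acting by multiplication is a linear operator.
A = [1.0 2.0; 3.0 4.0]
op(v) = A * v                          # the operator: v ↦ A*v
v, w = [1.0, 0.0], [0.0, 1.0]
α, β = 2.0, -3.0
op(α*v + β*w) == α*op(v) + β*op(w)     # linearity holds
```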
Linear operators are related to vector spaces.
A [vector space](https://en.wikipedia.org/wiki/Vector_space) is a set of mathematical objects which can be added together and also multiplied by a scalar. Vectors of the same size, as previously discussed, are the typical example, with vector addition and scalar multiplication previously discussed topics. Matrices of the same size (and some subclasses) also form a vector space. Many other sets of objects form vector spaces as well, for example polynomial functions of degree $n$ or less, continuous functions, or functions with a certain number of derivatives.
Take differentiable functions as an example, then the simplest derivative rules $[af(x) + bg(x)]' = a[f(x)]' + b[g(x)]'$ show the linearity of the derivative in this setting. This linearity is different from how the derivative is a linear operator on $dx$.
A vector space is described by a *basis* -- a minimal set of vectors needed to describe the space, after consideration of linear combinations. For vectors in $R^n$, this is the set of special vectors with a $1$ in one entry and $0$ in the others.
A key fact about a basis is every vector in the vector space can be expressed *uniquely* as a linear combination of the basis vectors.
Vectors and matrices have properties that are generalizations of the real numbers. As vectors and matrices form vector spaces, the concept of addition of vectors and matrices is defined, as is scalar multiplication. Additionally, we have seen:
* The dot product between two vectors of the same length is defined easily ($v\cdot w = \Sigma_i v_i w_i$). It is coupled with the length as $\|v\|^2 = v\cdot v$.
* Matrix multiplication is defined for two properly sized matrices. If $A$ is $m \times k$ and $B$ is $k \times n$ then $AB$ is an $m\times n$ matrix with $(i,j)$ entry given by the dot product of the $i$th row of $A$ (viewed as a vector) and the $j$th column of $B$ (viewed as a vector). Matrix multiplication is associative but *not* commutative. (E.g. $(AB)C = A(BC)$, but $AB$ and $BA$ need not be equal, or even defined, as the shapes may not match up.)
* A square matrix $A$ has an *inverse* $A^{-1}$ if $AA^{-1} = A^{-1}A = I$, where $I$ is the identity matrix (a matrix which is zero except on its diagonal entries which are all $1$). Square matrices may or may not have an inverse. When they don't the matrix is called singular.
* Viewing a vector as a matrix is possible. The association is typically through a *column* vector.
* The transpose of a matrix comes by permuting the rows and columns. The transpose of a column vector is a row vector, so $v\cdot w = v^T w$, where we use a superscript $T$ for the transpose. The transpose of a product is the product of the transposes, reversed: $(AB)^T = B^T A^T$; the transpose of a transpose is an identity operation: $(A^T)^T = A$; the inverse of a transpose is the transpose of the inverse: $(A^{-1})^T = (A^T)^{-1}$.
* Matrices for which $A = A^T$ are called symmetric.
* A few of the operations on matrices are the transpose and the inverse. These return a matrix, when defined. There are also the determinant and the trace, which return a scalar from a matrix. The trace is just the sum of the diagonal; the determinant is more involved to compute, but was previously seen to have a relationship to the volume of a certain parallelepiped. There are a few other operations described in the following.
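Several of the facts above can be spot-checked in `Julia` with small, arbitrary matrices (a standalone sketch):

```{julia}
using LinearAlgebra
A = [2.0 1.0; 0.0 3.0]
B = [1.0 4.0; 5.0 6.0]
(A*B)' == B' * A'        # (AB)ᵀ = BᵀAᵀ
(A')' == A               # transposing twice is the identity
inv(A') ≈ inv(A)'        # (Aᵀ)⁻¹ = (A⁻¹)ᵀ
tr(A) == 2.0 + 3.0       # trace: sum of the diagonal
det(A) ≈ 2.0 * 3.0       # A is triangular, so det is the diagonal product
```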
## Scalar-valued functions of a vector
Suppose $f: R^n \rightarrow R$, a scalar-valued function of a vector. Then the directional derivative at $x$ in the direction $v$ was defined for a scalar $\alpha$ by:
$$
\frac{\partial}{\partial \alpha}f(x + \alpha v) \mid_{\alpha = 0} =
\lim_{\Delta\alpha \rightarrow 0} \frac{f(x + \Delta\alpha v) - f(x)}{\Delta\alpha}.
$$
This rate of change in the direction of $v$ can be expressed through the linear operator $f'(x)$ via
$$
f(x + d\alpha v) - f(x) = f'(x) [d\alpha v] = d\alpha f'(x)[v],
$$
using linearity to move the scalar part outside the $[]$. This connects the partial derivative at $x$ in the direction of $v$ with $f'(x)$:
$$
\frac{\partial}{\partial \alpha}f(x + \alpha v) \mid_{\alpha = 0} =
f'(x)[v].
$$
Not only does this give a connection in notation with the derivative, it naturally illustrates how the derivative as a linear operator can act on non-infinitesimal values.
Previously, we wrote $\nabla f \cdot v$ for the directional derivative, where the gradient is a column vector. The above uses the identification
$f' = (\nabla f)^T$.
For $f: R^n \rightarrow R$ we have
$$
df = f(x + dx) - f(x) = f'(x) [dx]
$$
is a scalar, so if $dx$ is a column vector, $f'(x)$ is a row vector with the same number of components (just as $\nabla f$ is a column vector with the same number of components).
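The identification $f'(x)[v] = \nabla f \cdot v$ can also be checked numerically; a small sketch with a hand-picked $f$ (not an example from the notes):

```{julia}
# Directional derivative along v vs. ∇f ⋅ v for a hand-picked f: R² → R.
f(x) = x[1]^2 + 3x[1] * x[2]
∇f(x) = [2x[1] + 3x[2], 3x[1]]        # gradient computed by hand
x, v = [1.0, 2.0], [0.5, -1.0]
h = 1e-6
dir = (f(x .+ h .* v) - f(x)) / h     # the difference quotient along v
exact = sum(∇f(x) .* v)               # ∇f ⋅ v
abs(dir - exact)                      # small: the two agree
```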
##### Examples
@BrightEdelmanJohnson include this example to show that the computation of derivatives using components can be avoided. Consider $f(x) = x^T A x$ where $x$ is a vector in $R^n$ and $A$ is an $n\times n$ matrix. Then $f: R^n \rightarrow R$ and its derivative can be computed:
$$
\begin{align*}
df &= f(x + dx) - f(x)\\
&= (x + dx)^T A (x + dx) - x^TAx \\
&= x^TAx + dx^TA x + x^TAdx + dx^T A dx - x^TAx\\
&= dx^TA x + x^TAdx \\
&= (dx^TAx)^T + x^TAdx \\
&= x^T A^T dx + x^T A dx\\
&= x^T(A^T + A) dx
\end{align*}
$$
The term $dx^T A dx$ is dropped, as it is higher order (it goes to zero faster), containing two $dx$ factors.
In the second-to-last step, an identity operation (taking the transpose of a scalar quantity) is used to simplify the algebra. Finally, as $df = f'(x)[dx]$, the identification $f'(x) = x^T(A^T+A)$ is made, or, taking transposes, $\nabla f = (A + A^T)x$.
Compare the elegance above, with the component version, even though simplified, it still requires a specification of the size to carry the following out:
```{julia}
using SymPy
@syms x[1:3]::real A[1:3, 1:3]::real
u = x' * A * x
grad_u = [diff(u, xi) for xi in x]
```
Compare to the formula for the gradient just derived:
```{julia}
grad_u_1 = (A + A')*x
```
The two are, of course, equal
```{julia}
all(a == b for (a,b) ∈ zip(grad_u, grad_u_1))
```
----
For $f: R^n \rightarrow R^m$, @BrightEdelmanJohnson give an example of computing the Jacobian without resorting to component-wise computations. Let $f(x) = Ax$ with $A$ an $m \times n$ matrix. It follows that
$$
\begin{align*}
df &= f(x + dx) - f(x)\\
&= A(x + dx) - Ax\\
&= Adx\\
&= f'(x) dx.
\end{align*}
$$
The Jacobian is the linear operator $A$ acting on $dx$.
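A small numeric check (our own example, not from the text) makes this concrete: building the Jacobian of $x \mapsto Ax$ column by column from difference quotients recovers $A$ itself.

```{julia}
# numeric check (ours): the finite-difference Jacobian of x ↦ Ax is A
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # an example 3×2 matrix
f(x) = A * x

x = [0.7, -0.3]
h = 1e-6
# one column of the Jacobian per standard basis direction
J = hcat([(f(x + h*e) - f(x)) / h for e in eachcol(Matrix(1.0I, 2, 2))]...)
isapprox(J, A; atol=1e-6)
```

As $f$ is linear, the difference quotient is exact up to roundoff.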
## Sum and product rules for the derivative
Using the differential notation -- which implicitly ignores higher order terms as they vanish in a limit -- the sum and product rules can be derived.
For the sum rule, let $f(x) = g(x) + h(x)$. Then
$$
\begin{align*}
df &= f(x + dx) - f(x) \\
&= \left(g(x+dx) + h(x+dx)\right) - \left(g(x) + h(x)\right)\\
&= \left(g(x + dx) - g(x)\right) + \left(h(x + dx) - h(x)\right)\\
&= g'(x)dx + h'(x) dx\\
&= \left(g'(x) + h'(x)\right) dx
\end{align*}
$$
Comparing with $df = f'(x)\,dx$, we get $f'(x) = g'(x) + h'(x)$.
The sum rule has the same derivation as for univariate, scalar functions; similarly for the product rule.
For the product rule, with $f(x) = g(x)h(x)$,
$$
\begin{align*}
df &= f(x + dx) - f(x) \\
&= g(x+dx)h(x + dx) - g(x) h(x)\\
&= \left(g(x) + g'(x)dx\right)\left(h(x) + h'(x) dx\right) - \left(g(x) h(x)\right) \\
&= g(x)h(x) + g'(x) dx h(x) + g(x) h'(x) dx + g'(x)dx h'(x) dx - g(x) h(x)\\
&= gh + dg h + gdh + dg dh - gh\\
&= dg h + gdh,
\end{align*}
$$
**after** dropping the higher-order term $dg\,dh$ and cancelling the $gh$ terms of opposite signs in the fourth row; the shorthand $dg = g'(x)dx$ and $dh = h'(x)dx$ is used in the last two rows.
##### Examples
These two rules can be used to re-derive the last two examples:
First, to differentiate $f(x) = x^TAx$:
$$
\begin{align*}
df &= dx^T (Ax) + x^T d(Ax) \\
&= x^T A^T dx + x^T A dx \\
&= x^T(A^T + A) dx
\end{align*}
$$
Again, the transpose of the scalar quantity $dx^TAx$ is taken to simplify the expression.
When $A^T = A$ ($A$ is symmetric) this simplifies to a more familiar looking $2x^TA$, but we see that this requires assumptions not needed in the scalar case.
Next, if $f(x) = Ax$ then
$$
df = (dA)x + A(dx) = 0x + A dx = A dx,
$$
$A$ being a constant here.
##### Example
@BrightEdelmanJohnson consider what in `Julia` is `.*`. That is the operation:
$$
v .* w =
\begin{bmatrix}
v_1w_1 \\
v_2w_2 \\
\vdots\\
v_nw_n
\end{bmatrix}
=
\begin{bmatrix}
v_1 & 0 & \cdots & 0 \\
0 & v_2 & \cdots & 0 \\
& & \vdots & \\
0 & 0 & \cdots & v_n
\end{bmatrix}
\begin{bmatrix}
w_1 \\
w_2 \\
\vdots\\
w_n
\end{bmatrix}
= \text{diag}(v) w.
$$
They compute the derivative of $f(x) = A(x .* x)$ for some fixed matrix $A$ of the proper size.
We can see that $d(x .* w) = d(\text{diag}(x) w) = d(\text{diag}(x))\, w + \text{diag}(x)\, dw = dx .* w + x .* dw$. So
$df = A(dx .* x + x .* dx) = 2A(x .* dx)$, as $.*$ is commutative by its definition. Writing this as $df = 2A(x .* dx) = 2A(\text{diag}(x)\, dx) = (2A\,\text{diag}(x))\, dx$, we identify $f'(x) = 2A\,\text{diag}(x)$.
This operation is called the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) and it extends to matrices and arrays.
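The identification $f'(x) = 2A\,\text{diag}(x)$ can be checked numerically. A minimal sketch (our own example matrices, not from the text):

```{julia}
# numeric check (ours): for f(x) = A(x .* x), f(x+dx) - f(x) ≈ 2A*diagm(x)*dx
using LinearAlgebra

A = [1.0 2.0 0.0; 0.0 1.0 3.0]    # an example 2×3 matrix
f(x) = A * (x .* x)

x  = [1.0, -2.0, 0.5]
dx = 1e-6 * [1.0, 1.0, -1.0]

lhs = f(x + dx) - f(x)
rhs = 2A * diagm(x) * dx           # f'(x) dx = 2A diag(x) dx
isapprox(lhs, rhs; atol=1e-9)
```

The leftover discrepancy is the dropped second-order term $A(dx .* dx)$.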
## The chain rule
Like the product rule, the chain rule is shown by @BrightEdelmanJohnson in this notation with $f(x) = g(h(x))$:
$$
\begin{align*}
df &= f(x + dx) - f(x)\\
&= g(h(x + dx)) - g(h(x))\\
&= g(h(x) + h'(x)[dx]) - g(h(x))\\
&= g'(h(x)) [h'(x) [dx]]\\
&= (g'(h(x)) h'(x)) [dx]
\end{align*}
$$
(The limit requires a bit more detail.)
The operator $f'(x)= g'(h(x)) h'(x)$ is a product of matrices.
### Computational differences with expressions from the chain rule
Of note here is the application of the chain rule to three (or more) compositions:
The derivative of $f(x) = a(b(c(x)))$ is the product $a'(b(c(x)))\, b'(c(x))\, c'(x)$, which can be grouped two ways:
$$
f' = (a'b')c' \text{ or } f' = a'(b'c')
$$
Multiplying left to right (the first) is called reverse mode; multiplying right to left (the second) is called forward mode. The distinction becomes important when considering the computational cost of the multiplications.
* If $f: R^n \rightarrow R^m$ has $n$ much bigger than $1$ and $m=1$, then it is much faster to do left to right multiplication
* if $f:R^n \rightarrow R^m$ has $n=1$ and $m$ much bigger than one, the it is faster to do right to left multiplication.
The basic idea comes down to the shapes of the matrices. When $m=1$, the transposed (gradient) form of the derivative is a product of matrices of sizes $n\times j$, $j\times k$, and $k \times 1$, yielding a matrix of size $n \times 1$, matching the dimension of the gradient. (In this transposed form the order of the factors is reversed, so grouping from the right here corresponds to the reverse mode of the bullets above.) Matrix multiplication of an $m \times q$ matrix by a $q \times n$ matrix takes on the order of $mqn$ operations.
Multiplying left to right, the first operation takes $njk$ operations, leaving an $n\times k$ matrix; the next multiplication then takes another $nk$ operations, or $njk + nk$ altogether. Whereas computing from right to left is first $jk$ operations, leaving a $j \times 1$ matrix; the next operation takes another $nj$ operations. In total:
* left to right is $njk + nk = nk \cdot (j + 1)$,
* right to left is $jk + nj = j \cdot (k + n)$.
When $j=k$, say, left to right costs $nj(j+1)$ while right to left costs $j(j+n)$: the second is roughly a factor of $j$ fewer operations when $n$ is large. This can be quite significant in higher dimensions, whereas in the dimensions of calculus (where $n$ and $m$ are $3$ or less) it is not an issue.
##### Example
Using the `BenchmarkTools` package, we can check the time to compute various products:
```{julia}
using BenchmarkTools
n,j,k,m = 20,15,10,1
@btime A*(B*C) setup=(A=rand(n,j);B=rand(j,k); C=rand(k,m));
@btime (A*B)*C setup=(A=rand(n,j);B=rand(j,k); C=rand(k,m));
```
The latter computation is about 1.5 times slower.
Whereas the relationship is changed when the first matrix is skinny and the last is not:
```{julia}
@btime A*(B*C) setup=(A=rand(m,k);B=rand(k,j); C=rand(j,n));
@btime (A*B)*C setup=(A=rand(m,k);B=rand(k,j); C=rand(j,n));
```
In calculus, $n$ and $m$ are typically $1$, $2$, or $3$. But that need not be the case, especially if differentiation is over a parameter space.
## Derivatives of matrix functions
What is the derivative of $f(A) = A^2$?
The function $f$ takes an $n\times n$ matrix and returns a matrix of the same size. This innocuous question isn't directly
handled by the Jacobian, which is defined for vector-valued functions $f:R^n \rightarrow R^m$.
This derivative can be derived directly from the *product rule*:
$$
\begin{align*}
df &= d(AA)\\
&= A\, dA + dA\, A
\end{align*}
$$
That is $f'(A)$ is the operator $f'(A)[\delta A] = A \delta A + \delta A A$ and not $2A\delta A$, as $A$ may not commute with $\delta A$.
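This distinction is easy to verify numerically. A minimal sketch (our own example matrices, not from the text):

```{julia}
# numeric check (ours): for f(A) = A², the first-order change is A*δA + δA*A,
# not 2A*δA, since A and δA need not commute
A  = [1.0 2.0; 3.0 4.0]
δA = 1e-6 * [1.0 -1.0; 0.5 2.0]

lhs = (A + δA)^2 - A^2             # exact change, up to the δA² term
isapprox(lhs, A*δA + δA*A; atol=1e-9)   # matches the product-rule answer
```

Replacing the right side with `2A*δA` fails the same comparison, as the commutator $A\,\delta A - \delta A\, A$ is first order in $\delta A$.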
### Vectorization of a matrix
Alternatively, we can identify $A$ through its
components, as a vector in $R^{n^2}$ and then leverage the Jacobian.
One such identification is vectorization -- consecutively stacking the
column vectors into a vector. In `Julia` the `vec` function does this
operation:
```{julia}
@syms A[1:2, 1:2]
vec(A)
```
The stacking by column follows how `Julia` stores matrices and how `Julia` references a matrix's entries by linear index:
```{julia}
vec(A) == [A[i] for i in eachindex(A)]
```
With this vectorization operation, $f$ may be viewed as
$\tilde{f}:R^{n^2} \rightarrow R^{n^2}$ through:
$$
\tilde{f}(\text{vec}(A)) = \text{vec}(f(A))
$$
We use `SymPy` to compute the Jacobian of this vector valued function.
```{julia}
@syms A[1:3, 1:3]::real
f(x) = x^2
J = vec(f(A)).jacobian(vec(A)) # jacobian of f̃
```
We do this via linear algebra first, then see a more elegant manner following the notes.
A basic course in linear algebra shows that any linear operator on a finite vector space can be represented as a matrix. The basic idea is to represent what the operator does to each *basis* element and put these values as columns of the matrix.
In this $3 \times 3$ case, the linear operator works on an object with $9$ slots and returns an object with $9$ slots, so the matrix will be $9 \times 9$.
The basis elements are simply the matrices with a $1$ in spot $(i,j)$ and zero elsewhere. Here we generate them through a function:
```{julia}
basis(i,j,A) = (b=zeros(Int, size(A)...); b[i,j] = 1; b)
JJ = [vec(basis(i,j,A)*A + A*basis(i,j,A)) for j in 1:3 for i in 1:3]
```
The elements of `JJ` show the representation of each of the $9$ basis elements under the linear transformation.
To construct the matrix representing the linear operator, we need to concatenate these horizontally as column vectors
```{julia}
JJ = hcat(JJ...)
```
The matrix $JJ$ is identical to $J$, above:
```{julia}
all(j == jj for (j, jj) in zip(J, JJ))
```
### Kronecker products
But how can we see the Jacobian, $J$, from the linear operator $f'(A)[\delta A] = \delta A\, A + A\, \delta A$?
To make this less magical, an operation related to `vec` is defined.
The $\text{vec}$ function turns a matrix into a vector by stacking its columns, so it can be used for finding the Jacobian, as above. However, the shape of the matrix is lost, as are the fundamental matrix operations, like multiplication.
The [Kronecker product](https://en.wikipedia.org/wiki/Kronecker_product) replicates values to make a bigger matrix: if $A$ and $B$ are matrices, each value in $A$ is replaced by that value times $B$, so each entry of $A$ becomes a block the size of $B$.
Formally,
$$
A \otimes B =
\begin{bmatrix}
a_{11}B & a_{12}B & \cdots & a_{1n}B \\
a_{21}B & a_{22}B & \cdots & a_{2n}B \\
&\vdots & & \\
a_{m1}B & a_{m2}B & \cdots & a_{mn}B
\end{bmatrix}
$$
The function `kron` forms this product:
```{julia}
@syms A[1:2, 1:3] B[1:3, 1:4]
kron(A, B) # same as hcat((vcat((A[i,j]*B for i in 1:2)...) for j in 1:3)...)
```
The Kronecker product of an $m\times n$ matrix $A$ and a $j \times k$ matrix $B$ has size $mj \times nk$.
The Kronecker product has a certain algebra, including:
* transposes: $(A \otimes B)^T = A^T \otimes B^T$
* multiplication: $(A\otimes B)(C \otimes D) = (AC) \otimes (BD)$
* inverses: $(A \otimes B)^{-1} = (A^{-1}) \otimes (B^{-1})$
* orthogonality: $(A\otimes B)^T = (A\otimes B)^{-1}$ if both $A$ and $B$ are orthogonal
* determinants: $\det(A\otimes B) = \det(A)^m \det(B)^n$, where $A$ is $n\times n$, $B$ is $m \times m$.
* trace (sum of diagonal): $\text{tr}(A \otimes B) = \text{tr}(A)\text{tr}(B)$.
The main equation coupling `vec` and `kron` is the fact that if $A$, $B$, and $C$ have appropriate sizes, then:
$$
(A \otimes B) \text{vec}(C) = \text{vec}(B C A^T).
$$
Appropriate sizes for $A$, $B$, and $C$ are determined by the various products in $BCA^T$.
If $A$ is $m \times n$ and $B$ is $r \times s$, then since $BC$ is defined, $C$ has $s$ rows, and since $CA^T$ is defined, $C$ must have $n$ columns, as $A^T$ is $n \times m$; so $C$ must be $s\times n$. Checking this is consistent on the other side: $A \otimes B$ has size $mr \times ns$ and $\text{vec}(C)$ has length $sn$, so that product works, size-wise.
The referred-to notes have an explanation for this formula, but we confirm with an example with $m=n=2$, $r=s=3$:
```{julia}
@syms A[1:2, 1:2]::real B[1:3, 1:3]::real C[1:3, 1:2]::real
L, R = kron(A,B)*vec(C), vec(B*C*A')
all(l == r for (l, r) ∈ zip(L, R))
```
----
Now to use this relationship to recognize $df = A\, dA + dA\, A$ within the Jacobian computed from $\text{vec}(f(A))$.
We have $\text{vec}(A\, dA + dA\, A) = \text{vec}(A\, dA) + \text{vec}(dA\, A)$, by the obvious linearity of $\text{vec}$. Now inserting an identity matrix, $I$, which is symmetric, we have:
$$
\text{vec}(A dA) = \text{vec}(A dA I^T) = (I \otimes A) \text{vec}(dA),
$$
and
$$
\text{vec}(dA A) = \text{vec}(I dA (A^T)^T) = (A^T \otimes I) \text{vec}(dA)
$$
This leaves
$$
\text{vec}(A dA + dA A) =
\left((I \otimes A) + (A^T \otimes I)\right) \text{vec}(dA)
$$
We should then get the Jacobian we computed from the following:
```{julia}
@syms A[1:3, 1:3]::real
using LinearAlgebra: I
J = vec(A^2).jacobian(vec(A))
JJ = kron(I(3), A) + kron(A', I(3))
all(j == jj for (j,jj) in zip(J,JJ))
```
This technique can also be used with other powers, say $f(A) = A^3$, where the resulting $df = A^2 dA + A dA A + dA A^2$ is one answer that can be compared to a Jacobian through
$$
\begin{align*}
df &= \text{vec}(A^2 dA I^T) + \text{vec}(A dA A) + \text{vec}(I dA A^2)\\
&= (I \otimes A^2)\text{vec}(dA) + (A^T \otimes A) \text{vec}(dA) + ((A^T)^2 \otimes I) \text{vec}(dA)
\end{align*}
$$
The above shows how to relate the derivative of a matrix function to
the Jacobian of a vectorized function, but only for illustration. It
is decidedly not necessary to express the derivative of $f$ in terms of
the derivative of its vectorized counterpart.
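The $A^3$ identity above can be confirmed numerically without symbolic machinery. A quick sketch (our own random matrices, not from the text):

```{julia}
# numeric check (ours) of the A³ differential:
# vec(A²E + A E A + E A²) = (I⊗A² + Aᵀ⊗A + (Aᵀ)²⊗I) vec(E)
using LinearAlgebra

A = rand(3, 3)
E = rand(3, 3)   # plays the role of dA

lhs = vec(A^2*E + A*E*A + E*A^2)
rhs = (kron(I(3), A^2) + kron(A', A) + kron((A')^2, I(3))) * vec(E)
isapprox(lhs, rhs)
```

Both sides are the same linear map applied to `vec(E)`, so they agree up to roundoff.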
##### Example: derivative of the inverse
What is the derivative of $f(A) = A^{-1}$? When $A$ is a scalar, $d(A^{-1}) = -A^{-2}\,dA$; the matrix case is similar, though the order of the factors must be respected. Starting with $I = AA^{-1}$ and noting $dI$ is $0$ we have
$$
\begin{align*}
0 &= d(AA^{-1})\\
&= dAA^{-1} + A d(A^{-1})
\end{align*}
$$
So, $d(A^{-1}) = -A^{-1} dA A^{-1}$.
This could be re-expressed as a linear operator through
$$
\text{vec}(d(A^{-1})) =
-\left((A^{-1})^T \otimes A^{-1}\right) \text{vec}(dA)
= -\left((A^T)^{-1} \otimes A^{-1}\right) \text{vec}(dA).
$$
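This differential can be checked with a small perturbation. A minimal sketch (our own example matrices, not from the text):

```{julia}
# numeric check (ours): inv(A + dA) - inv(A) ≈ -inv(A) * dA * inv(A)
using LinearAlgebra

A  = [2.0 1.0; 0.0 3.0]
dA = 1e-6 * [1.0 2.0; -1.0 0.5]

lhs = inv(A + dA) - inv(A)
rhs = -inv(A) * dA * inv(A)
isapprox(lhs, rhs; atol=1e-10)
```

The residual is second order in `dA`, well under the tolerance.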
##### Example: derivative of the determinant
Let $f(A) = \text{det}(A)$. What is the derivative?
First, the determinant of a square, $n\times n$, matrix $A$ is a scalar summary of $A$ with different means to compute it, but one recursive one in particular is helpful here:
$$
\text{det}(A) = a_{1j}C_{1j} + a_{2j}C_{2j} + \cdots + a_{nj}C_{nj}
$$
for any $j$. The *cofactor* $C_{ij}$ is the determinant of the $(n-1)\times(n-1)$ matrix with the $i$th row and $j$th column deleted times $(-1)^{i+j}$.
To find the *gradient* of $f$, we differentiate by each of the $A_{ij}$ variables, and so
$$
\frac{\partial\text{det}(A)}{\partial A_{ij}} =
\frac{\partial (a_{1j}C_{1j} + a_{2j}C_{2j} + \cdots + a_{nj}C_{nj})}{\partial A_{ij}} =
C_{ij},
$$
as each cofactor $C_{kj}$ in the expansion removes the $j$th column, and so has no dependence on $A_{ij}$.
So the gradient is the matrix of cofactors.
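For an invertible matrix the cofactor matrix equals $\text{det}(A)\,(A^{-1})^T$, which lets us check this gradient numerically. A sketch with our own example matrix (not from the text):

```{julia}
# numeric check (ours): ∂det(A)/∂A_ij is the cofactor C_ij = det(A)*inv(A)'_ij
using LinearAlgebra

A = [2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]
C = det(A) * transpose(inv(A))     # cofactor matrix

h = 1e-6
G = zeros(3, 3)
for i in 1:3, j in 1:3
    E = zeros(3, 3); E[i, j] = 1.0     # perturb one entry at a time
    G[i, j] = (det(A + h*E) - det(A)) / h
end
isapprox(G, C; atol=1e-4)
```

Since the determinant is affine in any single entry, each difference quotient is exact up to roundoff.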
@BrightEdelmanJohnson also give a different proof, starting with the observation that, to first order,
$$
\text{det}(I + dA) - \text{det}(I) = \text{tr}(dA).
$$
Assuming that, then by the fact $\text{det}(AB) = \text{det}(A)\text{det}(B)$:
$$
\begin{align*}
\text{det}(A + A(A^{-1}dA)) - \text{det}(A) &= \text{det}(A)\cdot(\text{det}(I+ A^{-1}dA) - \text{det}(I)) \\
&= \text{det}(A) \text{tr}(A^{-1}dA)\\
&= \text{tr}(\text{det}(A)A^{-1}dA)\\
\end{align*}
$$
This agrees with the cofactor computation, as the inverse of a matrix can be computed from its cofactor matrix: $A^{-1} = C^T/\text{det}(A)$, so $\text{det}(A) A^{-1} = C^T$ and $\text{tr}(C^T dA) = \sum_{ij} C_{ij}\, dA_{ij}$.
That the trace gets involved can be seen from this computation, which shows the only first-order terms are from the diagonal sum:
```{julia}
using LinearAlgebra
@syms dA[1:2, 1:2]
det(I + dA) - det(I)
```
## The adjoint method
The chain rule brings about a series of products. The adjoint method, illustrated below, shows how to approach the computation of the series in a direction that minimizes the computational cost, illustrating why reverse mode is preferred to forward mode when a scalar function of several variables is considered.
@BrightEdelmanJohnson consider the derivative of
$$
g(p) = f(A(p)^{-1} b)
$$
This might arise from applying a scalar-valued $f$ to the solution of $Ax = b$, where $A$ is parameterized by $p$.
The chain rule gives the following computation to find the derivative (or gradient):
$$
\begin{align*}
dg
&= f'(x)[dx]\\
&= f'(x) [d(A(p)^{-1} b)]\\
&= f'(x)[-A(p)^{-1} dA A(p)^{-1} b + 0]\\
&= -f'(x) A(p)^{-1} dA A(p)^{-1} b.
\end{align*}
$$
By writing $dA = A'(p)[dp]$ and setting $v^T = f'(x)A(p)^{-1}$ this becomes
$$
dg = -v^T dA A(p)^{-1} b = -v^T dA x
$$
This product of three terms can be computed in two directions:
From left to right:
First, $v$ is found from $v^T = f'(x) A^{-1}$ by taking transposes:
$v = (A^{-1})^T (f'(x))^T = (A^T)^{-1} \nabla f$,
that is, by solving $A^T v = \nabla f$. This is called the *adjoint* equation.
Each partial derivative of $g$ is then related to the corresponding partial derivative of $A$ through:
$$
\frac{\partial g}{\partial p_k} = -v^T\frac{\partial A}{\partial p_k} x,
$$
as the scalar factor commutes through. With $v$ and $x$ solved for (via the adjoint equation and from solving $Ax=b$), the partials in $p_k$ are computed with dot products. There are just two costly operations: the two linear solves.
From right to left:
The value of $x$ can be solved for, as above, but computing the value of
$$
\frac{\partial g}{\partial p_k} =
-f'(x) \left(A^{-1} \frac{\partial A}{\partial p_k} x \right)
$$
requires a costly solve for each $p_k$, and $p$ may have many components. As mentioned above, the reverse mode offers advantages when there are many input parameters ($p$) and a single output parameter.
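The two-solve recipe can be sketched in code. The parameterization below is our own hypothetical choice (a linear one, $A(p) = A_0 + p_1B_1 + p_2B_2$, with $f(x) = \sum x_i^2$), used only to illustrate the method; the results are compared against a forward difference of $g$:

```{julia}
# adjoint-method sketch (ours): two solves give every ∂g/∂pₖ = -vᵀ(∂A/∂pₖ)x
using LinearAlgebra

A₀ = [4.0 1.0; 0.0 3.0]
B₁ = [1.0 0.0; 0.0 0.0]
B₂ = [0.0 1.0; 1.0 0.0]
b  = [1.0, 2.0]
A(p) = A₀ + p[1]*B₁ + p[2]*B₂      # hypothetical parameterization

f(x)  = sum(abs2, x)
∇f(x) = 2x

p = [0.3, -0.2]
x = A(p) \ b                       # solve A x = b
v = A(p)' \ ∇f(x)                  # adjoint equation Aᵀ v = ∇f

grads = [-v' * B * x for B in (B₁, B₂)]   # each partial is a cheap product

# compare against a forward difference of g(p) = f(A(p)⁻¹ b)
g(p) = f(A(p) \ b)
h = 1e-6
fd = [(g(p + h*(1:2 .== k)) - g(p)) / h for k in 1:2]
isapprox(grads, fd; atol=1e-4)
```

No matter how many components $p$ has, only the two solves for `x` and `v` are costly; each additional partial is a matrix-vector product and a dot product.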
##### Example
Suppose $x(p)$ solves some system of equations $h(x(p),p) = 0$ in $R^n$ ($n$ possibly just $1$) and $g(p) = f(x(p))$ is some non-linear transformation of $x$. What is the derivative of $g$ in $p$?
Suppose the *implicit function theorem* applies to $h(x,p) = 0$; that is, *locally* there is an implicitly defined function $x(p)$ with a derivative. Moreover, differentiating both sides of $h(x(p), p) = 0$ gives
$$
0 = \frac{\partial h}{\partial p} dp + \frac{\partial h}{\partial x} dx
$$
which can be solved for $dx$ to give
$$
dx = -\left(\frac{\partial h}{\partial x}\right)^{-1} \frac{\partial h}{\partial p} dp.
$$
The chain rule then gives
$$
dg = f'(x) dx = -f'(x) \left(\frac{\partial h}{\partial x}\right)^{-1} \frac{\partial h}{\partial p} dp.
$$
This can be computed in two directions:
From left to right:
Call $A = \frac{\partial h}{\partial x}$, so that $dg = -f'(x) A^{-1} \frac{\partial h}{\partial p} dp$. Define $v$ indirectly through $v^T = f'(x) A^{-1}$. With this:
$v = (A^{-1})^T (f'(x))^T = (A^T)^{-1} \nabla{f}$,
which is found by solving
$A^Tv = \nabla{f}$.
Again, this is the *adjoint* equation.
Each partial derivative of $g$ is then
$$
\frac{\partial g}{\partial p_k} = -v^T\frac{\partial h}{\partial p_k},
$$
as the scalar factor commutes through. With $x$ and $v$ solved for (from $h(x,p) = 0$ and the adjoint equation), the partials in $p_k$ are computed with simple products.
However, from right to left, the value of $x$ can be solved for, but computing the value of
$$
\frac{\partial g}{\partial p_k} =
-f'(x)
\left(A^{-1} \frac{\partial h}{\partial p_k} \right)
$$
requires a costly solve for each $p_k$, and $p$ may have many components. The reverse mode offers advantages when there are many input parameters ($p$) and a single output parameter.
## Second derivatives, Hessian
We reference a theorem presented by [Carlsson, Nikitin, Troedsson, and Wendt](https://arxiv.org/pdf/2502.03070v1) for exposition, with some modification.
::: {.callout-note appearance="minimal"}
Theorem 1. Let $f:X \rightarrow Y$, where $X,Y$ are finite dimensional *inner product* spaces with elements in $R$. Suppose $f$ is smooth (a certain number of derivatives). Then for each $x$ in $X$ there exists a unique linear operator, $f'(x)$, and a unique *bilinear* *symmetric* operator $f'': X \oplus X \rightarrow Y$ such that
$$
f(x + \delta x) = f(x) + f'(x)[\delta x] + \frac{1}{2}f''(x)[\delta x, \delta x] + \mathscr{o}(||\delta x ||^2).
$$
:::
New terms include *bilinear*, *symmetric*, and *inner product*. An operator ($X\oplus X \rightarrow Y$) is bilinear if it is a linear operator in each of its two arguments. Such an operator is *symmetric* if interchanging its two arguments makes no difference in its output. Finally, an *inner product* space is one with a generalization of the dot product. An inner product takes two vectors $x$ and $y$ and returns a scalar; it is denoted $\langle x,y\rangle$; and has properties of symmetry, linearity, and non-negativity ($\langle x,x\rangle \geq 0$, and equal $0$ only if $x$ is the zero vector.) Inner products can be used to form a norm (or length) for a vector through $||x||^2 = \langle x,x\rangle$.
We reference this, as the values denoted $f'$ and $f''$ are *unique*. So if we identify them one way, we have identified them.
Specializing to $X=R^n$ and $Y=R^1$, we have $f'=(\nabla f)^T$ and $f''$ is the Hessian.
Take $n=2$. Previously we wrote a formula for Taylor's theorem for $f:R^n \rightarrow R$ that with $n=2$ has with $x=\langle x_1,x_2\rangle$:
$$
\begin{align*}
f(x + dx) &= f(x) +
\frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2\\
&+ \frac{1}{2}\left(
\frac{\partial^2 f}{\partial x_1^2}dx_1^2 +
2\frac{\partial^2 f}{\partial x_1 \partial x_2}dx_1dx_2 +
\frac{\partial^2 f}{\partial x_2^2}dx_2^2
\right) + \mathscr{o}(||dx||^2).
\end{align*}
$$
We can use $\nabla{f} \cdot dx = f'(x)\, dx$ to tidy up the first line; moreover, the parenthesized part of the second line can be seen to be a matrix product:
$$
[dx_1 dx_2]
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} &
\frac{\partial^2 f}{\partial x_1 \partial x_2}\\
\frac{\partial^2 f}{\partial x_2 \partial x_1} &
\frac{\partial^2 f}{\partial x_2^2}
\end{bmatrix}
\begin{bmatrix}
dx_1\\
dx_2
\end{bmatrix}
= dx^T H dx,
$$
$H$ being the *Hessian* with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
This formula -- $f(x+dx)-f(x) \approx f'(x)\,dx + \frac{1}{2}dx^T H\, dx$ -- is valid for any $n$, showing $n=2$ was just for ease of notation when expressing in coordinates rather than as matrices.
By uniqueness, we have under these assumptions that the Hessian is *symmetric* and the expression $dx^T H dx$ is a *bilinear* form, which we can identify as $f''(x)[dx,dx]$.
That the Hessian is symmetric could also be derived under these assumptions by directly computing that the mixed partials can have their order exchanged. But in this framework, as explained by @BrightEdelmanJohnson it is a result of the underlying vector space having an addition that is commutative (e.g. $u+v = v+u$).
The mapping $(u,v) \rightarrow u^T A v$ for a matrix $A$ is bilinear. For a fixed $u$, it is linear as it can be viewed as $(u^TA)[v]$ and matrix multiplication is linear. Similarly for a fixed $v$.
@BrightEdelmanJohnson extend this characterization to a broader setting.
The second derivative can be viewed as expressing the first-order change in $f'(x)$, which is a linear operator. The change $df' = f'(x + dx') - f'(x) = f''(x)[dx']$ has the same shape as $f'$, a linear operator, so it acts on vectors, say $dx$:
$$
df'[dx] = f''(x)[dx'][dx] = f''(x)[dx', dx]
$$
The prime in $dx'$ is just notation, not a derivative operation for $dx$.
With this view, we can see that $f''(x)$ has two vectors it acts on. By definition it is linear in $dx$. However, as $f'(x)$ is a linear operator and the sum and product rules apply to derivatives, this operator is linear in $dx'$ as well. So $f''(x)$ is bilinear and as mentioned earlier symmetric.
### Polarization
@BrightEdelmanJohnson interpret $f''$ by looking at the image under $f$ of $x + dx + dx'$. If $x$ is a vector, then this has a geometrical picture, from vector addition, relating $x + dx$, $x+dx'$, and $x + dx + dx'$.
The image of $x + dx$ is, to second order, $f(x) + f'(x)[dx] + (1/2)f''(x)[dx, dx]$; similarly, the image of $x + dx'$ is, to second order, $f(x) + f'(x)[dx'] + (1/2)f''(x)[dx', dx']$. The key formula for $f''(x)$ comes from expanding $f''(x)[dx+dx', dx+dx']$ by bilinearity and symmetry:
$$
\begin{align*}
f(x + dx + dx') &= f(x) + f'(x)[dx + dx'] + \frac{1}{2}f''(x)[dx + dx', dx + dx']\\
&= f(x) + f'(x)[dx] + f'(x)[dx'] + \frac{1}{2}f''(x)[dx, dx]\\
&\quad + \frac{1}{2}f''(x)[dx', dx'] + f''(x)[dx, dx'].
\end{align*}
$$
This gives a means to compute $f''$ acting on two different vectors in terms of $f''$ acting on diagonal terms, where the two vectors are equal:
$$
f''(x)[dx, dx'] = \frac{1}{2}\left( f''(x)[dx+dx',dx+dx'] - f''(x)[dx,dx] - f''(x)[dx',dx'] \right).
$$
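The polarization identity is easy to illustrate with a bilinear form given by a symmetric matrix. A quick sketch (our own example, not from the text):

```{julia}
# numeric illustration (ours) of polarization for B[u,v] = uᵀHv, H symmetric
using LinearAlgebra

H = [2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 4.0]   # symmetric; plays the role of f''
B(u, v) = u' * H * v

u = [1.0, -1.0, 2.0]
v = [0.5, 2.0, -1.0]

B(u, v) ≈ (B(u + v, u + v) - B(u, u) - B(v, v)) / 2
```

Only evaluations on "diagonal" pairs (equal arguments) appear on the right.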
As a description of second-order change in $f$, we recover the initial terms in the Taylor series
$$
f(x + \delta x) = f(x) + f'(x)\delta x + (1/2) f''(x)[\delta x, \delta x] + \mathscr{o}(||\delta x||^2).
$$
### Examples
##### Example: second derivative of $x^TAx$
Consider an expression from earlier $f(x) = x^T A x$ for some constant $A$. Then $f''$ is found by noting that $f' = (\nabla f)^T = x^T(A + A^T)$, or $\nabla f = (A^T + A)x$ and $f'' = H = A^T + A$ is the Jacobian of the gradient.
By rearranging terms, it can be shown that $f(x) = 1/2 x^THx = 1/2 f''[x,x]$.
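The identification $H = A^T + A$ can be checked with central second differences. A sketch with our own (deliberately non-symmetric) example matrix:

```{julia}
# numeric check (ours): the Hessian of f(x) = xᵀAx is A + Aᵀ
using LinearAlgebra

A = [1.0 2.0; 0.0 3.0]             # not symmetric, on purpose
f(x) = x' * A * x

x = [0.4, -1.1]
h = 1e-4
e(k) = (1:2 .== k)                 # standard basis vector
H = [ (f(x + h*e(i) + h*e(j)) - f(x + h*e(i)) - f(x + h*e(j)) + f(x)) / h^2
      for i in 1:2, j in 1:2 ]
isapprox(H, A + A'; atol=1e-6)
```

Since $f$ is quadratic, the mixed difference quotient is exact up to roundoff, and the recovered matrix is symmetric even though $A$ is not.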
##### Example: second derivative of $\text{det}(A)$
Consider $f(A) = \text{det}(A)$. We saw previously that, to first order:
$$
\begin{align*}
\text{tr}(A + B) &= \text{tr}(A) + \text{tr}(B)\\
\text{det}(A + dA') &= \text{det}(A) + \text{det}(A)\text{tr}(A^{-1}dA')\\
(A + dA')^{-1} &= A^{-1} - A^{-1} dA' A^{-1}
\end{align*}
$$
These are all used to simplify:
$$
\begin{align*}
\text{det}(A+dA')&\text{tr}((A + dA')^{-1} dA) - \text{det}(A) \text{tr}(A^{-1}dA) \\
&= \left(
\text{det}(A) + \text{det}(A)\text{tr}(A^{-1}dA')
\right)
\text{tr}((A^{-1} - A^{-1}dA' A^{-1})dA) - \text{det}(A) \text{tr}(A^{-1}dA) \\
&=
\text{det}(A) \text{tr}(A^{-1}dA)\\
&+ \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) \\
&- \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&- \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA' A^{-1}dA)\\
&- \text{det}(A) \text{tr}(A^{-1}dA) \\
&= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA) - \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA)\\
&+ \text{third order term}
\end{align*}
$$
So, after dropping the third-order term, we see:
$$
\begin{align*}
f''(A)&[dA,dA'] \\
&= \text{det}(A)\text{tr}(A^{-1}dA')\text{tr}(A^{-1}dA)
- \text{det}(A)\text{tr}(A^{-1}dA' A^{-1}dA).
\end{align*}
$$
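This bilinear form can be checked against a mixed second difference of the determinant. A sketch with our own example matrices (not from the text), writing $dB$ for $dA'$:

```{julia}
# numeric check (ours): the mixed partial ∂²/∂s∂t det(A + s dA + t dB) at 0
# should match det(A)(tr(A⁻¹dB)tr(A⁻¹dA) - tr(A⁻¹dB A⁻¹dA))
using LinearAlgebra

A  = [2.0 0.5 0.0; 0.0 3.0 1.0; 1.0 0.0 2.0]
dA = [1.0 0.0 2.0; 0.0 1.0 0.0; 1.0 1.0 0.0]
dB = [0.0 1.0 0.0; 2.0 0.0 1.0; 0.0 0.0 1.0]

h = 1e-4
g(s, t) = det(A + s*dA + t*dB)
mixed = (g(h, h) - g(h, 0) - g(0, h) + g(0, 0)) / h^2

f2 = det(A) * (tr(inv(A)*dB) * tr(inv(A)*dA) - tr(inv(A)*dB*inv(A)*dA))
isapprox(mixed, f2; atol=1e-2)
```

The loose tolerance absorbs the third-order terms of the difference scheme.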