align fix; theorem style; condition number
@@ -2,73 +2,74 @@

In Part III of @doi:10.1137/1.9781611977165 we find the language of numerical analysis useful to formally describe the zero-finding problem. Key concepts are errors, conditioning, and stability.

Abstractly a *problem* is a mapping, $F$, from a domain $X$ of data to a range $Y$ of solutions. Both $X$ and $Y$ have a sense of distance given by a *norm*. A norm is a generalization of the absolute value and gives quantitative meaning to terms like small and large.

> A *well-conditioned* problem is one with the property that all small perturbations of $x$ lead to only small changes in $F(x)$.

This sense of "small" is measured through a *condition number*.

If we let $\delta_x$ be a small perturbation of $x$ then $\delta_F = F(x + \delta_x) - F(x)$.

The *forward error* is $\lVert\delta_F\rVert = \lVert F(x+\delta_x) - F(x)\rVert$; the *relative forward error* is $\lVert\delta_F\rVert/\lVert F\rVert = \lVert F(x+\delta_x) - F(x)\rVert/ \lVert F(x)\rVert$.

The *backward error* is $\lVert\delta_x\rVert$; the *relative backward error* is $\lVert\delta_x\rVert / \lVert x\rVert$.

The *absolute condition number* $\hat{\kappa}$ is the worst case of the ratio $\lVert\delta_F\rVert/ \lVert\delta_x\rVert$ as the perturbation size shrinks to $0$.

The relative condition number $\kappa$ divides $\lVert\delta_F\rVert$ by $\lVert F(x)\rVert$ and $\lVert\delta_x\rVert$ by $\lVert x\rVert$ before taking the ratio.

A *problem* is a mathematical concept; an *algorithm* is the computational version. Algorithms may differ for many reasons, such as floating point errors, tolerances, etc. We use the notation $\tilde{F}$ to indicate the algorithm.

The absolute error in the algorithm is $\lVert\tilde{F}(x) - F(x)\rVert$; the relative error divides by $\lVert F(x)\rVert$. A good algorithm would have smaller relative errors.

An algorithm is called *stable* if

$$
\frac{\lVert\tilde{F}(x) - F(\tilde{x})\rVert}{\lVert F(\tilde{x})\rVert}
$$

is *small* for *some* $\tilde{x}$ relatively near $x$, meaning $\lVert\tilde{x}-x\rVert/\lVert x\rVert$ is small.

> A *stable* algorithm gives nearly the right answer to nearly the right question.

(The answer it gives is $\tilde{F}(x)$; the nearly right question: what is $F(\tilde{x})$?)

A related concept: an algorithm $\tilde{F}$ for a problem $F$ is *backward stable* if for each $x \in X$,

$$
\tilde{F}(x) = F(\tilde{x})
$$

for some $\tilde{x}$ with $\lVert\tilde{x} - x\rVert/\lVert x\rVert$ small.

> "A backward stable algorithm gives exactly the right answer to nearly the right question."

The concepts are related by Trefethen and Bau's Theorem 15.1, which says for a backward stable algorithm the relative error $\lVert\tilde{F}(x) - F(x)\rVert/\lVert F(x)\rVert$ is small in a manner proportional to the relative condition number.

Applying this to the zero-finding problem we follow @doi:10.1137/1.9781611975086.

To be specific, the problem, $F$, is finding a zero of a function $f$ starting at an initial point $x_0$. The data is $(f, x_0)$; the solution is $r$, a zero of $f$.

Take the algorithm to be Newton's method. Any implementation must incorporate tolerances, so this is a computational approximation to the problem. The data is the same, but technically we use $\tilde{f}$ for the function, as any computation is dependent on machine implementations. The output is $\tilde{r}$, an *approximate* zero.

Suppose for the sake of argument that $\tilde{f}(x) = f(x) + \epsilon$, $f$ has a continuous derivative, $r$ is a root of $f$, and $\tilde{r} = r + \delta$ is a root of $\tilde{f}$. Then by linearization:

$$
\begin{align*}
0 &= \tilde{f}(\tilde r) \\
&= f(r + \delta) + \epsilon\\
&\approx f(r) + f'(r)\delta + \epsilon\\
&= 0 + f'(r)\delta + \epsilon
\end{align*}
$$

Rearranging gives $\lVert\delta/\epsilon\rVert \approx 1/\lVert f'(r)\rVert$. But the $|\delta|/|\epsilon|$ ratio is related to the condition number:

> The absolute condition number is $\hat{\kappa}_r = |f'(r)|^{-1}$.
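
A quick numerical sketch of this in `Julia` (the quadratic and the perturbation size $\epsilon = 10^{-6}$ are made up for illustration): shifting $f$ by a constant moves the root by about $\epsilon \cdot |f'(r)|^{-1}$.

```julia
# f(x) = x^2 - 2 has the root r = sqrt(2); the perturbed function
# x^2 - 2 - ep has the root sqrt(2 + ep). Compare the observed shift
# in the root with the prediction from the condition number 1/|f'(r)|.
f(x)  = x^2 - 2
fp(x) = 2x                       # f'
r  = sqrt(2)
ep = 1e-6                        # the perturbation ϵ
rt = sqrt(2 + ep)                # root of the perturbed problem
observed  = rt - r
predicted = ep / abs(fp(r))      # ≈ 3.54e-7; agrees to several digits
```
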

The error formula in Newton's method, which measures the distance between the actual root and an approximation, includes the derivative in the denominator, so we see large condition numbers are tied to possibly larger errors.

Now consider $g(x) = f(x) - f(\tilde{r})$. Call $f(\tilde{r})$ the residual. We have $g$ is near $f$ if the residual is small. The algorithm will solve $(g, x_0)$ with $\tilde{r}$, so with a small residual an exact solution to an approximate question will be found. Driscoll and Braun state
@@ -83,4 +84,5 @@ Practically these two observations lead to

For the first observation, the example of Wilkinson's polynomial is often used, where $f(x) = (x-1)\cdot(x-2)\cdot \cdots\cdot(x-20)$. When expanded, this function has exactness issues for typical floating point values: the condition number is large and some of the roots found are quite different from the mathematical values.
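
The observation can be illustrated in base `Julia` (a sketch; the helper names are our own): build the expanded coefficients in floating point, then evaluate both the factored and the expanded forms at the root $x = 20$.

```julia
# Coefficients of w(x) = (x-1)(x-2)⋯(x-n), lowest degree first, built
# by repeatedly multiplying by (x - k); each product rounds in Float64.
function wilkinson_coeffs(n)
    c = [1.0]
    for k in 1:n
        c = [0.0; c] .- k .* [c; 0.0]   # multiply the polynomial by (x - k)
    end
    c
end

horner(c, x) = foldr((ci, acc) -> ci + x * acc, c)  # evaluate Σ c[i+1]·xⁱ
w_product(x) = prod(x - i for i in 1:20)            # factored form

c = wilkinson_coeffs(20)
w_product(20.0)     # exactly 0.0
horner(c, 20.0)     # typically a large nonzero roundoff residual
```
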

The second observation helps explain why a problem like finding the zero of $f(x) = x \cdot \exp(x)$ using Newton's method starting at $2$ might return a value like $5.89\dots$. The residual is checked to be zero in a *relative* manner, which would basically use a tolerance of `atol + abs(xn)*rtol`. Functions with asymptotes of $0$ will eventually be smaller than this value.
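
A sketch of such a relative residual check (here $f(x) = x\exp(-x)$, a made-up example chosen because it decays to $0$; the tolerance values are illustrative):

```julia
# A residual is judged "zero" relative to the size of xn:
#     |f(xn)| <= atol + |xn| * rtol
isapproxzero(f, xn; atol=1e-8, rtol=1e-8) = abs(f(xn)) <= atol + abs(xn) * rtol

f(x) = x * exp(-x)
isapproxzero(f, 25.0)   # true: 25e⁻²⁵ ≈ 3.5e-10 sneaks under the tolerance
isapproxzero(f, 1.0)    # false: e⁻¹ ≈ 0.37 does not
```
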

The second observation follows from $f(x_n)$ monitoring the backward error and the product of the condition number and the backward error monitoring the forward error. This product is on the order of $|f(x_n)/f'(x_n)|$ or $|x_{n+1} - x_n|$.
@@ -22,6 +22,8 @@ nothing
---

{width=60%}

Before defining the derivative of a function, let's begin with two motivating examples.
@@ -520,13 +522,14 @@ This holds two rules: the derivative of a constant times a function is the const
This example shows a useful template:

$$
\begin{align*}
[2x^2 - \frac{x}{3} + 3e^x]' & = 2[\square]' - \frac{[\square]'}{3} + 3[\square]'\\
&= 2[x^2]' - \frac{[x]'}{3} + 3[e^x]'\\
&= 2(2x) - \frac{1}{3} + 3e^x\\
&= 4x - \frac{1}{3} + 3e^x
\end{align*}
$$
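
The template's final line can be sanity-checked with a centered difference (a sketch; the point $x = 1.2$ and step $h$ are arbitrary):

```julia
f(x)  = 2x^2 - x/3 + 3 * exp(x)
fp(x) = 4x - 1/3 + 3 * exp(x)        # the answer from the template
x, h = 1.2, 1e-6
agree = isapprox((f(x + h) - f(x - h)) / (2h), fp(x); atol=1e-6)  # true
```
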

### Product rule

@@ -548,12 +551,13 @@ The output uses the Leibniz notation to represent that the derivative of $u(x) \
This example shows a useful template for the product rule:

$$
\begin{align*}
[(x^2+1)\cdot e^x]' &= [\square]' \cdot (\square) + (\square) \cdot [\square]'\\
&= [x^2 + 1]' \cdot (e^x) + (x^2+1) \cdot [e^x]'\\
&= (2x)\cdot e^x + (x^2+1)\cdot e^x
\end{align*}
$$
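
As with the sum rule, the product-rule answer can be checked numerically (a sketch; the evaluation point is arbitrary):

```julia
f(x)  = (x^2 + 1) * exp(x)
fp(x) = 2x * exp(x) + (x^2 + 1) * exp(x)    # the product-rule answer
x, h = 0.7, 1e-6
agree = isapprox((f(x + h) - f(x - h)) / (2h), fp(x); atol=1e-6)  # true
```
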

### Quotient rule

@@ -572,12 +576,13 @@ limit((f(x+h) - f(x))/h, h => 0)
This example shows a useful template for the quotient rule:

$$
\begin{align*}
[\frac{x^2+1}{e^x}]' &= \frac{[\square]' \cdot (\square) - (\square) \cdot [\square]'}{(\square)^2}\\
&= \frac{[x^2 + 1]' \cdot (e^x) - (x^2+1) \cdot [e^x]'}{(e^x)^2}\\
&= \frac{(2x)\cdot e^x - (x^2+1)\cdot e^x}{e^{2x}}
\end{align*}
$$
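
And the same centered-difference check for the quotient-rule answer (a sketch; the point is arbitrary):

```julia
f(x)  = (x^2 + 1) / exp(x)
fp(x) = (2x * exp(x) - (x^2 + 1) * exp(x)) / exp(x)^2   # the template's answer
x, h = 0.7, 1e-6
agree = isapprox((f(x + h) - f(x - h)) / (2h), fp(x); atol=1e-6)  # true
```
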

##### Examples

@@ -672,19 +677,21 @@ There are $n$ terms, each where one of the $f_i$s have a derivative. Were we to

With this, we can proceed. Each term $x-i$ has derivative $1$, so $f'(x)$, with $f$ as above, is

$$
\begin{align*}
f'(x) &= f(x)/(x-1) + f(x)/(x-2) + f(x)/(x-3)\\
&+ f(x)/(x-4) + f(x)/(x-5),
\end{align*}
$$

That is

$$
\begin{align*}
f'(x) &= (x-2)(x-3)(x-4)(x-5) + (x-1)(x-3)(x-4)(x-5)\\
&+ (x-1)(x-2)(x-4)(x-5) + (x-1)(x-2)(x-3)(x-5) \\
&+ (x-1)(x-2)(x-3)(x-4).
\end{align*}
$$
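
Since each summand just drops one factor, the formula translates directly into code and can be compared against a centered difference (a sketch; the evaluation point is arbitrary):

```julia
f(x)  = prod(x - i for i in 1:5)
fp(x) = sum(prod(x - i for i in 1:5 if i != j) for j in 1:5)   # drop factor j
x, h = 0.5, 1e-6
agree = isapprox((f(x + h) - f(x - h)) / (2h), fp(x); atol=1e-5)  # true
```
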

---

@@ -749,17 +756,18 @@ Combined, we would end up with:
To see that this works in our specific case, we assume the general power rule that $[x^n]' = n x^{n-1}$ to get:

$$
\begin{align*}
f(x) &= x^2 & g(x) &= \sqrt{x}\\
f'(\square) &= 2(\square) & g'(x) &= \frac{1}{2}x^{-1/2}
\end{align*}
$$

We use $\square$ for the argument of `f'` to emphasize that $g(x)$ is the needed value, not just $x$:

$$
\begin{align*}
[(\sqrt{x})^2]' &= [f(g(x))]'\\
&= f'(g(x)) \cdot g'(x) \\
&= 2(\sqrt{x}) \cdot \frac{1}{2}x^{-1/2}\\
&= \frac{2\sqrt{x}}{2\sqrt{x}}\\
&=1
\end{align*}
$$

This is the same as the derivative of $x$ found by first evaluating the composition. For this problem, the chain rule is not necessary, but typically it is a needed rule to fully differentiate a function.

@@ -778,11 +787,12 @@ This is the same as the derivative of $x$ found by first evaluating the composit
Find the derivative of $f(x) = \sqrt{1 - x^2}$. We identify the composition of $\sqrt{x}$ and $(1-x^2)$. We set the functions and their derivatives into a pattern to emphasize the pieces in the chain-rule formula:

$$
\begin{align*}
f(x) &=\sqrt{x} = x^{1/2} & g(x) &= 1 - x^2 \\
f'(\square) &=(1/2)(\square)^{-1/2} & g'(x) &= -2x
\end{align*}
$$

Then:

@@ -823,11 +833,12 @@ This is a useful rule to remember for expressions involving exponentials.
Find the derivative of $\sin(x)\cos(2x)$ at $x=\pi$.

$$
\begin{align*}
[\sin(x)\cos(2x)]'\big|_{x=\pi} &=(\cos(x)\cos(2x) + \sin(x)(-\sin(2x)\cdot 2))\big|_{x=\pi} \\
& =((-1)(1) + (0)(-0)(2)) = -1.
\end{align*}
$$
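
A quick numeric confirmation of the value (a sketch):

```julia
fp(x) = cos(x) * cos(2x) - 2 * sin(x) * sin(2x)   # derivative of sin(x)cos(2x)
val = fp(pi)    # ≈ -1.0 up to floating point roundoff
```
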

##### Proof of the Chain Rule

@@ -844,23 +855,25 @@ g(a+h) = g(a) + g'(a)h + \epsilon_g(h) h = g(a) + h',

$$

where $h' = (g'(a) + \epsilon_g(h))h \rightarrow 0$ as $h \rightarrow 0$ will be used to simplify the following:

$$
\begin{align*}
f(g(a+h)) - f(g(a)) &=
f(g(a) + g'(a)h + \epsilon_g(h)h) - f(g(a)) \\
&= f(g(a)) + f'(g(a)) (g'(a)h + \epsilon_g(h)h) + \epsilon_f(h')(h') - f(g(a))\\
&= f'(g(a)) g'(a)h + f'(g(a))(\epsilon_g(h)h) + \epsilon_f(h')(h').
\end{align*}
$$

Rearranging:

$$
\begin{align*}
f(g(a+h)) &- f(g(a)) - f'(g(a)) g'(a) h\\
&= f'(g(a))\epsilon_g(h)h + \epsilon_f(h')(h')\\
&=(f'(g(a)) \epsilon_g(h) + \epsilon_f(h') (g'(a) + \epsilon_g(h)))h \\
&=\epsilon(h)h,
\end{align*}
$$

where $\epsilon(h)$ combines the above terms which go to zero as $h\rightarrow 0$ into one. This is the alternative definition of the derivative, showing $(f\circ g)'(a) = f'(g(a)) g'(a)$ when $g$ is differentiable at $a$ and $f$ is differentiable at $g(a)$.

@@ -871,17 +884,18 @@ where $\epsilon(h)$ combines the above terms which go to zero as $h\rightarrow 0
The chain rule name could also be simply the "composition rule," as that is the operation the rule works for. However, in practice, there are usually *multiple* compositions, and the "chain" rule is used to chain together the different pieces. To get a sense, consider a triple composition $u(v(w(x)))$. This will have derivative:

$$
\begin{align*}
[u(v(w(x)))]' &= u'(v(w(x))) \cdot [v(w(x))]' \\
&= u'(v(w(x))) \cdot v'(w(x)) \cdot w'(x)
\end{align*}
$$

The answer can be viewed as a repeated peeling off of the outer function, a view with immediate application to many compositions. To see that in action with an expression, consider this derivative problem, shown in steps:

$$
\begin{align*}
[\sin(e^{\cos(x^2-x)})]'
&= \cos(e^{\cos(x^2-x)}) \cdot [e^{\cos(x^2-x)}]'\\
&= \cos(e^{\cos(x^2-x)}) \cdot e^{\cos(x^2-x)} \cdot [\cos(x^2-x)]'\\
&= \cos(e^{\cos(x^2-x)}) \cdot e^{\cos(x^2-x)} \cdot (-\sin(x^2-x)) \cdot [x^2-x]'\\
&= \cos(e^{\cos(x^2-x)}) \cdot e^{\cos(x^2-x)} \cdot (-\sin(x^2-x)) \cdot (2x-1)
\end{align*}
$$
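
The fully peeled-off answer can be verified numerically (a sketch; the point is arbitrary):

```julia
f(x)  = sin(exp(cos(x^2 - x)))
fp(x) = cos(exp(cos(x^2 - x))) * exp(cos(x^2 - x)) * (-sin(x^2 - x)) * (2x - 1)
x, h = 0.3, 1e-6
agree = isapprox((f(x + h) - f(x - h)) / (2h), fp(x); atol=1e-6)  # true
```
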

##### More examples of differentiation

@@ -1004,7 +1019,7 @@ Find the derivative of $f(x) = x \cdot e^{-x^2}$.
Using the product rule and then the chain rule, we have:

$$
\begin{align*}
f'(x) &= [x \cdot e^{-x^2}]'\\
&= [x]' \cdot e^{-x^2} + x \cdot [e^{-x^2}]'\\
&= e^{-x^2} + x \cdot e^{-x^2} \cdot (-2x)\\
&= e^{-x^2} (1 - 2x^2).
\end{align*}
$$

---

@@ -1022,7 +1038,7 @@ Find the derivative of $f(x) = e^{-ax} \cdot \sin(x)$.
Using the product rule and then the chain rule, we have:

$$
\begin{align*}
f'(x) &= [e^{-ax} \cdot \sin(x)]'\\
&= [e^{-ax}]' \cdot \sin(x) + e^{-ax} \cdot [\sin(x)]'\\
&= e^{-ax} \cdot (-a) \cdot \sin(x) + e^{-ax} \cos(x)\\
&= e^{-ax}(\cos(x) - a\sin(x)).
\end{align*}
$$

---

@@ -1164,13 +1181,14 @@ Find the first $3$ derivatives of $f(x) = ax^3 + bx^2 + cx + d$.

Differentiating a polynomial is done with the sum rule; here we repeat three times:

$$
\begin{align*}
f(x) &= ax^3 + bx^2 + cx + d\\
f'(x) &= 3ax^2 + 2bx + c \\
f''(x) &= 3\cdot 2 a x + 2b \\
f'''(x) &= 6a
\end{align*}
$$
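
The last line can be checked with a third-order centered difference, which is exact for cubics up to roundoff (a sketch; the coefficients and step size are made up):

```julia
a, b, c, d = 1.0, 2.0, 3.0, 4.0
f(x) = a * x^3 + b * x^2 + c * x + d
x, h = 0.5, 0.01
third = (f(x + 2h) - 2 * f(x + h) + 2 * f(x - h) - f(x - 2h)) / (2 * h^3)
agree = isapprox(third, 6a; atol=1e-6)   # true: f'''(x) = 6a
```
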

We can see the fourth derivative – and all higher order ones – would be identically $0$. This is part of a general phenomenon: an $n$th degree polynomial has only $n$ non-zero derivatives.

@@ -1181,7 +1199,7 @@ We can see, the fourth derivative – and all higher order ones – would be ide
Find the first $5$ derivatives of $\sin(x)$.

$$
\begin{align*}
f(x) &= \sin(x) \\
f'(x) &= \cos(x) \\
f''(x) &= -\sin(x) \\
f'''(x) &= -\cos(x) \\
f^{(4)}(x) &= \sin(x) \\
f^{(5)}(x) &= \cos(x)
\end{align*}
$$

We see the derivatives repeat themselves. (We also see alternative notation for higher order derivatives.)

@@ -1616,13 +1635,14 @@ The right graph is of $g(x) = \exp(x)$ at $x=1$, the left graph of $f(x) = \sin(
Assuming the approximation gets better for $h$ close to $0$, as it visually does, the derivative at $1$ for $f(g(x))$ should be given by this limit:

$$
\begin{align*}
\frac{d(f\circ g)}{dx}\mid_{x=1}
&= \lim_{h\rightarrow 0} \frac{f(g(1) + g'(1)h)-f(g(1))}{h}\\
&= \lim_{h\rightarrow 0} \frac{f(g(1) + g'(1)h)-f(g(1))}{g'(1)h} \cdot g'(1)\\
&= f'(g(1)) \cdot g'(1).
\end{align*}
$$

What limit law, described below assuming all limits exist, allows the last equals sign?

BIN quarto/derivatives/figures/galileo-ramp.png (new binary file, 55 KiB; not shown)
@@ -36,16 +36,16 @@ Of course, we define *negative* in a parallel manner. The intermediate value th

Next,

::: {.callout-note icon=false}
## Strictly increasing

A function, $f$, is (strictly) **increasing** on an interval $I$ if for any $a < b$ in $I$ it must be that $f(a) < f(b)$.

The word *strictly* relates to the use of $<$, which precludes the possibility of the function being flat over some subinterval, something the $\leq$ inequality would allow.

A parallel definition with $a < b$ implying $f(a) > f(b)$ would be used for a *strictly decreasing* function.

:::

We can try to prove these properties for a function algebraically – we'll see both are related to the zeros of some function. However, before proceeding to that, it is usually helpful to get an idea of where the answer is using exploratory graphs.
@@ -160,13 +160,17 @@ This leaves the question:
This question can be answered by considering the first derivative.

::: {.callout-note icon=false}
## The first derivative test

If $c$ is a critical point for $f(x)$ and *if* $f'(x)$ changes sign at $x=c$, then $f(c)$ will be either a relative maximum or a relative minimum.

* $f$ will have a relative maximum at $c$ if the derivative changes sign from $+$ to $-$.
* $f$ will have a relative minimum at $c$ if the derivative changes sign from $-$ to $+$.

Further, if $f'(x)$ does *not* change sign at $c$, then $f$ will *not* have a relative maximum or minimum at $c$.

:::

The classification part should be clear: e.g., if the derivative is positive then negative, the function $f$ will increase to $(c,f(c))$ then decrease from $(c,f(c))$ – so $f$ will have a local maximum at $c$.

@@ -424,12 +428,15 @@ The graph attempts to illustrate that for this function the secant line between
This is a special property not shared by all functions. Let $I$ be an open interval.

::: {.callout-note icon=false}
## Concave up

A function $f(x)$ is concave up on $I$ if for any $a < b$ in $I$, the secant line between $a$ and $b$ lies above the graph of $f(x)$ over $[a,b]$.

A similar definition exists for *concave down* where the secant lines lie below the graph.

:::

Notationally, concave up says for any $x$ in $[a,b]$:

$$
@@ -491,14 +498,16 @@ sign_chart(h'', 0, 8)

Concave up functions are "opening" up, and often clearly $U$-shaped, though that is not necessary. At a relative minimum, where there is a $U$-shape, the graph will be concave up; conversely at a relative maximum, where the graph has a downward $\cap$-shape, the function will be concave down. This observation becomes:

::: {.callout-note icon=false}
## The second derivative test

If $c$ is a critical point of $f(x)$ with $f''$ existing in a neighborhood of $c$, then

* $f$ will have a relative maximum at the critical point $c$ if $f''(c) < 0$,
* $f$ will have a relative minimum at the critical point $c$ if $f''(c) > 0$, and
* *if* $f''(c) = 0$ the test is *inconclusive*.

:::

If $f''$ is positive in an interval about $c$, then the function is concave up at $x=c$. In turn, concave up implies the derivative is increasing, so the derivative must go from negative to positive at the critical point – a relative minimum.
@@ -52,15 +52,18 @@ Wouldn't that be nice? We could find difficult limits just by differentiating th
Well, in fact that is more or less true, a fact that dates back to [L'Hospital](http://en.wikipedia.org/wiki/L%27H%C3%B4pital%27s_rule) - who wrote the first textbook on differential calculus - though this result is likely due to one of the Bernoulli brothers.

::: {.callout-note icon=false}
## L'Hospital's rule

Suppose:

* that $\lim_{x\rightarrow c+} f(x) =0$ and $\lim_{x\rightarrow c+} g(x) =0$,
* that $f$ and $g$ are differentiable in $(c,b)$, and
* that $g'(x)$ exists and is non-zero for *all* $x$ in $(c,b)$,

then **if** the following limit exists: $\lim_{x\rightarrow c+}f'(x)/g'(x)=L$ it follows that $\lim_{x \rightarrow c+}f(x)/g(x) = L$.

:::

That is *if* the right limit of $f(x)/g(x)$ is indeterminate of the form $0/0$, but the right limit of $f'(x)/g'(x)$ is known, possibly by simple continuity, then the right limit of $f(x)/g(x)$ exists and is equal to that of $f'(x)/g'(x)$.

@@ -308,23 +311,25 @@ L'Hospital's rule generalizes to other indeterminate forms, in particular the in
The value $c$ in the limit can also be infinite. Consider this case with $c=\infty$:

$$
\begin{align*}
\lim_{x \rightarrow \infty} \frac{f(x)}{g(x)} &=
\lim_{x \rightarrow 0} \frac{f(1/x)}{g(1/x)}
\end{align*}
$$

L'Hospital's rule applies as $x \rightarrow 0$, so we differentiate to get:

$$
\begin{align*}
\lim_{x \rightarrow 0} \frac{[f(1/x)]'}{[g(1/x)]'}
&= \lim_{x \rightarrow 0} \frac{f'(1/x)\cdot(-1/x^2)}{g'(1/x)\cdot(-1/x^2)}\\
&= \lim_{x \rightarrow 0} \frac{f'(1/x)}{g'(1/x)}\\
&= \lim_{x \rightarrow \infty} \frac{f'(x)}{g'(x)},
\end{align*}
$$

*assuming* the latter limit exists, L'Hospital's rule assures the equality

@@ -415,11 +420,12 @@ Be just saw that $\lim_{x \rightarrow 0+}\log(x)/(1/x) = 0$. So by the rules for

A limit $\lim_{x \rightarrow c} f(x) - g(x)$ of indeterminate form $\infty - \infty$ can be reexpressed to be of the form $0/0$ through the transformation:

$$
\begin{align*}
f(x) - g(x) &= f(x)g(x) \cdot (\frac{1}{g(x)} - \frac{1}{f(x)}) \\
&= \frac{\frac{1}{g(x)} - \frac{1}{f(x)}}{\frac{1}{f(x)g(x)}}.
\end{align*}
$$

Applying this to

@@ -438,7 +444,7 @@ $$
\lim_{x \rightarrow 1} \frac{x\log(x)-(x-1)}{(x-1)\log(x)}
$$
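
Before reaching for `SymPy`, a quick numeric probe (a sketch; the point near $1$ is arbitrary) suggests the value $1/2$:

```julia
g(x) = (x * log(x) - (x - 1)) / ((x - 1) * log(x))
g(1 + 1e-5)    # ≈ 0.5
```
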

In `SymPy` we have:
```{julia}
@@ -510,21 +510,23 @@ $$
Suppose $f(x)$ and $g(x)$ are represented by their tangent lines about $c$, respectively:

$$
\begin{align*}
f(x) &= f(c) + f'(c)(x-c) + \mathcal{O}((x-c)^2), \\
g(x) &= g(c) + g'(c)(x-c) + \mathcal{O}((x-c)^2).
\end{align*}
$$

Consider the sum; after rearranging we have:

$$
\begin{align*}
f(x) + g(x) &= \left(f(c) + f'(c)(x-c) + \mathcal{O}((x-c)^2)\right) + \left(g(c) + g'(c)(x-c) + \mathcal{O}((x-c)^2)\right)\\
&= \left(f(c) + g(c)\right) + \left(f'(c)+g'(c)\right)(x-c) + \mathcal{O}((x-c)^2).
\end{align*}
$$

The two big "Oh" terms become just one, as the sum of a constant times $(x-c)^2$ plus a constant times $(x-c)^2$ is just some other constant times $(x-c)^2$. What we can read off from this is the term multiplying $(x-c)$ is just the derivative of $f(x) + g(x)$ (from the sum rule), so this too is a tangent line approximation.

@@ -533,7 +535,7 @@ The two big "Oh" terms become just one as the sum of a constant times $(x-c)^2$

Is it a coincidence that a basic algebraic operation with tangent line approximations produces a tangent line approximation? Let's try multiplication:

$$
\begin{align*}
f(x) \cdot g(x) &= [f(c) + f'(c)(x-c) + \mathcal{O}((x-c)^2)] \cdot [g(c) + g'(c)(x-c) + \mathcal{O}((x-c)^2)]\\
&=[f(c) + f'(c)(x-c)] \cdot [g(c) + g'(c)(x-c)] + (f(c) + f'(c)(x-c)) \cdot \mathcal{O}((x-c)^2) + (g(c) + g'(c)(x-c)) \cdot \mathcal{O}((x-c)^2) + [\mathcal{O}((x-c)^2)]^2\\
&= f(c) \cdot g(c) + [f'(c)\cdot g(c) + f(c)\cdot g'(c)] \cdot (x-c) + [f'(c)\cdot g'(c) \cdot (x-c)^2 + \mathcal{O}((x-c)^2)] \\
&= f(c) \cdot g(c) + [f'(c)\cdot g(c) + f(c)\cdot g'(c)] \cdot (x-c) + \mathcal{O}((x-c)^2)
\end{align*}
$$

The big "oh" notation just sweeps up many things, including any products of it *and* the term $f'(c)\cdot g'(c) \cdot (x-c)^2$. Again, we see from the product rule that this is just a tangent line approximation for $f(x) \cdot g(x)$.

@@ -803,13 +806,14 @@ numericq(abs(answ))

The [Birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) computes the probability that in a group of $n$ people, under some assumptions, no two share a birthday. Without trying to spoil the problem, we focus on the calculus specific part of the problem below:

$$
\begin{align*}
p
&= \frac{365 \cdot 364 \cdot \cdots (365-n+1)}{365^n} \\
&= \frac{365(1 - 0/365) \cdot 365(1 - 1/365) \cdot 365(1-2/365) \cdot \cdots \cdot 365(1-(n-1)/365)}{365^n}\\
&= (1 - \frac{0}{365})\cdot(1 -\frac{1}{365})\cdot \cdots \cdot (1-\frac{n-1}{365}).
\end{align*}
$$
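
The last line translates directly into code (a sketch; the function name `p` is ours), and evaluating it recovers the familiar fact about $n = 23$:

```julia
# probability that n people all have distinct birthdays
p(n) = prod(1 - k/365 for k in 0:n-1)
p(23)   # ≈ 0.4927: with 23 people, a shared birthday is more likely than not
```
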

Taking logarithms, we have $\log(p)$ is

@@ -92,9 +92,12 @@ Lest you think that continuous functions always have derivatives except perhaps
We have defined an *absolute maximum* of $f(x)$ over an interval to be a value $f(c)$ for a point $c$ in the interval that is as large as any other value in the interval. Just specifying a function and an interval does not guarantee an absolute maximum, but specifying a *continuous* function and a *closed* interval does, by the extreme value theorem.

::: {.callout-note icon=false}
## A relative maximum

We say $f(x)$ has a *relative maximum* at $c$ if there exists *some* interval $I=(a,b)$ with $a < c < b$ for which $f(c)$ is an absolute maximum for $f$ and $I$.

:::

The difference is a bit subtle: for an absolute maximum the interval must also be specified; for a relative maximum there just needs to exist some interval, possibly really small, though it must be bigger than a point.

@@ -139,12 +142,16 @@ For a continuous function $f(x)$, call a point $c$ in the domain of $f$ where ei
We can combine Bolzano's extreme value theorem with Fermat's insight to get the following:

::: {.callout-note icon=false}
## Absolute maxima characterization

A continuous function on $[a,b]$ has an absolute maximum that occurs at a critical point $c$, $a < c < b$, or an endpoint, $a$ or $b$.

A similar statement holds for an absolute minimum.

:::
|
||||
|
||||
|
||||
|
||||
A similar statement holds for an absolute minimum. This gives a restricted set of places to look for absolute maximum and minimum values - all the critical points and the endpoints.
|
||||
The above gives a restricted set of places to look for absolute maximum and minimum values - all the critical points and the endpoints.
|
||||
|
||||
|
||||
It is also the case that all relative extrema occur at a critical point, *however* not all critical points correspond to relative extrema. We will see *derivative tests* that help characterize when that occurs.
|
||||
@@ -263,10 +270,12 @@ Here the maximum occurs at an endpoint. The critical point $c=0.67\dots$ does no

Let $f(x)$ be differentiable on $(a,b)$ and continuous on $[a,b]$. Then the absolute maximum occurs at an endpoint or where the derivative is $0$ (as the derivative is always defined). This gives rise to:

::: {.callout-note icon=false}
## [Rolle's](http://en.wikipedia.org/wiki/Rolle%27s_theorem) theorem

For $f$ differentiable on $(a,b)$ and continuous on $[a,b]$, if $f(a)=f(b)$, then there exists some $c$ in $(a,b)$ with $f'(c) = 0$.
:::

This modest observation opens the door to many relationships between a function and its derivative, as it ties the two together in one statement.

@@ -311,10 +320,12 @@ We are driving south and in one hour cover 70 miles. If the speed limit is 65 mi

The mean value theorem is a direct generalization of Rolle's theorem.

::: {.callout-note icon=false}
## Mean value theorem

Let $f(x)$ be differentiable on $(a,b)$ and continuous on $[a,b]$. Then there exists a value $c$ in $(a,b)$ where $f'(c) = (f(b) - f(a)) / (b - a)$.
:::

This says for any secant line between $a < b$ there will be a parallel tangent line at some $c$ with $a < c < b$ (all provided $f$ is differentiable on $(a,b)$ and continuous on $[a,b]$).

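Such a $c$ can be located numerically. The following sketch (in Python; the function $f(x) = x^3$ on $[0,2]$ is an assumed example) solves $f'(c) = (f(b)-f(a))/(b-a)$ with bisection:

```python
# Sketch: numerically locate a point c guaranteed by the mean value theorem.
# f(x) = x^3 on [0, 2] is an assumed example; its derivative is known exactly.
def bisection(g, a, b, tol=1e-12):
    # simple bisection for a sign change of g over [a, b]
    ga = g(a)
    while b - a > tol:
        m = (a + b) / 2
        if ga * g(m) <= 0:
            b = m
        else:
            a, ga = m, g(m)
    return (a + b) / 2

f  = lambda x: x**3
fp = lambda x: 3 * x**2
a, b = 0, 2
slope = (f(b) - f(a)) / (b - a)               # secant slope over [a, b]: 4.0
c = bisection(lambda x: fp(x) - slope, a, b)  # tangent at c parallels the secant
```

Here `c` comes out near $\sqrt{4/3} \approx 1.1547$, a point strictly inside the interval, as the theorem promises.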
@@ -425,13 +436,20 @@ Suppose it is known that $f'(x)=0$ on some interval $I$ and we take any $a < b$
### The Cauchy mean value theorem

[Cauchy](http://en.wikipedia.org/wiki/Mean_value_theorem#Cauchy.27s_mean_value_theorem) offered an extension to the mean value theorem above.

::: {.callout-note icon=false}
## Cauchy mean value theorem

Suppose both $f$ and $g$ satisfy the conditions of the mean value theorem on $[a,b]$ with $g(b)-g(a) \neq 0$. Then there exists at least one $c$ with $a < c < b$ such that

$$
f'(c) = g'(c) \cdot \frac{f(b) - f(a)}{g(b) - g(a)}.
$$
:::

The proof follows by considering $h(x) = f(x) - r\cdot g(x)$, with $r$ chosen so that $h(a)=h(b)$. Then Rolle's theorem applies, so there is a $c$ with $h'(c)=0$; hence $f'(c) = r g'(c)$, and $r$ can be seen to be $(f(b)-f(a))/(g(b)-g(a))$, which proves the theorem.

@@ -127,12 +127,13 @@ Though the derivative is related to the slope of the secant line, that is in the

Let $\epsilon_{n+1} = x_{n+1}-\alpha$, where $\alpha$ is assumed to be the *simple* zero of $f(x)$ that the secant method converges to. A [calculation](https://math.okstate.edu/people/binegar/4513-F98/4513-l08.pdf) shows that

$$
\begin{align*}
\epsilon_{n+1} &\approx \frac{x_n-x_{n-1}}{f(x_n)-f(x_{n-1})} \frac{(1/2)f''(\alpha)(\epsilon_n-\epsilon_{n-1})}{x_n-x_{n-1}} \epsilon_n \epsilon_{n-1}\\
& \approx \frac{f''(\alpha)}{2f'(\alpha)} \epsilon_n \epsilon_{n-1}\\
&= C \epsilon_n \epsilon_{n-1}.
\end{align*}
$$

The constant `C` is similar to that for Newton's method, and reveals potential troubles for the secant method similar to those of Newton's method: a poor initial guess (the initial error is too big), a second derivative that is too large, or a first derivative too flat near the answer.
@@ -185,7 +186,7 @@ Here we use `SymPy` to identify the degree-$2$ polynomial as a function of $y$,

@syms y hs[0:2] xs[0:2] fs[0:2]
H(y) = sum(hᵢ*(y - fs[end])^i for (hᵢ,i) ∈ zip(hs, 0:2))

eqs = tuple((H(fᵢ) ~ xᵢ for (xᵢ, fᵢ) ∈ zip(xs, fs))...)
ϕ = solve(eqs, hs)
hy = subs(H(y), ϕ)
```
@@ -279,41 +280,6 @@ We can see it in action on the sine function. Here we pass in $\lambda$, but i

chandrapatla(sin, 3, 4, λ3, verbose=true)
```

```{julia}
#| output: false

#=
The condition `Φ^2 < ξ < 1 - (1-Φ)^2` can be visualized. Assume `a,b=0,1` and `fa,fb=-1/2,1`. Then `c < a < b`, and `fc` has the same sign as `fa`, but what values of `fc` will satisfy the inequality?

XX```{julia}
ξ(c,fc) = (a-b)/(c-b)
Φ(c,fc) = (fa-fb)/(fc-fb)
Φl(c,fc) = Φ(c,fc)^2
Φr(c,fc) = 1 - (1-Φ(c,fc))^2
a,b = 0, 1
fa,fb = -1/2, 1
region = Lt(Φl, ξ) & Lt(ξ,Φr)
plot(region, xlims=(-2,a), ylims=(-3,0))
XX```

When `(c,fc)` is in the shaded area, the inverse quadratic step is chosen. We can see that `fc < fa` is needed.

For these values, this area is within the area where an inverse quadratic step will result in a value between `a` and `b`:

XX```{julia}
l(c,fc) = λ3(fa,fb,fc,a,b,c)
region₃ = ImplicitEquations.Lt(l,b) & ImplicitEquations.Gt(l,a)
plot(region₃, xlims=(-2,0), ylims=(-3,0))
XX```

There are values in the parameter space where this does not occur.
=#
nothing
```

## Tolerances
@@ -426,6 +392,96 @@ So a modified criteria for convergence might look like:

It is not uncommon to assign `rtol` a value like `sqrt(eps())` to account for accumulated floating point errors and the factor of $f'(\alpha)$, though in the `Roots` package it is set smaller by default.

### Conditioning and stability

In Part III of @doi:10.1137/1.9781611977165 we find the language of numerical analysis useful to formally describe the zero-finding problem. Key concepts are errors, conditioning, and stability. These give some theoretical justification for the tolerances above.

Abstractly a *problem* is a mapping, $F$, from a domain $X$ of data to a range $Y$ of solutions. Both $X$ and $Y$ have a sense of distance given by a *norm*. A norm is a generalization of the absolute value and gives quantitative meaning to terms like small and large.

> A *well-conditioned* problem is one with the property that all small perturbations of $x$ lead to only small changes in $F(x)$.

This sense of "small" is measured through a *condition number*.

If we let $\delta_x$ be a small perturbation of $x$ then $\delta_F = F(x + \delta_x) - F(x)$.

The *forward error* is $\lVert\delta_F\rVert = \lVert F(x+\delta_x) - F(x)\rVert$; the *relative forward error* is $\lVert\delta_F\rVert/\lVert F\rVert = \lVert F(x+\delta_x) - F(x)\rVert/ \lVert F(x)\rVert$.

The *backward error* is $\lVert\delta_x\rVert$; the *relative backward error* is $\lVert\delta_x\rVert / \lVert x\rVert$.

The *absolute condition number* $\hat{\kappa}$ is the worst case of the ratio $\lVert\delta_F\rVert/ \lVert\delta_x\rVert$ as the perturbation size shrinks to $0$.
The *relative condition number* $\kappa$ divides $\lVert\delta_F\rVert$ by $\lVert F(x)\rVert$ and $\lVert\delta_x\rVert$ by $\lVert x\rVert$ before taking the ratio.

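The relative condition number can be estimated by trying shrinking perturbations directly. A sketch (in Python; $F(x) = \sqrt{x}$ is an assumed example, for which $\kappa = |x F'(x)/F(x)| = 1/2$):

```python
# Sketch: estimate the relative condition number of F at x by taking the
# worst observed ratio (relative forward error) / (relative backward error)
# over a few shrinking perturbations. F = sqrt is an assumed example with
# relative condition number |x F'(x) / F(x)| = 1/2.
from math import sqrt

def relative_condition(F, x, hs=(1e-3, 1e-4, 1e-5, 1e-6)):
    ratios = []
    for h in hs:
        dx = h * x                        # small perturbation of the data x
        dF = F(x + dx) - F(x)             # induced change in the solution
        ratios.append(abs(dF / F(x)) / abs(dx / x))
    return max(ratios)

kappa = relative_condition(sqrt, 2.0)     # ≈ 0.5
```

For differentiable $F$ this converges to $|x F'(x)/F(x)|$ as the perturbations shrink.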
A *problem* is a mathematical concept; an *algorithm* is its computational counterpart. Algorithms may differ for many reasons, such as floating point errors, tolerances, etc. We use the notation $\tilde{F}$ to indicate the algorithm.

The absolute error in the algorithm is $\lVert\tilde{F}(x) - F(x)\rVert$; the relative error divides by $\lVert F(x)\rVert$. A good algorithm has small relative errors.

An algorithm is called *stable* if

$$
\frac{\lVert\tilde{F}(x) - F(\tilde{x})\rVert}{\lVert F(\tilde{x})\rVert}
$$

is *small* for *some* $\tilde{x}$ with $\lVert\tilde{x}-x\rVert/\lVert x\rVert$ small.

> A *stable* algorithm gives nearly the right answer to nearly the right question.

(The answer it gives is $\tilde{F}(x)$; the nearly right question: what is $F(\tilde{x})$?)

A related concept: an algorithm $\tilde{F}$ for a problem $F$ is *backward stable* if for each $x \in X$,

$$
\tilde{F}(x) = F(\tilde{x})
$$

for some $\tilde{x}$ with $\lVert\tilde{x} - x\rVert/\lVert x\rVert$ small.

> "A backward stable algorithm gives exactly the right answer to nearly the right question."

The concepts are related by Trefethen and Bau's Theorem 15.1, which says for a backward stable algorithm the relative error $\lVert\tilde{F}(x) - F(x)\rVert/\lVert F(x)\rVert$ is small in a manner proportional to the relative condition number.

Applying this to zero-finding, we follow @doi:10.1137/1.9781611975086.

To be specific, the problem, $F$, is finding a zero of a function $f$ starting at an initial point $x_0$. The data is $(f, x_0)$; the solution is $r$, a zero of $f$.

Take the algorithm as Newton's method. Any implementation must incorporate tolerances, so this is a computational approximation to the problem. The data is the same, but technically we use $\tilde{f}$ for the function, as any computation is dependent on machine implementations. The output is $\tilde{r}$, an *approximate* zero.

Suppose, for the sake of argument, that $\tilde{f}(x) = f(x) + \epsilon$, that $f$ has a continuous derivative, and that $r$ is a root of $f$ while $\tilde{r} = r + \delta$ is a root of $\tilde{f}$. Then by linearization:

$$
\begin{align*}
0 &= \tilde{f}(\tilde r) \\
&= f(r + \delta) + \epsilon\\
&\approx f(r) + f'(r)\delta + \epsilon\\
&= 0 + f'(r)\delta + \epsilon
\end{align*}
$$

Rearranging gives $|\delta|/|\epsilon| \approx 1/|f'(r)|$. But this ratio of perturbation in the answer to perturbation in the data is the condition number:

> The absolute condition number is $\hat{\kappa}_r = |f'(r)|^{-1}$.

The error formula in Newton's method measuring the distance between the actual root and an approximation includes the derivative in the denominator, so we see large condition numbers are tied to possibly larger errors.

Now consider $g(x) = f(x) - f(\tilde{r})$. Call $f(\tilde{r})$ the *residual*. The function $g$ is near $f$ when the residual is small. The algorithm solves $(g, x_0)$ exactly with $\tilde{r}$, so with a small residual an exact solution to an approximate question has been found. Driscoll and Braun state

> The backward error in a root estimate is equal to the residual.

Practically, these two observations lead to:

* If there is a large condition number, it may not be possible to find an approximate root near the real root.

* A tolerance in an algorithm should consider both the size of $x_{n} - x_{n-1}$ and the residual $f(x_n)$.

For the first observation, the example of Wilkinson's polynomial is often used, where $f(x) = (x-1)\cdot(x-2)\cdot \cdots\cdot(x-20)$. When expanded, the coefficients of this polynomial cannot all be represented exactly in typical floating point values; the condition number is large and some of the roots found are quite different from the mathematical values.

The second observation follows from $f(x_n)$ monitoring the backward error and the product of the condition number and the backward error monitoring the forward error. This product is on the order of $|f(x_n)/f'(x_n)|$ or $|x_{n+1} - x_n|$.

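The Wilkinson sensitivity can be quantified with the linearization above: perturbing the coefficient of $x^{19}$ by $\delta$ moves the root $r = 20$ by roughly $\delta \cdot r^{19}/|f'(r)|$, and $f'(20) = 19!$ for this product form. A sketch (in Python; the perturbation size $2^{-23}$ is an assumed illustration):

```python
# Sketch: first-order sensitivity of Wilkinson's root r = 20 to a change in
# the x^19 coefficient. A coefficient perturbation delta shifts the root by
# about delta * r^19 / |f'(r)|, and f'(20) = 19! here.
from math import factorial, prod

r, k = 20, 19
fprime = prod(r - i for i in range(1, 21) if i != r)   # f'(20) = 19!
sensitivity = r**k / abs(fprime)                       # ≈ 4.3e7
delta = 2.0**-23                                       # a single-precision-ulp-sized nudge
shift = delta * sensitivity                            # ≈ 5: the root moves by units!
```

A coefficient change on the order of $10^{-7}$ moves the root by about $5$, which is why the expanded form behaves so badly in floating point.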
## Questions
@@ -178,15 +178,18 @@ x4, f(x4), f(x3)

We see now that $f(x_4)$ is within machine tolerance of $0$, so we call $x_4$ an *approximate zero* of $f(x)$.

::: {.callout-note icon=false}
## Newton's method

Let $x_0$ be an initial guess for a zero of $f(x)$. Iteratively define $x_{i+1}$ in terms of the just generated $x_i$ by:

$$
x_{i+1} = x_i - f(x_i) / f'(x_i).
$$

Then for reasonable functions and reasonable initial guesses, the sequence of points converges to a zero of $f$.
:::

On the computer, we know that actual convergence will likely never occur, but accuracy to a certain tolerance can often be achieved.

@@ -206,7 +209,12 @@ In practice, the algorithm is implemented not by repeating the update step a fix

:::{.callout-note}
## Note
Newton looked at this same example in 1699 (B.T. Polyak, *Newton's method and its use in optimization*, European Journal of Operational Research. 02/2007; 181(3):1086-1096; and Deuflhard, *Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms*) though his technique was slightly different, as he did not use the derivative, *per se*, but rather an approximation based on the fact that his function was a polynomial.

We can read that he guessed the answer was ``2 + p``, as there is a sign change between $2$ and $3$. Newton put this guess into the polynomial to get, after simplification, ``p^3 + 6p^2 + 10p - 1``. This has an **approximate** zero found by solving the linear part ``10p - 1 = 0``. Taking ``p = 0.1`` he then can say the answer looks like ``2 + p + q`` and repeat to get ``q^3 + 6.3\cdot q^2 + 11.23 \cdot q + 0.061 = 0``. Again taking just the linear part gives the estimate `q = -0.005431...`. After two steps the estimate is `2.094568...`. This can be continued by expressing the answer as ``2 + p + q + r`` and then solving for an estimate for ``r``.

Raphson (1690) proposed a simplification avoiding the computation of new polynomials, hence the usual name of the Newton-Raphson method. Simpson introduced derivatives into the formulation and systems of equations.
:::

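For a polynomial, Newton's successive linearizations agree with Newton-Raphson steps using the derivative. A sketch (in Python; the polynomial $x^3 - 2x - 5$ is recovered from the expansion ``p^3 + 6p^2 + 10p - 1`` above, the rest is illustration) reproduces ``p`` and ``q``:

```python
# Sketch: Newton's 1699 increments for x^3 - 2x - 5 (the polynomial whose
# shift by 2 gives p^3 + 6p^2 + 10p - 1) coincide with Newton-Raphson steps,
# since his linearization of a polynomial equals using f'.
f  = lambda x: x**3 - 2*x - 5
fp = lambda x: 3*x**2 - 2

x = 2.0
p = -f(x) / fp(x)          # solve the linear part 10p - 1 = 0, so p = 0.1
x += p
q = -f(x) / fp(x)          # q = -0.061/11.23 ≈ -0.005431
x += q
# x ≈ 2.094568..., already close to the root 2.0945514...
```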
@@ -392,6 +400,24 @@ x, f(x)

To machine tolerance the answer is a zero, even though the exact answer is irrational and all finite floating point values can be represented as rational numbers.

##### Example non-polynomial

The first example by Newton of applying the method to a non-polynomial function was solving an equation from astronomy: $x - e \sin(x) = M$, where $x$ is the eccentric anomaly, $e$ the eccentricity, and $M$ the mean anomaly. Newton used polynomial approximations for the trigonometric functions; here we can solve directly.

Let $e = 1/2$ and $M = 3/4$. With $f(x) = x - e\sin(x) - M$ then $f'(x) = 1 - e\cos(x)$. Starting at $1$, Newton's method for $3$ steps becomes:

```{julia}
ec, M = 0.5, 0.75
f(x) = x - ec * sin(x) - M
fp(x) = 1 - ec * cos(x)
x = 1
x = x - f(x) / fp(x)
x = x - f(x) / fp(x)
x = x - f(x) / fp(x)
x, f(x)
```

##### Example
@@ -429,7 +455,6 @@ end

So it takes $8$ steps to get an increment that small and about `10` steps to get to full convergence.

##### Example division as multiplication

@@ -686,7 +711,7 @@ $$

For this value, we have

$$
\begin{align*}
x_{i+1} - \alpha
&= \left(x_i - \frac{f(x_i)}{f'(x_i)}\right) - \alpha\\
@@ -696,6 +721,7 @@ x_{i+1} - \alpha
\right)\\
&= \frac{1}{2}\frac{f''(\xi)}{f'(x_i)} \cdot(x_i - \alpha)^2.
\end{align*}
$$

That is
@@ -302,20 +302,20 @@ We could also do the above problem symbolically with the aid of `SymPy`. Here ar

```{julia}
@syms w₀::real h₀::real

A₀ = w₀ * h₀ + pi * (w₀/2)^2 / 2
Perim = 2*h₀ + w₀ + pi * w₀/2
h₁ = solve(Perim - 20, h₀)[1]
A₁ = A₀(h₀ => h₁)
w₁ = solve(diff(A₁,w₀), w₀)[1]
```

We know that `w₁` is the maximum in this example from our previous work. We shall see soon that just knowing that the second derivative is negative at `w₁` would suffice to know this. Here we check that condition:

```{julia}
diff(A₁, w₀, w₀)(w₀ => w₁)
```

As an aside, compare the steps involved above for a symbolic solution to those of previous work for a numeric solution:
@@ -614,7 +614,7 @@ We see two terms: one with $x=L$ and another quadratic. For the simple case $r_0

```{julia}
solve(q(r1=>r0) ~ 0, x)
```

Well, not so fast. We need to check the other endpoint, $x=0$:

@@ -632,7 +632,7 @@ Now, if, say, travel above the line is half as slow as travel along, then $2r_0

```{julia}
out = solve(q(r1 => 2r0) ~ 0, x)
```

It is hard to tell which would minimize time without more work. To check a case ($a=1, L=2, r_0=1$) we might have

@@ -1372,11 +1372,12 @@ solve(x/b ~ (x+a)/(b + b*p), x)

With $x = a/p$ we get by the Pythagorean theorem that

$$
\begin{align*}
c^2 &= (a + a/p)^2 + (b + bp)^2 \\
&= a^2(1 + \frac{1}{p})^2 + b^2(1+p)^2.
\end{align*}
$$

The ladder problem minimizes $c$ or equivalently $c^2$.

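The minimization can be sketched numerically (in Python; the values of $a$ and $b$ are assumed for illustration) and compared with the classical closed-form answer $(a^{2/3} + b^{2/3})^{3/2}$:

```python
# Sketch: minimize c^2(p) = a^2 (1 + 1/p)^2 + b^2 (1 + p)^2 over p > 0
# by ternary search (c^2 is unimodal there), then compare with the classical
# closed form c_min = (a^(2/3) + b^(2/3))^(3/2). a, b are assumed values.
a, b = 1.0, 8.0

def c2(p):
    return a**2 * (1 + 1/p)**2 + b**2 * (1 + p)**2

lo, hi = 1e-3, 1e3                     # bracketing interval for the minimizer
for _ in range(200):
    m1, m2 = lo + (hi - lo)/3, hi - (hi - lo)/3
    if c2(m1) < c2(m2):
        hi = m2
    else:
        lo = m1
p_star = (lo + hi) / 2                 # minimizer: (a/b)^(2/3) = 1/4 here
c_min = c2(p_star)**0.5
closed_form = (a**(2/3) + b**(2/3))**1.5
```

Setting the derivative of $c^2(p)$ to zero gives $p^3 = a^2/b^2$, which is what the search recovers.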
@@ -115,7 +115,7 @@ The term "best" is deserved, as any other straight line will differ at least in

(This is a consequence of Cauchy's mean value theorem with $F(c) = f(c) - f'(c)\cdot(c-x)$ and $G(c) = (c-x)^2$

$$
\begin{align*}
\frac{F'(\xi)}{G'(\xi)} &=
\frac{f'(\xi) - f''(\xi)(\xi-x) - f'(\xi)\cdot 1}{2(\xi-x)} \\
@@ -124,6 +124,7 @@ The term "best" is deserved, as any other straight line will differ at least in
&= \frac{f(c) - f'(c)(c-x) - (f(x) - f'(x)(x-x))}{(c-x)^2 - (x-x)^2} \\
&= \frac{f(c) + f'(c)(x-c) - f(x)}{(x-c)^2}
\end{align*}
$$

That is, $f(x) = f(c) + f'(c)(x-c) + f''(\xi)/2\cdot(x-c)^2$, or $f(x)-tl(x)$ is as described.)
@@ -154,13 +155,14 @@ As in the linear case, there is flexibility in the exact points chosen for the i

Now, we take a small detour to define some notation. Instead of writing our two points as $c$ and $c+h$, we use $x_0$ and $x_1$. For any set of points $x_0, x_1, \dots, x_n$, define the **divided differences** of $f$ inductively, as follows:

$$
\begin{align*}
f[x_0] &= f(x_0) \\
f[x_0, x_1] &= \frac{f[x_1] - f[x_0]}{x_1 - x_0}\\
\cdots &\\
f[x_0, x_1, x_2, \dots, x_n] &= \frac{f[x_1, \dots, x_n] - f[x_0, x_1, x_2, \dots, x_{n-1}]}{x_n - x_0}.
\end{align*}
$$

We see the first two values look familiar, and to generate more we just take certain ratios akin to those formed when finding a secant line.

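The inductive definition translates directly into code. A small sketch (in Python; $f(x) = x^3$ and the sample points are assumptions for illustration):

```python
# Sketch: the inductive definition of divided differences computed by direct
# recursion (inefficient but transparent). f and the points are sample inputs.
def divided_difference(f, xs):
    if len(xs) == 1:
        return f(xs[0])                # f[x_0] = f(x_0)
    return (divided_difference(f, xs[1:]) - divided_difference(f, xs[:-1])) / (xs[-1] - xs[0])

f = lambda x: x**3
dd0 = divided_difference(f, [1])           # f[x_0] = f(1) = 1
dd1 = divided_difference(f, [1, 2])        # a secant slope: (8 - 1)/(2 - 1) = 7
dd2 = divided_difference(f, [1, 2, 4])     # (f[x_1,x_2] - f[x_0,x_1])/(x_2 - x_0) = 7
```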
@@ -252,11 +254,12 @@ A proof based on Rolle's theorem appears in the appendix.

Why the fuss? The answer comes from a result of Newton on *interpolating* polynomials. Consider a function $f$ and $n+1$ points $x_0$, $x_1, \dots, x_n$. Then an interpolating polynomial is a polynomial of least degree that goes through each point $(x_i, f(x_i))$. The [Newton form](https://en.wikipedia.org/wiki/Newton_polynomial) of such a polynomial can be written as:

$$
\begin{align*}
f[x_0] &+ f[x_0,x_1] \cdot (x-x_0) + f[x_0, x_1, x_2] \cdot (x-x_0) \cdot (x-x_1) + \\
& \cdots + f[x_0, x_1, \dots, x_n] \cdot (x-x_0)\cdot \cdots \cdot (x-x_{n-1}).
\end{align*}
$$

The case $n=0$ gives the value $f[x_0] = f(c)$, which can be interpreted as the slope-$0$ line that goes through the point $(c,f(c))$.
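The Newton form is convenient computationally: the coefficients come from a divided-difference table and the polynomial evaluates by nested multiplication. A sketch (in Python; the sample data for $f(x) = x^2$ is an assumption):

```python
# Sketch: build the divided-difference coefficients for the Newton form and
# evaluate it Horner-style. The data points sample f(x) = x^2 for illustration.
def divided_differences(xs, ys):
    # coeffs[k] is f[x_0, ..., x_k]
    coeffs, col = [ys[0]], list(ys)
    for k in range(1, len(xs)):
        col = [(col[i+1] - col[i]) / (xs[i+k] - xs[i]) for i in range(len(col) - 1)]
        coeffs.append(col[0])
    return coeffs

def newton_eval(xs, coeffs, x):
    # nested evaluation: c0 + (x - x0)*(c1 + (x - x1)*(c2 + ...))
    total = coeffs[-1]
    for c, xi in zip(coeffs[-2::-1], xs[-2::-1]):
        total = c + (x - xi) * total
    return total

xs, ys = [0, 1, 3], [0, 1, 9]          # samples of f(x) = x^2
cs = divided_differences(xs, ys)
val = newton_eval(xs, cs, 2.0)         # the interpolant reproduces x^2: 4.0
```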
@@ -485,16 +488,19 @@ On inspection, it is seen that this is Newton's method applied to $f'(x)$. This

Starting with the Newton form of the interpolating polynomial of smallest degree:

$$
\begin{align*}
f[x_0] &+ f[x_0,x_1] \cdot (x - x_0) + f[x_0, x_1, x_2] \cdot (x - x_0)\cdot(x-x_1) + \\
& \cdots + f[x_0, x_1, \dots, x_n] \cdot (x-x_0) \cdot \cdots \cdot (x-x_{n-1}).
\end{align*}
$$

and taking $x_i = c + i\cdot h$, for a given $n$, we have in the limit as $h > 0$ goes to zero that the coefficients of this polynomial converge to the coefficients of the *Taylor polynomial of degree $n$*:

$$
f(c) + f'(c)\cdot(x-c) + \frac{f''(c)}{2!}(x-c)^2 + \cdots + \frac{f^{(n)}(c)}{n!} (x-c)^n.
$$
@@ -850,23 +856,25 @@ The actual code is different, as the Taylor polynomial isn't used. The Taylor p

For notational purposes, let $g(x)$ be the inverse function for $f(x)$. Assume *both* functions have a Taylor polynomial expansion:

$$
\begin{align*}
f(x_0 + \Delta_x) &= f(x_0) + a_1 \Delta_x + a_2 (\Delta_x)^2 + \cdots + a_n (\Delta_x)^n + \dots\\
g(y_0 + \Delta_y) &= g(y_0) + b_1 \Delta_y + b_2 (\Delta_y)^2 + \cdots + b_n (\Delta_y)^n + \dots
\end{align*}
$$

Then, using $x = g(f(x))$ and expanding the terms, with $\approx$ indicating that the $\dots$ terms are dropped:

$$
\begin{align*}
x_0 + \Delta_x &= g(f(x_0 + \Delta_x)) \\
&\approx g(f(x_0) + \sum_{j=1}^n a_j (\Delta_x)^j) \\
&\approx g(f(x_0)) + \sum_{i=1}^n b_i \left(\sum_{j=1}^n a_j (\Delta_x)^j \right)^i \\
&\approx x_0 + \sum_{i=1}^{n-1} b_i \left(\sum_{j=1}^n a_j (\Delta_x)^j\right)^i + b_n \left(\sum_{j=1}^n a_j (\Delta_x)^j\right)^n
\end{align*}
$$

That is:
@@ -1207,7 +1215,7 @@ $$

These two polynomials are of degree $n$ or less and have $u(x) = h(x)-g(x)=0$, by uniqueness. So the coefficients of $u(x)$ are $0$. We have that the coefficient of $x^n$ must be $a_n-b_n$, so $a_n=b_n$. Our goal is to express $a_n$ in terms of $a_{n-1}$ and $b_{n-1}$. Focusing on the $x^{n-1}$ term, we have:

$$
\begin{align*}
b_n(x-x_n)(x-x_{n-1})\cdot\cdots\cdot(x-x_1)
&- a_n\cdot(x-x_0)\cdot\cdots\cdot(x-x_{n-1}) \\
@@ -1215,6 +1223,7 @@ b_n(x-x_n)(x-x_{n-1})\cdot\cdots\cdot(x-x_1)
a_n [(x-x_1)\cdot\cdots\cdot(x-x_{n-1})] [(x- x_n)-(x-x_0)] \\
&= -a_n \cdot(x_n - x_0) x^{n-1} + p_{n-2},
\end{align*}
$$

where $p_{n-2}$ is a polynomial of at most degree $n-2$. (The expansion of $(x-x_1)\cdot\cdots\cdot(x-x_{n-1})$ leaves $x^{n-1}$ plus some lower degree polynomial.) Similarly, we have $a_{n-1}(x-x_0)\cdot\cdots\cdot(x-x_{n-2}) = a_{n-1}x^{n-1} + q_{n-2}$ and $b_{n-1}(x-x_n)\cdot\cdots\cdot(x-x_2) = b_{n-1}x^{n-1}+r_{n-2}$. Combining, we get that the $x^{n-1}$ term of $u(x)$ is