pset 3 solutions
This commit is contained in:
parent 3a9e02f622 · commit 6458b96ef3
@ -233,7 +233,7 @@ Derivatives viewed as linear approximations have many important applications in
## Lecture 18 (Mar 17)

* d(A⁻¹), adjoint problems, chain rules, and forward vs reverse mode (backpropagation) derivatives
- * pset 3 solutions: coming soon
+ * [pset 3 solutions](psets/pset3sol.ipynb)
* pset 4: coming soon, due 4/7

Showed that d(A⁻¹)=–A⁻¹ dA A⁻¹. (For example, if you have a matrix A(t) that depends on a scalar t∈ℝ, this tells you that d(A⁻¹)/dt = –A⁻¹ dA/dt A⁻¹.) Applied this to computing ∇ₚf for scalar functions f(A⁻¹b) where A(p) depends on some parameters p∈ℝⁿ, and showed that ∂f/∂pᵢ = -(∇ₓf)ᵀA⁻¹(∂A/∂pᵢ)x. Moreover, showed that it is critically important to evaluate this expression from *left to right*, i.e. first computing vᵀ=-(∇ₓf)ᵀA⁻¹, which corresponds to solving the "adjoint (transposed) equation" Aᵀv=–∇ₓf. By doing so, you can compute both f and ∇ₚf at the cost of only "two solves" with the matrices A and Aᵀ. This is crucial in enabling engineering/physics optimization, where you often want to optimize a property of the solution A⁻¹b of a physical model A depending on many design variables p (e.g. the distribution of materials in space).
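
For example, here is a minimal finite-difference sanity check of d(A⁻¹) = –A⁻¹ dA A⁻¹ in Julia (a sketch with a small random A(t) = A₀ + t dA; the names and sizes here are arbitrary):

```julia
using LinearAlgebra

A₀, dA = randn(5, 5), randn(5, 5)   # random base matrix and perturbation direction
A(t) = A₀ + t * dA                  # a matrix depending on a scalar t
δt = 1e-6

fd = (inv(A(δt)) - inv(A(0))) / δt  # finite-difference estimate of d(A⁻¹)/dt at t = 0
exact = -inv(A₀) * dA * inv(A₀)     # the formula -A⁻¹ (dA/dt) A⁻¹
norm(fd - exact) / norm(exact)      # small (≈ δt, the first-order finite-difference error)
```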
psets/pset3sol.ipynb (new file, 516 lines)
@ -0,0 +1,516 @@
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "21a9eae2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 18.065 Problem Set 3 Solutions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a19ccce3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 1 (4+4+4+4 points)\n",
|
||||
"\n",
|
||||
"Suppose $x = A^{-1} b$ for an invertible $m \\times m$ square matrix $A$. If we perturb the right hand-side $b$ to $b + \\Delta b$, the new solution is $x+\\Delta x$ where $\\Delta x = A^{-1} \\Delta b$.\n",
|
||||
"\n",
|
||||
"In class, I claimed that\n",
|
||||
"$$\n",
|
||||
"\\frac{\\Vert \\Delta x \\Vert_2}{\\Vert x \\Vert_2} \\le \\kappa(A) \\frac{\\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2}\n",
|
||||
"$$\n",
|
||||
"where $\\kappa{A} = \\sigma_1 / \\sigma_m$ is the condition number of $A$. In this problem, you will prove that fact.\n",
|
||||
"\n",
|
||||
"**(a)** Show that $\\Vert \\Delta x \\Vert_2 \\le \\Vert A^{-1} \\Vert_2 \\, \\Vert \\Delta b \\Vert_2$. (Hint: this should follow easily from the definition of the induced norm, which you studied in pset 2.) Using the same proof, also show that $\\Vert b \\Vert_2 \\le \\Vert A \\Vert_2 \\, \\Vert x \\Vert_2$\n",
|
||||
"\n",
|
||||
"**(b)** Why does (a) tell you that $\\frac{\\Vert \\Delta x \\Vert_2}{\\Vert x \\Vert_2} \\le \\Vert A \\Vert_2 \\, \\Vert A^{-1} \\Vert_2 \\frac{\\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2}$?\n",
|
||||
"\n",
|
||||
"**(c)** Why does $\\Vert A \\Vert_2 \\, \\Vert A^{-1} \\Vert_2 = \\sigma_1 / \\sigma_m = \\kappa(A)$, the ratio of the largest to the smallest singular values of $A$? Why does $\\kappa(A) = \\kappa(A^{-1})$?\n",
|
||||
"\n",
|
||||
"**(d)** The following code generates a random 1000x1000 matrix with a condition number of $10^6$. How close does $\\frac{\\Vert \\Delta x \\Vert_2 \\, \\Vert b \\Vert_2}{\\Vert x \\Vert_2\\, \\Vert \\Delta b \\Vert_2}$ get to the upper bound above if you try a random $\\Delta b$ vector for a random $b$? Change the `????` lines in the code below to choose a *particular* $b$ and $\\Delta b$ such that $\\frac{\\Vert \\Delta x \\Vert_2 \\, \\Vert b \\Vert_2}{\\Vert x \\Vert_2\\, \\Vert \\Delta b \\Vert_2}$ reaches the upper bound."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3b5596c9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"**(a)** By definition of the induced norm\n",
|
||||
"$$\n",
|
||||
"\\Vert M \\Vert_2 = \\max_{y\\ne 0} \\frac{\\Vert My \\Vert_2}{\\Vert y \\Vert_2} \\ge \n",
|
||||
"\\frac{\\Vert Mz \\Vert_2}{\\Vert z \\Vert_2} \\implies \\boxed{ \\Vert M \\Vert_2 \\Vert z \\Vert_2 \\ge \\Vert Mz \\Vert_2 }\n",
|
||||
"$$\n",
|
||||
"for any matrix $M$ and any particular $z$. Hence, plugging in $M=A^{-1}$ and $z = \\Delta b$, we have the desired result:\n",
|
||||
"$$\n",
|
||||
"\\Vert \\Delta x \\Vert_2 = \\Vert A^{-1} \\Delta b \\Vert_2 \\le \\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2 \\, .\n",
|
||||
"$$\n",
|
||||
"Similarly, plugging in $M = A$ and $z = x$ we have:\n",
|
||||
"$$\n",
|
||||
"\\Vert b \\Vert_2 = \\Vert A x \\Vert_2 \\le \\Vert A \\Vert_2 \\Vert x \\Vert_2 \\, .\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"**(b)** Using part (a), we find (replacing the numerator with something bigger and the denominator with something smaller):\n",
|
||||
"$$\n",
|
||||
"\\frac{\\Vert \\Delta x \\Vert_2}{\\Vert x \\Vert_2} \\le \n",
|
||||
"\\frac{\\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2}{\\Vert x \\Vert_2} \\le\n",
|
||||
"\\frac{\\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2 / \\Vert A \\Vert_2} = \\Vert A \\Vert_2 \\Vert A^{-1} \\Vert_2 \\frac{\\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2} \\, .\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"**(c)** From class (and pset 2), $\\Vert A \\Vert_2 = \\sigma_1$, the largest singular value of $A$. If $A$ is invertible, then the singular values of $A^{-1} = V \\Sigma^{-1} U^T$ are simply the inverses of the singular values $\\sigma_k$ of $A$, hence $\\Vert A^{-1} \\Vert_2 = 1/\\sigma_m$: the largest singular value of $A^{-1}$ is the inverse of the *smallest* singular value of $A$. Hence $\\Vert A \\Vert_2 \\Vert A^{-1} \\Vert_2 = \\sigma_1 / \\sigma_m = \\kappa(A)$.\n",
|
||||
"\n",
|
||||
"**(d)** See the code and our comments thereon below:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "f0c203c3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"cond(A) = 999999.9999943373\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"1.1445045178383875"
|
||||
]
|
||||
},
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"using LinearAlgebra\n",
|
||||
"\n",
|
||||
"m = 1000\n",
|
||||
"U = qr(randn(1000,1000)).Q # random unitary U\n",
|
||||
"V = qr(randn(1000,1000)).Q # random unitary V\n",
|
||||
"σ = exp10.(range(0, -6, length=m)) # log-spaced singular vals from 1 to 10⁻⁶\n",
|
||||
"A = U * Diagonal(σ) * V' # construct A from its SVD\n",
|
||||
"@show cond(A) # condition number ≈ 10⁶\n",
|
||||
"\n",
|
||||
"b = randn(m) # a random b\n",
|
||||
"x = A \\ b\n",
|
||||
"\n",
|
||||
"Δb = randn(m) * 1e-3 # random perturbation\n",
|
||||
"Δx = A \\ Δb # change in solution\n",
|
||||
"\n",
|
||||
"(norm(Δx) * norm(b)) / (norm(x) * norm(Δb)) # ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d2258d15",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂ ≈ 1.14 is **much less** than $\\kappa(A) \\approx 10^6$. Why is this?\n",
|
||||
"\n",
|
||||
"The answer becomes clear if we choose $b$ and $\\Delta b$ to saturate the bound. In order to do this we must change our inequalities to equalities in part (a). That is, we must have $\\Vert b \\Vert_2 = \\Vert A \\Vert_2 \\Vert x \\Vert_2 = \\sigma_1 \\Vert x \\Vert_2$, which requires that $x$ **must be a multiple of the first right-singular vector** $v_1$ and hence $b$ is a **multiple of the first left-singular vector** $u_1$. Similarly, to have $\\Vert \\Delta x \\Vert_2 = \\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2 = \\sigma_m^{-1} \\Vert \\Delta b \\Vert_2$, we must have $\\Delta b$ **a multiple of the last left-singular vector** $u_m$.\n",
|
||||
"\n",
|
||||
"For example, we can choose $b = u_1$ and $\\Delta b = 10^{-3} u_m$:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "e3631ff5",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"999999.9999925973"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# change this code so that ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂ attains its upper bound.\n",
|
||||
"b = U[:,1] # any multiple of u₁\n",
|
||||
"x = A \\ b\n",
|
||||
"Δb = 1e-3 * U[:,m] # any multiple of uₘ\n",
|
||||
"Δx = A \\ Δb\n",
|
||||
"(norm(Δx) * norm(b)) / (norm(x) * norm(Δb)) # ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "acd3fb30",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Voilà, we have hit $\\kappa(A)$!\n",
|
||||
"\n",
|
||||
"But this is quite unlikely with a *random* $b$ and $\\Delta b$; e.g. a random $\\Delta b$ will probably have a substantial $u_1$ component so you will get $\\Vert \\Delta x \\Vert_2 \\ll \\sigma_m^{-1} \\Vert \\Delta b \\Vert_2$. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b094adc3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 2 (10 points)\n",
|
||||
"\n",
|
||||
"In class, I said that solving least-square problems via $\\hat{R}\\hat{x} = \\hat{Q}^* b$ using the thin QR factorization ($A = \\hat{Q} \\hat{R}$ where both $A$ and $\\hat{Q}$ are $m \\times n$ with rank $n$) is good because it avoids squaring the condition number.\n",
|
||||
"\n",
|
||||
"In particular, I said $\\kappa(A^T A) = \\kappa(A)^2$ but $\\kappa(\\hat{R}) = \\kappa(A)$. Derive this. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e88bb43c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"Note that $A = \\hat{Q} \\hat{R}$ and $\\hat{Q}^T \\hat{Q} = I$ (its columns are orthonormal), so $\\hat{R} = \\hat{Q}^T A$.\n",
|
||||
"\n",
|
||||
"First, let's show that $\\kappa(A^T A) = \\kappa(A)$. This is easy, since if $A = U\\Sigma V^T$ then $A^T A = V \\Sigma^T \\Sigma V^T$ is exactly an SVD with the squared singular values ($\\Sigma^T \\Sigma$ is diagonal with $\\sigma_k^2$ on the diagonal, as we also saw in class.) If the singular values of $A^TA$ are squared, then the condition number is squared.\n",
|
||||
"\n",
|
||||
"Second, the singular values of $\\hat{R}$ are the square roots of eigenvalues of:\n",
|
||||
"$$\n",
|
||||
"\\hat{R}^T \\hat{R} = A^T \\hat{Q} \\hat{Q}^T A \\, .\n",
|
||||
"$$\n",
|
||||
"Now $\\hat{Q} \\hat{Q}^T \\ne I$ since $\\hat{Q}$ is not square, but it is *orthogonal projection* onto $C(\\hat{Q}) = C(A)$, so applying this projection to $A$ yields $A$! i.e. $\\hat{Q} \\hat{Q}^T A = A$ and so\n",
|
||||
"$\\boxed{\\hat{R}^T \\hat{R} = A^T A}$, so they have the **same singular values** and hence the same condition number $\\boxed{\\kappa(\\hat{R}) = \\kappa(A)}$."
|
||||
]
|
||||
},
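{
"cell_type": "markdown",
"id": "3f1c2a7b",
"metadata": {},
"source": [
"As a quick numerical sanity check (a minimal sketch with an arbitrary small random tall matrix $A$), we can compare $\\kappa(\\hat{R})$ and $\\kappa(A^T A)$ against $\\kappa(A)$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d4e8c21",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra\n",
"A = randn(100, 10)          # random 100×10 matrix (full column rank with probability 1)\n",
"R̂ = qr(A).R                 # n×n triangular factor of the thin QR\n",
"@show cond(A) cond(R̂)       # these two should match\n",
"@show cond(A'A) cond(A)^2;  # κ(AᵀA) should be the square of κ(A)"
]
},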
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bffaf19f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 3 (10 points)\n",
|
||||
"\n",
|
||||
"Another form of regularized least-squares problem is to assume you know something about what the solution $x$ should look like, i.e. you have a \"prior\" $x_p$ guess of the solution. Suppose you then want to solve\n",
|
||||
"$$\n",
|
||||
"\\min_x \\left(\\Vert b - Ax \\Vert_2^2 + \\delta^2 \\Vert x - x_p \\Vert_2^2 \\right)\n",
|
||||
"$$\n",
|
||||
"where $\\delta \\ne 0$ is a regularization strength (larger δ means that the solution will be closer to $x_p$).\n",
|
||||
"\n",
|
||||
"Show how this is equivalent to solving an equation of the form $(A^T A + ?)\\hat{x} = ?$. (Very similar to how we handled the Tikhonov regularization in class, which corresponds to the special case $x_p = 0$!)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2454ba7e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"We can solve this in several ways. One of the simplest is a change of variables: let $y = x - x_p$. Then $Ax = Ay + Ax_p$, and our equation becomes ordinary Tikhonov least squares:\n",
|
||||
"$$\n",
|
||||
"\\min_y \\left( \\Vert b - Ax_p - Ay \\Vert_2^2 + \\delta^2 \\Vert y \\Vert_2^2 \\right) \\implies \\left(A^T A + \\delta^2 I \\right) \\hat{y} = A^T \\left( b - Ax_p \\right)\n",
|
||||
"$$\n",
|
||||
"Substituting back in $\\hat{y} = \\hat{x} - x_p$ yields:\n",
|
||||
"$$\n",
|
||||
"\\boxed{\\left(A^T A + \\delta^2 I \\right) \\hat{x}} = (A^T b - A^T Ax_p) + \\left(A^T A + \\delta^2 I \\right) x_p \\boxed{ = A^T b + \\delta^2 x_p}\n",
|
||||
"$$\n",
|
||||
"which is identical to the Tikhonov normal equations but with a modified right-hand side.\n",
|
||||
"\n",
|
||||
"Another way to derive this is to follow the procedure from class and \"stack\" the two norms into a single $\\ell^2$ norm to get an ordinary least-squares problem:\n",
|
||||
"$$\n",
|
||||
"\\Vert b - Ax \\Vert_2^2 + \\delta^2 \\Vert x - x_p \\Vert_2^2\n",
|
||||
"= \\left\\Vert \\begin{pmatrix} b - Ax \\\\ -\\delta x_p + \\delta Ix \\end{pmatrix} \\right\\Vert_2^2 = \\left\\Vert \\underbrace{\\begin{pmatrix} b \\\\ -\\delta x_p \\end{pmatrix}}_c - \\underbrace{\\begin{pmatrix} A \\\\ -\\delta I \\end{pmatrix}}_B x \\right\\Vert_2^2\n",
|
||||
"$$\n",
|
||||
"which is an ordinary least-squares problem with a new matrix $B$ and a new right-hand side, giving normal equations\n",
|
||||
"$$\n",
|
||||
"\\underbrace{B^T B}_{A^T A + \\delta^2 I} \\hat{x} = \\underbrace{B^T c}_{A^T b + \\delta^2 x_p}\n",
|
||||
"$$\n",
|
||||
"equivalent to those above."
|
||||
]
|
||||
},
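{
"cell_type": "markdown",
"id": "1f7c9a02",
"metadata": {},
"source": [
"As a numerical sanity check (a minimal sketch with arbitrary random data, prior $x_p$, and regularization strength $\\delta$), the modified normal equations should give the same $\\hat{x}$ as the stacked least-squares formulation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a8d0b13",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra\n",
"m, n = 30, 10\n",
"A, b, xp = randn(m, n), randn(m), randn(n)  # random data and prior guess xₚ\n",
"δ = 0.5                                     # arbitrary regularization strength\n",
"x1 = (A'A + δ^2*I) \\ (A'b + δ^2*xp)         # modified normal equations\n",
"x2 = [A; -δ*Matrix(I, n, n)] \\ [b; -δ*xp]   # stacked least-squares with B and c\n",
"norm(x1 - x2)                               # ≈ 0 up to roundoff"
]
},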
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5d59e8a0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 4 (10 points)\n",
|
||||
"\n",
|
||||
"In pset 1, problem 3, you took a $1000 \\times 1000$ matrix $A$ of rank 4 and multiplied it by a random $1000 \\times 10$ matrix $X$, and we argued that the rank of $A$ almost certainly equalled the rank of $AX$ or $(AX)^T (AX)$ or $X^T AX$. However, the *values* of the singular values were not the same.\n",
|
||||
"\n",
|
||||
"In this, problem, you'll see what happens if you orthonormalize the outputs $AX$. We can orthonormalize $AX$ using `Q = Matrix(qr(A*X).Q)` in Julia, which forms the thin QR factorization.\n",
|
||||
"\n",
|
||||
"For the same rank-4 `A` and random `X` as in pset 1, problem 3 (or at least, generated randomly by the same code, reproduced below), compute `Q` and use it to obtain a *much smaller matrix* (e.g. 10x10, 1000x10, or 10x1000) whose 4 biggest singular values *nearly match* those of `A`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "91cda73e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"As discussed in class (lecture 13 and [these notes](https://gregorygundersen.com/blog/2019/01/17/randomized-svd/)), since we expect $Q$ to span $C(A)$, we should have $QQ^TA = A$, and hence the singular values of the much smaller ($10\\times 1000$) matrix $S = Q^TA$ should nearly match those of $A$:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "74797236",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"(svdvals(A))[1:6] = [1063.8855386293208, 1004.87019779304, 976.5002449189662, 948.0683756097217, 2.5808249126107803e-12, 2.2288314628390244e-12]\n",
|
||||
"(svdvals(S))[1:6] = [1063.8855386293203, 1004.8701977930399, 976.5002449189659, 948.0683756097219, 3.14037555395831e-14, 2.697874343992467e-14]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"using LinearAlgebra\n",
|
||||
"A = randn(1000, 4) * randn(4, 1000) # random 1000x1000 matrix of rank 4\n",
|
||||
"@show svdvals(A)[1:6]\n",
|
||||
"\n",
|
||||
"X = randn(1000, 10)\n",
|
||||
"Q = Matrix(qr(A*X).Q)\n",
|
||||
"\n",
|
||||
"S = Q'*A # much smaller matrix using Q and/or A and/or X\n",
|
||||
"@show svdvals(S)[1:6];"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "49175961",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"By eye, the first 4 (nonzero) singular values nearly match!\n",
|
||||
"\n",
|
||||
"Let's be more quantitative and look at the difference $\\Delta \\sigma_k$ in the singular values. We want this to be small, but small compared to what? An appropriate yardstick here is $\\sigma_1$, the largest singular value of $A$, or equivalently $\\Vert A \\Vert_2$:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "c9e400b2",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"6-element Vector{Float64}:\n",
|
||||
" 4.274401092737358e-16\n",
|
||||
" 1.0686002731843394e-16\n",
|
||||
" 3.2058008195530186e-16\n",
|
||||
" -2.137200546368679e-16\n",
|
||||
" 2.396330304814367e-15\n",
|
||||
" 2.069633094398391e-15"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"σ_A = svdvals(A)\n",
|
||||
"σ_S = svdvals(S)\n",
|
||||
"(σ_A[1:6] - σ_S[1:6]) / σ_A[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e68cc1b7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Hooray, they match to within a factor of ≈10 of [machine precision](https://en.wikipedia.org/wiki/Machine_epsilon), given by `eps()` in Julia:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "f589dc35",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"2.220446049250313e-16"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"eps()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b9415ba3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 5 (10 points)\n",
|
||||
"\n",
|
||||
"Suppose that $X$ is a random $m \\times n$ matrix (drawn from some probability distribution) and its mean (expectation value) is $E[X] = X_0$.\n",
|
||||
"\n",
|
||||
"Derive the following identity for the variance (which we used in class):\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"E[\\Vert X - X_0 \\Vert_F^2] = E[\\Vert X \\Vert_F^2] - \\Vert X_0 \\Vert_F^2\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"(If you just write out the Frobenius norm definition $\\Vert A \\Vert_F^2 = \\text{tr}[A^T A]$, the problem should be very straightforward. Remember that the expectation value is conceptually just a sum, so it can be moved inside any linear operation like the trace.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bc5addc1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"As indicated by the hint, let's write out the trace formula for the Frobenius norm and use linearity:\n",
|
||||
"$$\n",
|
||||
"E[\\Vert X - X_0 \\Vert_F^2] = E[\\text{tr} (X - X_0)^T(X - X_0)] = \\\\\n",
|
||||
"\\underbrace{E[\\text{tr}(X^T X)]}_{E[\\Vert X \\Vert_F^2]} \\; - \\underbrace{E[\\text{tr}(X^T X_0)]}_{\\text{tr}(E[X]^T X_0)=\\text{tr}(X_0^T X_0)=\\Vert X_0 \\Vert_F^2} - \\underbrace{E[\\text{tr}(X_0^T X)]}_{\\text{tr}(X_0^T E[X])=\\text{tr}(X_0^T X_0)=\\Vert X_0 \\Vert_F^2} + \\;\\underbrace{E[\\text{tr}(X_0^T X_0)]}_{\\Vert X_0 \\Vert_F^2} \\\\\n",
|
||||
"= E[\\Vert X \\Vert_F^2] - \\Vert X_0 \\Vert_F^2\n",
|
||||
"$$\n",
|
||||
"The key point is that $X_0$ is a constant and trace is linear (a sum of traces = trace of sum!), so the expectation value can be moved inside the trace terms to convert $X$ to $X_0$."
|
||||
]
|
||||
},
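{
"cell_type": "markdown",
"id": "8c3b51e9",
"metadata": {},
"source": [
"A quick Monte-Carlo sanity check (a sketch assuming Gaussian noise around an arbitrary fixed $X_0$; the two sides agree only up to sampling error $\\sim 1/\\sqrt{N}$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b60d2c4",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra, Statistics\n",
"m, n, N = 5, 3, 10^5\n",
"X0 = randn(m, n)                                # fixed mean matrix X₀\n",
"Xs = [X0 + randn(m, n) for _ in 1:N]            # samples X with E[X] = X₀\n",
"lhs = mean(norm(X - X0)^2 for X in Xs)          # E‖X - X₀‖²_F (norm of a matrix is Frobenius)\n",
"rhs = mean(norm(X)^2 for X in Xs) - norm(X0)^2  # E‖X‖²_F - ‖X₀‖²_F\n",
"lhs, rhs                                        # ≈ equal (both ≈ mn = 15 here)"
]
},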
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "55b01fd5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 6 (5+5 points)\n",
|
||||
"\n",
|
||||
"Suppose we are solving a generalized/weighted least-square problem:\n",
|
||||
"$$\n",
|
||||
"\\min_{x \\in \\mathbb{R}^n} \\Vert b - Ax \\Vert_W\n",
|
||||
"$$\n",
|
||||
"where $A$ is $m \\times n$, $W = W^T$ is an $m \\times m$ positive-definite \"weight\" matrix, and $\\Vert y \\Vert_W = \\sqrt{y^T W y}$ is a weighted $\\ell^2$ norm.\n",
|
||||
"\n",
|
||||
"**(a)** Show that this is equivalent to an ordinary least-square problem of minimizing $\\Vert d - Cx \\Vert_2 = \\Vert b - Ax \\Vert_W$ for some matrix $C$ and vector $d$. (Hint: use the fact that any positive-definite matrix $W$ can be factored as $W = $ \\_\\_\\_\\_\\_.)\n",
|
||||
"\n",
|
||||
"**(b)** Show that the normal equations $C^T C \\hat{x} = C^T d$ are equivalent to the generalized normal equations $A^T W A \\hat{x} = A^T W b$ given in class."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7ac0dbc1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"**(a)** From class, *any* positive-definite matrix $W$ can be factored as $W = B^T B$ for some matrix $B$. (There are many possible factors $B$. Computationally, one can use the [matrix square root](https://en.wikipedia.org/wiki/Square_root_of_a_matrix) $B = \\sqrt{W}$, or more efficiently one can use $B$ to be the [Cholesky factor](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $W$.) If we then plug this in to our least-square objective, we obtain:\n",
|
||||
"$$\n",
|
||||
"\\Vert b - Ax \\Vert_W = (b-Ax)^T B^T B (b-Ax) = \\Vert B(b-Ax) \\Vert_2 = \\Vert d - Cx \\Vert_2\n",
|
||||
"$$\n",
|
||||
"for $d = Bb$ and $C = BA$.\n",
|
||||
"\n",
|
||||
"**(b)** The normal equations now give:\n",
|
||||
"$$\n",
|
||||
"C^T C \\hat{x} = (BA)^T (BA) \\hat{x} = A^T \\underbrace{B^T B}_W A \\hat{x} \\\\ = C^T d = (BA)^T Bb = A^T \\underbrace{B^T B}_W b\n",
|
||||
"$$\n",
|
||||
"which matches the equations from class.\n"
|
||||
]
|
||||
},
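{
"cell_type": "markdown",
"id": "6e2f4d8a",
"metadata": {},
"source": [
"As a numerical check (a minimal sketch using an arbitrary random positive-definite $W$ and its Cholesky factor as $B$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d9a7e31",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra\n",
"m, n = 20, 5\n",
"A, b = randn(m, n), randn(m)\n",
"M = randn(m, m); W = Symmetric(M'M + I)  # random positive-definite weight matrix\n",
"B = cholesky(W).U                        # W = BᵀB with B the upper-triangular Cholesky factor\n",
"C, d = B*A, B*b\n",
"x1 = C \\ d                               # ordinary least squares with C and d\n",
"x2 = (A'*W*A) \\ (A'*W*b)                 # generalized normal equations from class\n",
"norm(x1 - x2)                            # ≈ 0 up to roundoff"
]
},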
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "88ab5b5e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 7 (5+5 points)\n",
|
||||
"\n",
|
||||
"The most common form of least-squares is linear regression, i.e. fitting $m$ data points $(a_i, b_i)$ to a model of the form $b(a) = x_1 + a x_2$.\n",
|
||||
"\n",
|
||||
"Suppose the data points $b_i$ have independent random errors with equal variances $\\sigma^2$ (i.e. they are \"homoscedastic\"). (This is the case in which Gauss–Markov says that ordinary least-squares minimizes the variance.) In this case, many sources give simple explicit formulas for the variances of the fit coefficients $x_1$ and $x_2$\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"\\text{variance of }\\hat{x}_1 = \\frac{\\sigma^2 \\sum_{i=1}^m a_i^2} {m \\sum_{j=1}^m (a_j - \\text{mean } a)^2}\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"\\text{variance of }\\hat{x}_2 = \\frac{\\sigma^2 } {\\sum_{j=1}^m (a_j - \\text{mean } a)^2}\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"**Derive these formulas** from the more general equation $W = (A^T V^{-1} A)^{-1}$ for the covariance matrix of $\\hat{x}$ that we found in class for weighted least squares (where $V$ is the covariance matrix of $b$).\n",
|
||||
"\n",
|
||||
"(The formula for the inverse of a 2x2 matrix will be helpful; google it.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "940f45e9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"First, we need the matrix $A$ that yields our linear least-square fitting problem for $Ax \\approx b$. For the model $x_1 + a x_2$, this is simply:\n",
|
||||
"$$\n",
|
||||
"A = \\begin{pmatrix} 1 & a_1 \\\\ 1 & a_2 \\\\ \\vdots & \\vdots \\\\ 1 & a_m \\end{pmatrix}\n",
|
||||
"$$\n",
|
||||
"Moreover, for homoscedastic data $b$, the variance is simply:\n",
|
||||
"$$\n",
|
||||
"V = E[(b - \\text{mean }b) (b - \\text{mean }b)^T] = \\sigma^2 I\n",
|
||||
"$$\n",
|
||||
"So, we have\n",
|
||||
"$$\n",
|
||||
"A^T V^{-1} A = \\sigma^{-2} A^T A = \\sigma^{-2} \\begin{pmatrix} m & \\sum_i a_i \\\\ \\sum_i a_i & \\sum_i a_i^2 \\end{pmatrix} = \\sigma^{-2} \\begin{pmatrix} m & m a_0 \\\\ m a_0 & \\sum_i a_i^2 \\end{pmatrix}\n",
|
||||
"$$\n",
|
||||
"where for convenience we define $a_0 = \\text{mean } a$.\n",
|
||||
"\n",
|
||||
"To invert this matrix and find $W$, we need the determinant of the $2\\times 2$ matrix $(\\cdots)$ in parentheses, which is:\n",
|
||||
"$$\n",
|
||||
"\\det \\begin{pmatrix} m & m a_0 \\\\ m a_0 & \\sum_i a_i^2 \\end{pmatrix} \n",
|
||||
"= m \\sum_i a_i^2 - m^2 a_0^2 = m^2 (\\text{mean}[a^2] - a_0) = m^2 \\text{mean}[(a-a_0)^2]\n",
|
||||
"$$\n",
|
||||
"where the last identity is exactly the scalar version of the identity from problem 5 (i.e. for \"$1\\times 1$ matrices\" $X=a$).\n",
|
||||
"\n",
|
||||
"Then, applying the well-known formula $\\begin{pmatrix} a & b \\\\ c & d \\end{pmatrix}^{-1} = \\frac{1}{ad-bc} \\begin{pmatrix} d & -b \\\\ -c & a \\end{pmatrix}$\n",
|
||||
"we obtain\n",
|
||||
"$$\n",
|
||||
"W = E[(\\hat{x} - \\text{mean }\\hat{x}) (\\hat{x} - \\text{mean }\\hat{x})^T] = \\sigma^2 (A^T A)^{-1} \\\\\n",
|
||||
"= \\frac{\\sigma^2}{m\\sum_i (a_i - a_0)^2} \n",
|
||||
"\\begin{pmatrix} \\sum_i a_i^2 & -m a_0 \\\\ -m a_0 & m \\end{pmatrix}\n",
|
||||
"= \\begin{pmatrix} \\text{variance of }\\hat{x}_1 & \\text{covariance} \\\\\n",
|
||||
"\\text{covariance} & \\text{variance of }\\hat{x}_2 \\end{pmatrix}\n",
|
||||
"$$\n",
|
||||
"The diagonal elements of the covariance matrix $W$ are the desired variances, and by inspection match the formulas above.\n",
|
||||
"\n",
|
||||
"(Note also that these variances should go $\\to 0$ as $m \\to \\infty$: more data gives a more precise fit!)"
|
||||
]
|
||||
},
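{
"cell_type": "markdown",
"id": "4c8e1b57",
"metadata": {},
"source": [
"A quick numerical check of these formulas against the diagonal of $\\sigma^2 (A^T A)^{-1}$ (a sketch with arbitrary random data locations $a_i$ and noise level $\\sigma$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a5f9c26",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra, Statistics\n",
"m = 100\n",
"a = randn(m)                 # arbitrary data locations aᵢ\n",
"σ = 0.1                      # noise standard deviation\n",
"A = [ones(m) a]              # design matrix for the model b(a) = x₁ + a x₂\n",
"W = σ^2 * inv(A'A)           # covariance matrix of x̂ when V = σ²I\n",
"var1 = σ^2 * sum(a.^2) / (m * sum((a .- mean(a)).^2))  # formula for the variance of x̂₁\n",
"var2 = σ^2 / sum((a .- mean(a)).^2)                    # formula for the variance of x̂₂\n",
"(W[1,1], var1), (W[2,2], var2)                         # each pair should match"
]
},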
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f45c3283",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Julia 1.8.3",
|
||||
"language": "julia",
|
||||
"name": "julia-1.8"
|
||||
},
|
||||
"language_info": {
|
||||
"file_extension": ".jl",
|
||||
"mimetype": "application/julia",
|
||||
"name": "julia",
|
||||
"version": "1.8.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}