pset 3 solutions
This commit is contained in:
parent 3a9e02f622 · commit 6458b96ef3
@ -233,7 +233,7 @@ Derivatives viewed as linear approximations have many important applications in
## Lecture 18 (Mar 17)

* d(A⁻¹), adjoint problems, chain rules, and forward vs reverse mode (backpropagation) derivatives
- * pset 3 solutions: coming soon
+ * [pset 3 solutions](psets/pset3sol.ipynb)
* pset 4: coming soon, due 4/7

Showed that d(A⁻¹)=–A⁻¹ dA A⁻¹. (For example, if you have a matrix A(t) that depends on a scalar t∈ℝ, this tells you that d(A⁻¹)/dt = –A⁻¹ dA/dt A⁻¹.) Applied this to computing ∇ₚf for scalar functions f(A⁻¹b) where A(p) depends on some parameters p∈ℝⁿ, and showed that ∂f/∂pᵢ = -(∇ₓf)ᵀA⁻¹(∂A/∂pᵢ)x. Moreover, showed that it is critically important to evaluate this expression from *left to right*, i.e. first computing vᵀ=-(∇ₓf)ᵀA⁻¹, which corresponds to solving the "adjoint (transposed) equation" Aᵀv=–∇ₓf. By doing so, you can compute both f and ∇ₚf at the cost of only "two solves" with the matrices A and Aᵀ. This is crucial in enabling engineering/physics optimization, where you often want to optimize a property of the solution A⁻¹b of a physical model A depending on many design variables p (e.g. the distribution of materials in space).
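
For example, here is a minimal finite-difference sanity check of d(A⁻¹) = –A⁻¹ dA A⁻¹ in Julia (a sketch with a small random A(t) = A₀ + t dA; the names and sizes here are arbitrary):

```julia
using LinearAlgebra

A₀, dA = randn(5, 5), randn(5, 5)   # random base matrix and perturbation direction
A(t) = A₀ + t * dA                  # a matrix depending on a scalar t
δt = 1e-6

fd = (inv(A(δt)) - inv(A(0))) / δt  # finite-difference estimate of d(A⁻¹)/dt at t = 0
exact = -inv(A₀) * dA * inv(A₀)     # the formula -A⁻¹ (dA/dt) A⁻¹
norm(fd - exact) / norm(exact)      # small (≈ δt, the first-order finite-difference error)
```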
psets/pset3sol.ipynb (new file, 516 lines)
@ -0,0 +1,516 @@
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "21a9eae2",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 18.065 Problem Set 3 Solutions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a19ccce3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 1 (4+4+4+4 points)\n",
|
||||
"\n",
|
||||
"Suppose $x = A^{-1} b$ for an invertible $m \\times m$ square matrix $A$. If we perturb the right hand-side $b$ to $b + \\Delta b$, the new solution is $x+\\Delta x$ where $\\Delta x = A^{-1} \\Delta b$.\n",
|
||||
"\n",
|
||||
"In class, I claimed that\n",
|
||||
"$$\n",
|
||||
"\\frac{\\Vert \\Delta x \\Vert_2}{\\Vert x \\Vert_2} \\le \\kappa(A) \\frac{\\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2}\n",
|
||||
"$$\n",
|
||||
"where $\\kappa{A} = \\sigma_1 / \\sigma_m$ is the condition number of $A$. In this problem, you will prove that fact.\n",
|
||||
"\n",
|
||||
"**(a)** Show that $\\Vert \\Delta x \\Vert_2 \\le \\Vert A^{-1} \\Vert_2 \\, \\Vert \\Delta b \\Vert_2$. (Hint: this should follow easily from the definition of the induced norm, which you studied in pset 2.) Using the same proof, also show that $\\Vert b \\Vert_2 \\le \\Vert A \\Vert_2 \\, \\Vert x \\Vert_2$\n",
|
||||
"\n",
|
||||
"**(b)** Why does (a) tell you that $\\frac{\\Vert \\Delta x \\Vert_2}{\\Vert x \\Vert_2} \\le \\Vert A \\Vert_2 \\, \\Vert A^{-1} \\Vert_2 \\frac{\\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2}$?\n",
|
||||
"\n",
|
||||
"**(c)** Why does $\\Vert A \\Vert_2 \\, \\Vert A^{-1} \\Vert_2 = \\sigma_1 / \\sigma_m = \\kappa(A)$, the ratio of the largest to the smallest singular values of $A$? Why does $\\kappa(A) = \\kappa(A^{-1})$?\n",
|
||||
"\n",
|
||||
"**(d)** The following code generates a random 1000x1000 matrix with a condition number of $10^6$. How close does $\\frac{\\Vert \\Delta x \\Vert_2 \\, \\Vert b \\Vert_2}{\\Vert x \\Vert_2\\, \\Vert \\Delta b \\Vert_2}$ get to the upper bound above if you try a random $\\Delta b$ vector for a random $b$? Change the `????` lines in the code below to choose a *particular* $b$ and $\\Delta b$ such that $\\frac{\\Vert \\Delta x \\Vert_2 \\, \\Vert b \\Vert_2}{\\Vert x \\Vert_2\\, \\Vert \\Delta b \\Vert_2}$ reaches the upper bound."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3b5596c9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"**(a)** By definition of the induced norm\n",
|
||||
"$$\n",
|
||||
"\\Vert M \\Vert_2 = \\max_{y\\ne 0} \\frac{\\Vert My \\Vert_2}{\\Vert y \\Vert_2} \\ge \n",
|
||||
"\\frac{\\Vert Mz \\Vert_2}{\\Vert z \\Vert_2} \\implies \\boxed{ \\Vert M \\Vert_2 \\Vert z \\Vert_2 \\ge \\Vert Mz \\Vert_2 }\n",
|
||||
"$$\n",
|
||||
"for any matrix $M$ and any particular $z$. Hence, plugging in $M=A^{-1}$ and $z = \\Delta b$, we have the desired result:\n",
|
||||
"$$\n",
|
||||
"\\Vert \\Delta x \\Vert_2 = \\Vert A^{-1} \\Delta b \\Vert_2 \\le \\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2 \\, .\n",
|
||||
"$$\n",
|
||||
"Similarly, plugging in $M = A$ and $z = x$ we have:\n",
|
||||
"$$\n",
|
||||
"\\Vert b \\Vert_2 = \\Vert A x \\Vert_2 \\le \\Vert A \\Vert_2 \\Vert x \\Vert_2 \\, .\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"**(b)** Using part (a), we find (replacing the numerator with something bigger and the denominator with something smaller):\n",
|
||||
"$$\n",
|
||||
"\\frac{\\Vert \\Delta x \\Vert_2}{\\Vert x \\Vert_2} \\le \n",
|
||||
"\\frac{\\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2}{\\Vert x \\Vert_2} \\le\n",
|
||||
"\\frac{\\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2 / \\Vert A \\Vert_2} = \\Vert A \\Vert_2 \\Vert A^{-1} \\Vert_2 \\frac{\\Vert \\Delta b \\Vert_2}{\\Vert b \\Vert_2} \\, .\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"**(c)** From class (and pset 2), $\\Vert A \\Vert_2 = \\sigma_1$, the largest singular value of $A$. If $A$ is invertible, then the singular values of $A^{-1} = V \\Sigma^{-1} U^T$ are simply the inverses of the singular values $\\sigma_k$ of $A$, hence $\\Vert A^{-1} \\Vert_2 = 1/\\sigma_m$: the largest singular value of $A^{-1}$ is the inverse of the *smallest* singular value of $A$. Hence $\\Vert A \\Vert_2 \\Vert A^{-1} \\Vert_2 = \\sigma_1 / \\sigma_m = \\kappa(A)$.\n",
|
||||
"\n",
|
||||
"**(d)** See the code and our comments thereon below:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "f0c203c3",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"cond(A) = 999999.9999943373\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"1.1445045178383875"
|
||||
]
|
||||
},
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"using LinearAlgebra\n",
|
||||
"\n",
|
||||
"m = 1000\n",
|
||||
"U = qr(randn(1000,1000)).Q # random unitary U\n",
|
||||
"V = qr(randn(1000,1000)).Q # random unitary V\n",
|
||||
"σ = exp10.(range(0, -6, length=m)) # log-spaced singular vals from 1 to 10⁻⁶\n",
|
||||
"A = U * Diagonal(σ) * V' # construct A from its SVD\n",
|
||||
"@show cond(A) # condition number ≈ 10⁶\n",
|
||||
"\n",
|
||||
"b = randn(m) # a random b\n",
|
||||
"x = A \\ b\n",
|
||||
"\n",
|
||||
"Δb = randn(m) * 1e-3 # random perturbation\n",
|
||||
"Δx = A \\ Δb # change in solution\n",
|
||||
"\n",
|
||||
"(norm(Δx) * norm(b)) / (norm(x) * norm(Δb)) # ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d2258d15",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂ ≈ 1.14 is **much less** than $\\kappa(A) \\approx 10^6$. Why is this?\n",
|
||||
"\n",
|
||||
"The answer becomes clear if we choose $b$ and $\\Delta b$ to saturate the bound. In order to do this we must change our inequalities to equalities in part (a). That is, we must have $\\Vert b \\Vert_2 = \\Vert A \\Vert_2 \\Vert x \\Vert_2 = \\sigma_1 \\Vert x \\Vert_2$, which requires that $x$ **must be a multiple of the first right-singular vector** $v_1$ and hence $b$ is a **multiple of the first left-singular vector** $u_1$. Similarly, to have $\\Vert \\Delta x \\Vert_2 = \\Vert A^{-1} \\Vert_2 \\Vert \\Delta b \\Vert_2 = \\sigma_m^{-1} \\Vert \\Delta b \\Vert_2$, we must have $\\Delta b$ **a multiple of the last left-singular vector** $u_m$.\n",
|
||||
"\n",
|
||||
"For example, we can choose $b = u_1$ and $\\Delta b = 10^{-3} u_m$:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "e3631ff5",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"999999.9999925973"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# change this code so that ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂ attains its upper bound.\n",
|
||||
"b = U[:,1] # any multiple of u₁\n",
|
||||
"x = A \\ b\n",
|
||||
"Δb = 1e-3 * U[:,m] # any multiple of uₘ\n",
|
||||
"Δx = A \\ Δb\n",
|
||||
"(norm(Δx) * norm(b)) / (norm(x) * norm(Δb)) # ‖Δx‖₂‖b‖₂ / ‖x‖₂‖Δb‖₂"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "acd3fb30",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Voilà, we have hit $\\kappa(A)$!\n",
|
||||
"\n",
|
||||
"But this is quite unlikely with a *random* $b$ and $\\Delta b$; e.g. a random $\\Delta b$ will probably have a substantial $u_1$ component so you will get $\\Vert \\Delta x \\Vert_2 \\ll \\sigma_m^{-1} \\Vert \\Delta b \\Vert_2$. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b094adc3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 2 (10 points)\n",
|
||||
"\n",
|
||||
"In class, I said that solving least-square problems via $\\hat{R}\\hat{x} = \\hat{Q}^* b$ using the thin QR factorization ($A = \\hat{Q} \\hat{R}$ where both $A$ and $\\hat{Q}$ are $m \\times n$ with rank $n$) is good because it avoids squaring the condition number.\n",
|
||||
"\n",
|
||||
"In particular, I said $\\kappa(A^T A) = \\kappa(A)^2$ but $\\kappa(\\hat{R}) = \\kappa(A)$. Derive this. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e88bb43c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"Note that $A = \\hat{Q} \\hat{R}$ and $\\hat{Q}^T \\hat{Q} = I$ (its columns are orthonormal), so $\\hat{R} = \\hat{Q}^T A$.\n",
|
||||
"\n",
|
||||
"First, let's show that $\\kappa(A^T A) = \\kappa(A)$. This is easy, since if $A = U\\Sigma V^T$ then $A^T A = V \\Sigma^T \\Sigma V^T$ is exactly an SVD with the squared singular values ($\\Sigma^T \\Sigma$ is diagonal with $\\sigma_k^2$ on the diagonal, as we also saw in class.) If the singular values of $A^TA$ are squared, then the condition number is squared.\n",
|
||||
"\n",
|
||||
"Second, the singular values of $\\hat{R}$ are the square roots of eigenvalues of:\n",
|
||||
"$$\n",
|
||||
"\\hat{R}^T \\hat{R} = A^T \\hat{Q} \\hat{Q}^T A \\, .\n",
|
||||
"$$\n",
|
||||
"Now $\\hat{Q} \\hat{Q}^T \\ne I$ since $\\hat{Q}$ is not square, but it is *orthogonal projection* onto $C(\\hat{Q}) = C(A)$, so applying this projection to $A$ yields $A$! i.e. $\\hat{Q} \\hat{Q}^T A = A$ and so\n",
|
||||
"$\\boxed{\\hat{R}^T \\hat{R} = A^T A}$, so they have the **same singular values** and hence the same condition number $\\boxed{\\kappa(\\hat{R}) = \\kappa(A)}$."
|
||||
]
|
||||
},
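{
"cell_type": "markdown",
"id": "3f1c2a7b",
"metadata": {},
"source": [
"As a quick numerical sanity check (a minimal sketch with an arbitrary small random tall matrix $A$), we can compare $\\kappa(\\hat{R})$ and $\\kappa(A^T A)$ against $\\kappa(A)$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d4e8c21",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra\n",
"A = randn(100, 10)          # random 100×10 matrix (full column rank with probability 1)\n",
"R̂ = qr(A).R                 # n×n triangular factor of the thin QR\n",
"@show cond(A) cond(R̂)       # these two should match\n",
"@show cond(A'A) cond(A)^2;  # κ(AᵀA) should be the square of κ(A)"
]
},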
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bffaf19f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 3 (10 points)\n",
|
||||
"\n",
|
||||
"Another form of regularized least-squares problem is to assume you know something about what the solution $x$ should look like, i.e. you have a \"prior\" $x_p$ guess of the solution. Suppose you then want to solve\n",
|
||||
"$$\n",
|
||||
"\\min_x \\left(\\Vert b - Ax \\Vert_2^2 + \\delta^2 \\Vert x - x_p \\Vert_2^2 \\right)\n",
|
||||
"$$\n",
|
||||
"where $\\delta \\ne 0$ is a regularization strength (larger δ means that the solution will be closer to $x_p$).\n",
|
||||
"\n",
|
||||
"Show how this is equivalent to solving an equation of the form $(A^T A + ?)\\hat{x} = ?$. (Very similar to how we handled the Tikhonov regularization in class, which corresponds to the special case $x_p = 0$!)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2454ba7e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"We can solve this in several ways. One of the simplest is a change of variables: let $y = x - x_p$. Then $Ax = Ay + Ax_p$, and our equation becomes ordinary Tikhonov least squares:\n",
|
||||
"$$\n",
|
||||
"\\min_y \\left( \\Vert b - Ax_p - Ay \\Vert_2^2 + \\delta^2 \\Vert y \\Vert_2^2 \\right) \\implies \\left(A^T A + \\delta^2 I \\right) \\hat{y} = A^T \\left( b - Ax_p \\right)\n",
|
||||
"$$\n",
|
||||
"Substituting back in $\\hat{y} = \\hat{x} - x_p$ yields:\n",
|
||||
"$$\n",
|
||||
"\\boxed{\\left(A^T A + \\delta^2 I \\right) \\hat{x}} = (A^T b - A^T Ax_p) + \\left(A^T A + \\delta^2 I \\right) x_p \\boxed{ = A^T b + \\delta^2 x_p}\n",
|
||||
"$$\n",
|
||||
"which is identical to the Tikhonov normal equations but with a modified right-hand side.\n",
|
||||
"\n",
|
||||
"Another way to derive this is to follow the procedure from class and \"stack\" the two norms into a single $\\ell^2$ norm to get an ordinary least-squares problem:\n",
|
||||
"$$\n",
|
||||
"\\Vert b - Ax \\Vert_2^2 + \\delta^2 \\Vert x - x_p \\Vert_2^2\n",
|
||||
"= \\left\\Vert \\begin{pmatrix} b - Ax \\\\ -\\delta x_p + \\delta Ix \\end{pmatrix} \\right\\Vert_2^2 = \\left\\Vert \\underbrace{\\begin{pmatrix} b \\\\ -\\delta x_p \\end{pmatrix}}_c - \\underbrace{\\begin{pmatrix} A \\\\ -\\delta I \\end{pmatrix}}_B x \\right\\Vert_2^2\n",
|
||||
"$$\n",
|
||||
"which is an ordinary least-squares problem with a new matrix $B$ and a new right-hand side, giving normal equations\n",
|
||||
"$$\n",
|
||||
"\\underbrace{B^T B}_{A^T A + \\delta^2 I} \\hat{x} = \\underbrace{B^T c}_{A^T b + \\delta^2 x_p}\n",
|
||||
"$$\n",
|
||||
"equivalent to those above."
|
||||
]
|
||||
},
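{
"cell_type": "markdown",
"id": "1f7c9a02",
"metadata": {},
"source": [
"As a numerical sanity check (a minimal sketch with arbitrary random data, prior $x_p$, and regularization strength $\\delta$), the modified normal equations should give the same $\\hat{x}$ as the stacked least-squares formulation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a8d0b13",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra\n",
"m, n = 30, 10\n",
"A, b, xp = randn(m, n), randn(m), randn(n)  # random data and prior guess xₚ\n",
"δ = 0.5                                     # arbitrary regularization strength\n",
"x1 = (A'A + δ^2*I) \\ (A'b + δ^2*xp)         # modified normal equations\n",
"x2 = [A; -δ*Matrix(I, n, n)] \\ [b; -δ*xp]   # stacked least-squares with B and c\n",
"norm(x1 - x2)                               # ≈ 0 up to roundoff"
]
},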
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5d59e8a0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 4 (10 points)\n",
|
||||
"\n",
|
||||
"In pset 1, problem 3, you took a $1000 \\times 1000$ matrix $A$ of rank 4 and multiplied it by a random $1000 \\times 10$ matrix $X$, and we argued that the rank of $A$ almost certainly equalled the rank of $AX$ or $(AX)^T (AX)$ or $X^T AX$. However, the *values* of the singular values were not the same.\n",
|
||||
"\n",
|
||||
"In this, problem, you'll see what happens if you orthonormalize the outputs $AX$. We can orthonormalize $AX$ using `Q = Matrix(qr(A*X).Q)` in Julia, which forms the thin QR factorization.\n",
|
||||
"\n",
|
||||
"For the same rank-4 `A` and random `X` as in pset 1, problem 3 (or at least, generated randomly by the same code, reproduced below), compute `Q` and use it to obtain a *much smaller matrix* (e.g. 10x10, 1000x10, or 10x1000) whose 4 biggest singular values *nearly match* those of `A`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "91cda73e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"As discussed in class (lecture 13 and [these notes](https://gregorygundersen.com/blog/2019/01/17/randomized-svd/)), since we expect $Q$ to span $C(A)$, we should have $QQ^TA = A$, and hence the singular values of the much smaller ($10\\times 1000$) matrix $S = Q^TA$ should nearly match those of $A$:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "74797236",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"(svdvals(A))[1:6] = [1063.8855386293208, 1004.87019779304, 976.5002449189662, 948.0683756097217, 2.5808249126107803e-12, 2.2288314628390244e-12]\n",
|
||||
"(svdvals(S))[1:6] = [1063.8855386293203, 1004.8701977930399, 976.5002449189659, 948.0683756097219, 3.14037555395831e-14, 2.697874343992467e-14]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"using LinearAlgebra\n",
|
||||
"A = randn(1000, 4) * randn(4, 1000) # random 1000x1000 matrix of rank 4\n",
|
||||
"@show svdvals(A)[1:6]\n",
|
||||
"\n",
|
||||
"X = randn(1000, 10)\n",
|
||||
"Q = Matrix(qr(A*X).Q)\n",
|
||||
"\n",
|
||||
"S = Q'*A # much smaller matrix using Q and/or A and/or X\n",
|
||||
"@show svdvals(S)[1:6];"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "49175961",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"By eye, the first 4 (nonzero) singular values nearly match!\n",
|
||||
"\n",
|
||||
"Let's be more quantitative and look at the difference $\\Delta \\sigma_k$ in the singular values. We want this to be small, but small compared to what? An appropriate yardstick here is $\\sigma_1$, the largest singular value of $A$, or equivalently $\\Vert A \\Vert_2$:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "c9e400b2",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"6-element Vector{Float64}:\n",
|
||||
" 4.274401092737358e-16\n",
|
||||
" 1.0686002731843394e-16\n",
|
||||
" 3.2058008195530186e-16\n",
|
||||
" -2.137200546368679e-16\n",
|
||||
" 2.396330304814367e-15\n",
|
||||
" 2.069633094398391e-15"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"σ_A = svdvals(A)\n",
|
||||
"σ_S = svdvals(S)\n",
|
||||
"(σ_A[1:6] - σ_S[1:6]) / σ_A[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e68cc1b7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Hooray, they match to within a factor of ≈10 of [machine precision](https://en.wikipedia.org/wiki/Machine_epsilon), given by `eps()` in Julia:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "f589dc35",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"2.220446049250313e-16"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"eps()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b9415ba3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 5 (10 points)\n",
|
||||
"\n",
|
||||
"Suppose that $X$ is a random $m \\times n$ matrix (drawn from some probability distribution) and its mean (expectation value) is $E[X] = X_0$.\n",
|
||||
"\n",
|
||||
"Derive the following identity for the variance (which we used in class):\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"E[\\Vert X - X_0 \\Vert_F^2] = E[\\Vert X \\Vert_F^2] - \\Vert X_0 \\Vert_F^2\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"(If you just write out the Frobenius norm definition $\\Vert A \\Vert_F^2 = \\text{tr}[A^T A]$, the problem should be very straightforward. Remember that the expectation value is conceptually just a sum, so it can be moved inside any linear operation like the trace.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bc5addc1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"As indicated by the hint, let's write out the trace formula for the Frobenius norm and use linearity:\n",
|
||||
"$$\n",
|
||||
"E[\\Vert X - X_0 \\Vert_F^2] = E[\\text{tr} (X - X_0)^T(X - X_0)] = \\\\\n",
|
||||
"\\underbrace{E[\\text{tr}(X^T X)]}_{E[\\Vert X \\Vert_F^2]} \\; - \\underbrace{E[\\text{tr}(X^T X_0)]}_{\\text{tr}(E[X]^T X_0)=\\text{tr}(X_0^T X_0)=\\Vert X_0 \\Vert_F^2} - \\underbrace{E[\\text{tr}(X_0^T X)]}_{\\text{tr}(X_0^T E[X])=\\text{tr}(X_0^T X_0)=\\Vert X_0 \\Vert_F^2} + \\;\\underbrace{E[\\text{tr}(X_0^T X_0)]}_{\\Vert X_0 \\Vert_F^2} \\\\\n",
|
||||
"= E[\\Vert X \\Vert_F^2] - \\Vert X_0 \\Vert_F^2\n",
|
||||
"$$\n",
|
||||
"The key point is that $X_0$ is a constant and trace is linear (a sum of traces = trace of sum!), so the expectation value can be moved inside the trace terms to convert $X$ to $X_0$."
|
||||
]
|
||||
},
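{
"cell_type": "markdown",
"id": "8c3b51e9",
"metadata": {},
"source": [
"A quick Monte-Carlo sanity check (a sketch assuming Gaussian noise around an arbitrary fixed $X_0$; the two sides agree only up to sampling error $\\sim 1/\\sqrt{N}$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b60d2c4",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra, Statistics\n",
"m, n, N = 5, 3, 10^5\n",
"X0 = randn(m, n)                                # fixed mean matrix X₀\n",
"Xs = [X0 + randn(m, n) for _ in 1:N]            # samples X with E[X] = X₀\n",
"lhs = mean(norm(X - X0)^2 for X in Xs)          # E‖X - X₀‖²_F (norm of a matrix is Frobenius)\n",
"rhs = mean(norm(X)^2 for X in Xs) - norm(X0)^2  # E‖X‖²_F - ‖X₀‖²_F\n",
"lhs, rhs                                        # ≈ equal (both ≈ mn = 15 here)"
]
},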
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "55b01fd5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 6 (5+5 points)\n",
|
||||
"\n",
|
||||
"Suppose we are solving a generalized/weighted least-square problem:\n",
|
||||
"$$\n",
|
||||
"\\min_{x \\in \\mathbb{R}^n} \\Vert b - Ax \\Vert_W\n",
|
||||
"$$\n",
|
||||
"where $A$ is $m \\times n$, $W = W^T$ is an $m \\times m$ positive-definite \"weight\" matrix, and $\\Vert y \\Vert_W = \\sqrt{y^T W y}$ is a weighted $\\ell^2$ norm.\n",
|
||||
"\n",
|
||||
"**(a)** Show that this is equivalent to an ordinary least-square problem of minimizing $\\Vert d - Cx \\Vert_2 = \\Vert b - Ax \\Vert_W$ for some matrix $C$ and vector $d$. (Hint: use the fact that any positive-definite matrix $W$ can be factored as $W = $ \\_\\_\\_\\_\\_.)\n",
|
||||
"\n",
|
||||
"**(b)** Show that the normal equations $C^T C \\hat{x} = C^T d$ are equivalent to the generalized normal equations $A^T W A \\hat{x} = A^T W b$ given in class."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7ac0dbc1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"**(a)** From class, *any* positive-definite matrix $W$ can be factored as $W = B^T B$ for some matrix $B$. (There are many possible factors $B$. Computationally, one can use the [matrix square root](https://en.wikipedia.org/wiki/Square_root_of_a_matrix) $B = \\sqrt{W}$, or more efficiently one can use $B$ to be the [Cholesky factor](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $W$.) If we then plug this in to our least-square objective, we obtain:\n",
|
||||
"$$\n",
|
||||
"\\Vert b - Ax \\Vert_W = (b-Ax)^T B^T B (b-Ax) = \\Vert B(b-Ax) \\Vert_2 = \\Vert d - Cx \\Vert_2\n",
|
||||
"$$\n",
|
||||
"for $d = Bb$ and $C = BA$.\n",
|
||||
"\n",
|
||||
"**(b)** The normal equations now give:\n",
|
||||
"$$\n",
|
||||
"C^T C \\hat{x} = (BA)^T (BA) \\hat{x} = A^T \\underbrace{B^T B}_W A \\hat{x} \\\\ = C^T d = (BA)^T Bb = A^T \\underbrace{B^T B}_W b\n",
|
||||
"$$\n",
|
||||
"which matches the equations from class.\n"
|
||||
]
|
||||
},
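{
"cell_type": "markdown",
"id": "6e2f4d8a",
"metadata": {},
"source": [
"As a numerical check (a minimal sketch using an arbitrary random positive-definite $W$ and its Cholesky factor as $B$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d9a7e31",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra\n",
"m, n = 20, 5\n",
"A, b = randn(m, n), randn(m)\n",
"M = randn(m, m); W = Symmetric(M'M + I)  # random positive-definite weight matrix\n",
"B = cholesky(W).U                        # W = BᵀB with B the upper-triangular Cholesky factor\n",
"C, d = B*A, B*b\n",
"x1 = C \\ d                               # ordinary least squares with C and d\n",
"x2 = (A'*W*A) \\ (A'*W*b)                 # generalized normal equations from class\n",
"norm(x1 - x2)                            # ≈ 0 up to roundoff"
]
},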
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "88ab5b5e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Problem 7 (5+5 points)\n",
|
||||
"\n",
|
||||
"The most common form of least-squares is linear regression, i.e. fitting $m$ data points $(a_i, b_i)$ to a model of the form $b(a) = x_1 + a x_2$.\n",
|
||||
"\n",
|
||||
"Suppose the data points $b_i$ have independent random errors with equal variances $\\sigma^2$ (i.e. they are \"homoscedastic\"). (This is the case in which Gauss–Markov says that ordinary least-squares minimizes the variance.) In this case, many sources give simple explicit formulas for the variances of the fit coefficients $x_1$ and $x_2$\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"\\text{variance of }\\hat{x}_1 = \\frac{\\sigma^2 \\sum_{i=1}^m a_i^2} {m \\sum_{j=1}^m (a_j - \\text{mean } a)^2}\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"\\text{variance of }\\hat{x}_2 = \\frac{\\sigma^2 } {\\sum_{j=1}^m (a_j - \\text{mean } a)^2}\n",
|
||||
"$$\n",
|
||||
"\n",
|
||||
"**Derive these formulas** from the more general equation $W = (A^T V^{-1} A)^{-1}$ for the covariance matrix of $\\hat{x}$ that we found in class for weighted least squares (where $V$ is the covariance matrix of $b$).\n",
|
||||
"\n",
|
||||
"(The formula for the inverse of a 2x2 matrix will be helpful; google it.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "940f45e9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Solution:\n",
|
||||
"\n",
|
||||
"First, we need the matrix $A$ that yields our linear least-square fitting problem for $Ax \\approx b$. For the model $x_1 + a x_2$, this is simply:\n",
|
||||
"$$\n",
|
||||
"A = \\begin{pmatrix} 1 & a_1 \\\\ 1 & a_2 \\\\ \\vdots & \\vdots \\\\ 1 & a_m \\end{pmatrix}\n",
|
||||
"$$\n",
|
||||
"Moreover, for homoscedastic data $b$, the variance is simply:\n",
|
||||
"$$\n",
|
||||
"V = E[(b - \\text{mean }b) (b - \\text{mean }b)^T] = \\sigma^2 I\n",
|
||||
"$$\n",
|
||||
"So, we have\n",
|
||||
"$$\n",
|
||||
"A^T V^{-1} A = \\sigma^{-2} A^T A = \\sigma^{-2} \\begin{pmatrix} m & \\sum_i a_i \\\\ \\sum_i a_i & \\sum_i a_i^2 \\end{pmatrix} = \\sigma^{-2} \\begin{pmatrix} m & m a_0 \\\\ m a_0 & \\sum_i a_i^2 \\end{pmatrix}\n",
|
||||
"$$\n",
|
||||
"where for convenience we define $a_0 = \\text{mean } a$.\n",
|
||||
"\n",
|
||||
"To invert this matrix and find $W$, we need the determinant of the $2\\times 2$ matrix $(\\cdots)$ in parentheses, which is:\n",
|
||||
"$$\n",
|
||||
"\\det \\begin{pmatrix} m & m a_0 \\\\ m a_0 & \\sum_i a_i^2 \\end{pmatrix} \n",
|
||||
"= m \\sum_i a_i^2 - m^2 a_0^2 = m^2 (\\text{mean}[a^2] - a_0) = m^2 \\text{mean}[(a-a_0)^2]\n",
|
||||
"$$\n",
|
||||
"where the last identity is exactly the scalar version of the identity from problem 5 (i.e. for \"$1\\times 1$ matrices\" $X=a$).\n",
|
||||
"\n",
|
||||
"Then, applying the well-known formula $\\begin{pmatrix} a & b \\\\ c & d \\end{pmatrix}^{-1} = \\frac{1}{ad-bc} \\begin{pmatrix} d & -b \\\\ -c & a \\end{pmatrix}$\n",
|
||||
"we obtain\n",
|
||||
"$$\n",
|
||||
"W = E[(\\hat{x} - \\text{mean }\\hat{x}) (\\hat{x} - \\text{mean }\\hat{x})^T] = \\sigma^2 (A^T A)^{-1} \\\\\n",
|
||||
"= \\frac{\\sigma^2}{m\\sum_i (a_i - a_0)^2} \n",
|
||||
"\\begin{pmatrix} \\sum_i a_i^2 & -m a_0 \\\\ -m a_0 & m \\end{pmatrix}\n",
|
||||
"= \\begin{pmatrix} \\text{variance of }\\hat{x}_1 & \\text{covariance} \\\\\n",
|
||||
"\\text{covariance} & \\text{variance of }\\hat{x}_2 \\end{pmatrix}\n",
|
||||
"$$\n",
|
||||
"The diagonal elements of the covariance matrix $W$ are the desired variances, and by inspection match the formulas above.\n",
|
||||
"\n",
|
||||
"(Note also that these variances should go $\\to 0$ as $m \\to \\infty$: more data gives a more precise fit!)"
|
||||
]
|
||||
},
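{
"cell_type": "markdown",
"id": "4c8e1b57",
"metadata": {},
"source": [
"A quick numerical check of these formulas against the diagonal of $\\sigma^2 (A^T A)^{-1}$ (a sketch with arbitrary random data locations $a_i$ and noise level $\\sigma$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a5f9c26",
"metadata": {},
"outputs": [],
"source": [
"using LinearAlgebra, Statistics\n",
"m = 100\n",
"a = randn(m)                 # arbitrary data locations aᵢ\n",
"σ = 0.1                      # noise standard deviation\n",
"A = [ones(m) a]              # design matrix for the model b(a) = x₁ + a x₂\n",
"W = σ^2 * inv(A'A)           # covariance matrix of x̂ when V = σ²I\n",
"var1 = σ^2 * sum(a.^2) / (m * sum((a .- mean(a)).^2))  # formula for the variance of x̂₁\n",
"var2 = σ^2 / sum((a .- mean(a)).^2)                    # formula for the variance of x̂₂\n",
"(W[1,1], var1), (W[2,2], var2)                         # each pair should match"
]
},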
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f45c3283",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Julia 1.8.3",
|
||||
"language": "julia",
|
||||
"name": "julia-1.8"
|
||||
},
|
||||
"language_info": {
|
||||
"file_extension": ".jl",
|
||||
"mimetype": "application/julia",
|
||||
"name": "julia",
|
||||
"version": "1.8.3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}