Added derivation of Joseph equation and optimal K

2018-05-06 08:36:39 -07:00 · 2018-05-06 08:36:39 -07:00 · be136c1425
commit be136c1425
parent e9fe8621a0
1 changed files with 179 additions and 2 deletions
--- a/07-Kalman-Filter-Math.ipynb
+++ b/07-Kalman-Filter-Math.ipynb
@ -1198,6 +1198,180 @@
    "If you do this, 'lower right term' means the most rapidly changing term for each variable. If the state is $x=\\begin{bmatrix}x & \\dot x & \\ddot{x} & y & \\dot{y} & \\ddot{y}\\end{bmatrix}^\\mathsf{T}$ Then Q will be 6x6; the elements for both $\\ddot{x}$ and $\\ddot{y}$ will have to be set to non-zero in $\\mathbf Q$."
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Stable Compution of the Posterior Covariance\n",
+    "\n",
+    "I've presented the equation to compute the posterior covariance as\n",
+    "\n",
+    "$$\\mathbf P = (\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P}$$\n",
+    "\n",
+    "and while strictly speaking this is correct, this is not how I compute it in `FilterPy`, where I use the *Joseph* equation\n",
+    "\n",
+    "\n",
+    "$$\\mathbf P = (\\mathbf I-\\mathbf {KH})\\mathbf{\\bar P}(\\mathbf I-\\mathbf{KH})^\\mathsf T + \\mathbf{KRK}^\\mathsf T$$\n",
+    "\n",
+    "\n",
+    "I frequently get emails and/or GitHub issues raised, claiming the implementation is a bug. It is not a bug, and I use it for several reasons. First, the subtraction $(\\mathbf I - \\mathbf{KH})$ can lead to nonsymmetric matrices results due to floating point errors. Covariances must be symmetric, and so becoming nonsymmetric usually leads to the Kalman filter diverging, or even for the code to raise an exception because of the checks built into `NumPy`.\n",
+    "\n",
+    "A traditional way to preserve symmetry is the following formula:\n",
+    "\n",
+    "$$\\mathbf P = (\\mathbf P + \\mathbf P^\\mathsf T) / 2$$\n",
+    "\n",
+    "This is safe because $\\sigma_{ij} = \\sigma_{ji}$ for all covariances in the matrix. Hence this operation averages the error between the differences of the two values if they have diverged due to floating point errors. \n",
+    "\n",
+    "If you look at the Joseph form for the equation above, you'll see there is a similar $\\mathbf{ABA}^\\mathsf T$ pattern in both terms. So they both preserve symmetry. But where did this equation come from, and why do I use it instead of\n",
+    "\n",
+    "\n",
+    "$$\\mathbf P = (\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P} \\\\\n",
+    "\\mathbf P = (\\mathbf P + \\mathbf P^\\mathsf T) / 2$$\n",
+    "\n",
+    "\n",
+    "Let's just derive the equation from first principles. It's not too bad, and you need to understand the derivation to understand the purpose of the equation, and, more importantly, diagnose issues if you filter diverges due to numerical instability. This derivation comes from Brown[4].\n",
+    "\n",
+    "First, some symbology. $\\mathbf x$ is the true state of our system. $\\mathbf{\\hat x}$ is the estimated state of our system - the posterior. And $\\mathbf{\\bar x}$ is the estimated prior of the system. \n",
+    "\n",
+    "\n",
+    "Given that, we can define our model to be\n",
+    "\n",
+    "$$\\mathbf x_{k+1} = \\mathbf F_k \\mathbf x_k + \\mathbf w_k \\\\\n",
+    "\\mathbf z_k = \\mathbf H_k \\mathbf x_k + \\mathbf v_k$$\n",
+    "\n",
+    "In words, the next state $\\mathbf x_{k+1}$ of the system is the current state $k$ moved by some process $\\mathbf F_k$ plus some noise $\\mathbf w_k$. \n",
+    "\n",
+    "Note that these are definitions. No system perfectly follows a mathematical model, so we model that with the noise term $\\mathbf w_k$. And no measurement is perfect due to sensor error, so we model that with $\\mathbf v_k$\n",
+    "\n",
+    "I'll dispense with the subscript $k$ since in the remainder of the derivation we will only consider values at step $k$, never step $k+1$.\n",
+    "\n",
+    "Now we define the estimation error as the difference between the true state and the estimated state\n",
+    "\n",
+    "$$ \\mathbf e = \\mathbf x - \\mathbf{\\hat x}$$\n",
+    "\n",
+    "Again, this is a definition; we don't know how to compute $\\mathbf e$, it is just the defined difference between the true and estimated state.\n",
+    "\n",
+    "This allows us to define the covariance of our estimate, which is defined as the expected value of $\\mathbf{ee}^\\mathsf T$:\n",
+    "\n",
+    "$$\\begin{aligned}\n",
+    "P &= E[\\mathbf e, \\mathbf e^\\mathsf T] \\\\\n",
+    "&= E[(\\mathbf x - \\mathbf{\\hat x})(\\mathbf x - \\mathbf{\\hat x})^\\mathsf T]\n",
+    "\\end{aligned}$$\n",
+    "\n",
+    "\n",
+    "Next, we define the posterior estimate as\n",
+    "\n",
+    "$$\\mathbf {\\hat x} = \\mathbf{\\bar x} + \\mathbf K(\\mathbf z - \\mathbf{H \\bar x})$$\n",
+    "\n",
+    "That looks like the equation from the Kalman filter, and for good reason. But as with the rest of the math so far, this is a **definition**. In particular, we have not defined $\\mathbf K$, and you shouldn't think of it as the Kalman gain, because we are solving this for *any* problem, not just for linear Kalman filters. Here, $\\mathbf K$ is just some unspecified blending value between 0 and 1."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we have our definitions, let's perform some substitution and algebra.\n",
+    "\n",
+    "The term $(\\mathbf x - \\mathbf{\\hat x})$ can be expanded by replacing $\\mathbf{\\hat x}$ with the definition above, yielding\n",
+    "\n",
+    "$$(\\mathbf x - \\mathbf{\\hat x}) = \\mathbf x - (\\mathbf{\\bar x} + \\mathbf K(\\mathbf z - \\mathbf{H \\bar x}))$$\n",
+    "\n",
+    "Now we replace $\\mathbf z$ with $\\mathbf H \\mathbf x + \\mathbf v$:\n",
+    "\n",
+    "$$\\begin{aligned}\n",
+    "(\\mathbf x - \\mathbf{\\hat x})\n",
+    "&= \\mathbf x - (\\mathbf{\\bar x} + \\mathbf K(\\mathbf z - \\mathbf{H \\bar x})) \\\\\n",
+    "&= \\mathbf x - (\\mathbf{\\bar x} + \\mathbf K(\\mathbf H \\mathbf x + \\mathbf v - \\mathbf{H \\bar x})) \\\\\n",
+    "&= (\\mathbf x - \\mathbf{\\bar x}) - \\mathbf K(\\mathbf H \\mathbf x + \\mathbf v - \\mathbf{H \\bar x}) \\\\\n",
+    "&= (\\mathbf x - \\mathbf{\\bar x}) - \\mathbf{KH}(\\mathbf x - \\mathbf{ \\bar x}) - \\mathbf{Kv} \\\\\n",
+    "&=  (\\mathbf I - \\mathbf{KH})(\\mathbf x - \\mathbf{\\bar x}) - \\mathbf{Kv}\n",
+    "\\end{aligned}$$\n",
+    "\n",
+    "Now we can solve for $\\mathbf P$ if we note that the expected value of $(\\mathbf x - \\mathbf{\\bar x})$ is the prior covariance $\\mathbf{\\bar P}$, and that the expected value of $\\mathbf v$ is $E[\\mathbf{vv}^\\mathbf T] = \\mathbf R$:\n",
+    "\n",
+    "$$\\begin{aligned}\n",
+    "\\mathbf P &= \n",
+    "   E\\big[[(\\mathbf I - \\mathbf{KH})(\\mathbf x - \\mathbf{\\bar x}) - \\mathbf{Kv})]\n",
+    "  [(\\mathbf I - \\mathbf{KH})(\\mathbf x - \\mathbf{\\bar x}) - \\mathbf{Kv}]^\\mathsf T\\big ] \\\\\n",
+    "  &= (\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P}(\\mathbf I - \\mathbf{KH})^\\mathsf T + \\mathbf{KRK}^\\mathsf T\n",
+    "\\end{aligned}$$\n",
+    "\n",
+    "which is what we came here to prove.\n",
+    "\n",
+    "Note that this equation is valid for *any* $\\mathbf K$, not just the optimal $\\mathbf K$ computed by the Kalman filter. And that is why I use this equation. In practice the Kalman gain computed by the filter is *not* the optimal value both because the real world is never truly linear and Gaussian, and because of floating point errors induced by computation. This equation is far less likely to cause the Kalman filter to diverge in the face of real world conditions.\n",
+    "\n",
+    "Where did $\\mathbf P = (\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P}$ come from, then? Let's finish the derivation, which is simple. Recall that the Kalman filter (optimal) gain is given by\n",
+    "\n",
+    "$$\\mathbf K = \\mathbf{\\bar P H^\\mathsf T}(\\mathbf{H \\bar P H}^\\mathsf T + \\mathbf R)^{-1}$$\n",
+    "\n",
+    "Now we substitute this into the equation we just derived:\n",
+    "\n",
+    "$$\\begin{aligned}\n",
+    "&= (\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P}(\\mathbf I - \\mathbf{KH})^\\mathsf T + \\mathbf{KRK}^\\mathsf T\\\\\n",
+    "&= \\mathbf{\\bar P} - \\mathbf{KH}\\mathbf{\\bar P} - \\mathbf{\\bar PH}\\mathbf{K}^\\mathsf T + \\mathbf K(\\mathbf{HPH}^\\mathsf T + \\mathbf R)\\mathbf K^\\mathsf T \\\\\n",
+    "&= \\mathbf{\\bar P} - \\mathbf{KH}\\mathbf{\\bar P} - \\mathbf{\\bar PH}\\mathbf{K}^\\mathsf T + \\mathbf{\\bar P H^\\mathsf T}(\\mathbf{H \\bar P H}^\\mathsf T + \\mathbf R)^{-1}(\\mathbf{HPH}^\\mathsf T + \\mathbf R)\\mathbf K^\\mathsf T\\\\\n",
+    "&= \\mathbf{\\bar P} - \\mathbf{KH}\\mathbf{\\bar P} - \\mathbf{\\bar PH}\\mathbf{K}^\\mathsf T + \\mathbf{\\bar P H^\\mathsf T}\\mathbf K^\\mathsf T\\\\\n",
+    "&= \\mathbf{\\bar P} - \\mathbf{KH}\\mathbf{\\bar P}\\\\\n",
+    "&= (\\mathbf I - \\mathbf{KH})\\mathbf P\n",
+    "\\end{aligned}$$\n",
+    "\n",
+    "Therefore $\\mathbf P = (\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P}$ is mathematically correct when the gain is optimal, but so is $(\\mathbf I - \\mathbf{KH})\\mathbf{\\bar P}(\\mathbf I - \\mathbf{KH})^\\mathsf T + \\mathbf{KRK}^\\mathsf T$. As we already discussed the latter is also correct when the gain is suboptimal, and it is also more numerically stable. Therefore I use this computation in FilterPy.\n",
+    "\n",
+    "It is quite possible that your filter still diverges, especially if it runs for hundreds or thousands of epochs. You will need to examine these equations. The literature provides yet other forms of this computation which may be more applicable to your problem. As always, if you are solving real engineering problems where failure could mean loss of equipment or life, you will need to move past this book and into the engineering literature. If you are working with 'toy' problems where failure is not damaging, if you detect divergence you can just reset the value of $\\mathbf P$ to some 'reasonable' value and keep on going. For example, you could zero out the non diagonal elements so the matrix only contains variances, and then maybe multiply by a constant somewhat larger than one to reflect the loss of information you just injected into the filter. Use your imagination, and test."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Deriving the Kalman Gain Equation\n",
+    "\n",
+    "If you read the last section, you might as well read this one. With this we will have derived the Kalman filter equations.\n",
+    "\n",
+    "Note that this derivation is *not* using Bayes equations. I've seen at least four different ways to derive the Kalman filter equations; this derivation is typical to the literature, and follows from the last section. The source is again Brown [4].\n",
+    "\n",
+    "In the last section we used an unspecified scaling factor $\\mathbf K$ to derive the Joseph form of the covariance equation. If we want an optimal filter, we need to use calculus to minimize the errors in the equations. You should be familiar with this idea. If you want to find the minimum value of a function $f(x)$, you take the derivative and set it equal to zero: $\\frac{x}{dx}f(x) = 0$.\n",
+    "\n",
+    "In our problem the error is expressed by the covariance matrix $\\mathbf P$. In particular, the diagonal expresses the error (variance) of each element in the state vector. So, to find the optimal gain we want to take the derivative of the trace (sum) of the diagonal.\n",
+    "\n",
+    "Brown reminds us of two formulas involving the derivative of traces:\n",
+    "\n",
+    "$$\\frac{d\\, trace(\\mathbf{AB})}{d\\mathbf A} = \\mathbf B^\\mathsf T$$\n",
+    "\n",
+    "$$\\frac{d\\, trace(\\mathbf{ACA}^\\mathsf T)}{d\\mathbf A} = 2\\mathbf{AC}$$\n",
+    "\n",
+    "where $\\mathbf{AB}$ is square and $\\mathbf C$ is symmetric.\n",
+    "\n",
+    "\n",
+    "We expand out the Joseph equation to:\n",
+    "\n",
+    "$$\\mathbf P = \\mathbf{\\bar P} - \\mathbf{KH}\\mathbf{\\bar P} - \\mathbf{\\bar P}\\mathbf H^\\mathsf T \\mathbf K^\\mathsf T + \\mathbf K(\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R)\\mathbf K^\\mathsf T$$\n",
+    "\n",
+    "Now we need to the the derivative of the trace of $\\mathbf P$ with respect to $\\mathbf T$: $\\frac{d\\, trace(\\mathbf P)}{d\\mathbf K}$.\n",
+    "\n",
+    "The derivative of the trace the first term with respect to $\\mathbf K$ is $0$, since it does not have $\\mathbf K$ in the expression.\n",
+    "\n",
+    "The derivative of the trace of the second term is $(\\mathbf H\\mathbf{\\bar P})^\\mathsf T$.\n",
+    "\n",
+    "We can find the derivative of the trace of the third term by noticing that $\\mathbf{\\bar P}\\mathbf H^\\mathsf T \\mathbf K^\\mathsf T$ is the transpose of $\\mathbf{KH}\\mathbf{\\bar P}$. The trace of a matrix is equal to the trace of it's transpose, so it's derivative will be same as the second term.\n",
+    "\n",
+    "Finally, the derivative of the trace of the fourth term is $2\\mathbf K(\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R)$.\n",
+    "\n",
+    "This gives us the final value of \n",
+    "\n",
+    "$$\\frac{d\\, trace(\\mathbf P)}{d\\mathbf K} = -2(\\mathbf H\\mathbf{\\bar P})^\\mathsf T + 2\\mathbf K(\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R)$$\n",
+    "\n",
+    "We set this to zero and solve to find the equation for $\\mathbf K$ which minimizes the error:\n",
+    "\n",
+    "$$-2(\\mathbf H\\mathbf{\\bar P})^\\mathsf T + 2\\mathbf K(\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R) = 0\\\\\n",
+    "\\mathbf K(\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R) = (\\mathbf H\\mathbf{\\bar P})^\\mathsf T \\\\\n",
+    "\\mathbf K(\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R) = \\mathbf{\\bar P}\\mathbf H^\\mathsf T \\\\\n",
+    "\\mathbf K= \\mathbf{\\bar P}\\mathbf H^\\mathsf T (\\mathbf H \\mathbf{\\bar P}\\mathbf H^\\mathsf T + \\mathbf R)^{-1}\n",
+    "$$\n",
+    "\n",
+    "This derivation is not quite iron clad as I left out an argument about why minimizing the trace minimizes the total error, but I think it suffices for this book. Any of the standard texts will go into greater detail if you need it."
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@ -1679,7 +1853,10 @@
    " * [2] C.F. van Loan, \"Computing Integrals Involving the Matrix Exponential,\" IEEE *Transactions Automatic Control*, June 1978.\n",
    " \n",
    " \n",
-    " * [3] Calvetti, D and Somersalo E, \"Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing,\", *Springer*, 2007."
+    " * [3] Calvetti, D and Somersalo E, \"Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing,\", *Springer*, 2007.\n",
+    " \n",
+    " * [4] Brown, R. G. and Hwang, P. Y.C., \"Introduction to Random Signals and Applied Kalman Filtering\", *Wiley and Sons*, Fourth Edition, p.143-147, 2012. \n",
+    " "
   ]
  }
 ],
@ -1700,7 +1877,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.6.3"
+   "version": "3.6.4"
  }
 },
 "nbformat": 4,