Fixed section headings, added math

Added section in math chapter for Bayesian statistics, and improved
on the discussion of system dynamics.

The PDF generator cannot deal with the markdown for section headings
(using hash marks), so I needed to fix chapters that use them.

Appendix B still uses them, but for now that is just a list of
notes for myself, so I left that alone.

This contains the new PDF generated from these changes.
This commit is contained in:
Roger Labbe
2014-12-20 14:20:23 -08:00
parent 4d632bdc2d
commit c865b57a74
4 changed files with 326 additions and 77 deletions

View File

@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:85df0fc94fc4635a6e5ff492b47171d902e6302afab1b362a6d0272e085fd392"
"signature": "sha256:dedbd7f55073b104a3b3626e73108b6a60611b932a173ce508666a264357c866"
},
"nbformat": 3,
"nbformat_minor": 0,
@@ -278,7 +278,114 @@
"\n",
"Well, to be honest I have been choosing my problems carefully. For any arbitrary problem finding some of the matrices that we need to feed into the Kalman filter equations can be quite difficult. I haven't been *too tricky*, though. Equations like Newton's equations of motion can be trivially computed for Kalman filter applications, and they make up the bulk of the kind of problems that we want to solve. If you are a hobbyist, you can safely pass by this chapter for now, and perhaps forever. Some of the later chapters will assume the material in this chapter, but much of the work will still be accessible to you. \n",
"\n",
"But, I urge everyone to at least read the first section, and to skim the rest. It is not much harder than what you have done - the difficulty comes in finding closed form expressions for specific problems, not understanding the math in this chapter. \n"
"But, I urge everyone to at least read the first section, and to skim the rest. It is not much harder than what you have done - the difficulty comes in finding closed form expressions for specific problems, not understanding the math in this chapter. "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Bayesian Probability"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The title of this book is *Kalman and Bayesian Filters in Python* but to date I have not touched on the Bayesian aspect much. There was enough going on in the earlier chapters that adding this form of reasoning about filters could be a distraction rather than a help. I now wish to take some time to explain what Bayesian probability is and how a Kalman filter is in fact a Bayesian filter. This is not just a diversion. First of all, a lot of the Kalman filter literature uses this formulation when talking about filters, so you will need to understand what they are talking about. Second, this math plays a strong role in filtering design once we move past the Kalman filter. \n",
"\n",
"To do so we will go back to our first tracking problem - tracking a dog in a hallway. Recall the update step - we believed with some degree of precision that the dog was at position 7 (for example), and then received a measurement that the dog is at position 7.5. We want to incorporate that measurement into our belief. In the *Discrete Bayes* chapter we used histograms to denote our estimates at each hallway position, and in the \n",
"*One Dimensional Kalman Filters* chapter we used Gaussians. Both are methods of using *Bayesian* probability.\n",
"\n",
"Briefly, *Bayesian* probability is a branch of math that lets us evaluate a hypothesis or new data point given some uncertain information about the past. For example, suppose you are driving down your neighborhood street and see one of your neighbors at their door, using a key to let themselves in. Three doors down you see two people in masks breaking a window of another house. What might you conclude?\n",
"\n",
"It is likely that you would reason that in the first case your neighbors were getting home and unlocking their door to get inside. In the second case you at least strongly suspect a robbery is in progress. In the first case you would just proceed on, and in the second case you'd probably call the police.\n",
"\n",
"Of course, things are not always what they appear. Perhaps unbeknownst to you your neighbor sold their house that morning, and they were now breaking in to steal the new owner's belongings. In the second case, perhaps the owners of the house were at a costume event at the next house, they had a medical emergency with their child, realized they lost their keys, and were breaking into their own house to get the much needed medication. Those are both *unlikely* events, but possible. Adding a few additional pieces of information would allow you to determine the true state of affairs in all but the most complicated situations.\n",
"\n",
"These are instances of *Bayesian* reasoning. We take knowledge from the past and integrate in new information. You know that your neighbor owned their house yesterday, so it is still highly likely that they still own it today. You know that owners of houses normally have keys to the front door, and that the normal mode of entrance into your own house is not breaking windows, so the second case is *likely* to be a breaking and entering. The reasoning is not ironclad as shown by the alternative explanations, but it is likely."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Bayes' theorem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Bayes' theorem* mathematically formalizes the above reasoning. It is written as\n",
"\n",
"$$P(A|B) = \\frac{P(B | A)\\, P(A)}{P(B)}\\cdot$$\n",
"\n",
"\n",
"Before we do some computations, let's review what the terms mean. $P(A)$ is called the *prior probability* of the event A, and is often just shortened to the *prior*. What is the prior? It is just the probability of A being true *before* we incorporate new evidence. In our dog tracking problem above, the prior is the probability we assign to our belief that the dog is positioned at 7 before we make the measurement of 7.5. It is important to master this terminology if you expect to read a lot of the literature.\n",
"\n",
"$P(A|B)$ is the *conditional probability* that A is true given that B is true. For example, if it is true that your neighbor still owns their house, then it will be very likely that they are not breaking into their house. In Bayesian probability this is called the *posterior*, and it denotes our new belief after incorporating the measurement/knowledge of B. For our dog tracking problem the posterior is the probability given to the estimated position after incorporating the measurement 7.5. For the neighbor problem the posterior would be the probability of a break-in after you find out that your neighbor sold their home last week.\n",
"\n",
"What math did we use for the dog tracking problem? Recall that we used this equation to compute the new mean and probability distribution\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"N(estimate) * N(measurement) &= \\\\\n",
"N(\\mu_1, \\sigma_1^2)*N(\\mu_2, \\sigma_2^2) &= N(\\frac{\\sigma_1^2 \\mu_2 + \\sigma_2^2 \\mu_1}{\\sigma_1^2 + \\sigma_2^2},\\frac{1}{\\frac{1}{\\sigma_1^2} + \\frac{1}{\\sigma_2^2}}) \\cdot\n",
"\\end{aligned}\n",
"$$ \n",
"\n",
"\n",
"Here $N(\\mu_1, \\sigma_1^2)$ is the old estimated position, so $\\sigma_1$ is an indication of the uncertainty in our *prior*. $N(\\mu_2, \\sigma_2^2)$ is the mean and variance of our measurement, and so the result can be thought of as the new position and probability distribution after incorporating the new measurement. In other words, the variance of our *posterior distribution* is \n",
"\n",
"$$\\sigma^2 = \\frac{1}{\\frac{1}{{\\sigma_{estimate}}^2} + \\frac{1}{{\\sigma_{measurement}}^2}}$$"
]
},
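The Gaussian product above is easy to check numerically. Below is a minimal sketch; the positions 7 and 7.5 echo the dog example, while the variances are made up for illustration:

```python
def gaussian_multiply(mu1, var1, mu2, var2):
    # Product of N(mu1, var1) and N(mu2, var2), renormalized to a Gaussian,
    # using the mean and variance formulas from the equation above.
    mean = (var1 * mu2 + var2 * mu1) / (var1 + var2)
    variance = 1.0 / (1.0 / var1 + 1.0 / var2)
    return mean, variance

# prior: dog at 7 (variance 0.5); measurement: 7.5 (variance 0.3)
mean, var = gaussian_multiply(7.0, 0.5, 7.5, 0.3)
```

Note that the posterior variance (0.1875) is smaller than either input variance - incorporating a measurement always reduces our uncertainty.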
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is still a little hard to compare to Bayes' equation because we are dealing with probability distributions rather than probabilities. So let's cast our minds back to the discrete Bayes chapter where we computed the probability that our dog was at any given position in the hallway. It looked like this:\n",
"\n",
" def normalize(p):\n",
" s = sum(p)\n",
" for i in range (len(p)):\n",
" p[i] = p[i] / s\n",
"\n",
" def update(pos, measure, p_hit, p_miss):\n",
" q = np.array(pos, dtype=float)\n",
" for i in range(len(hallway)):\n",
" if hallway[i] == measure:\n",
" q[i] = pos[i] * p_hit\n",
" else:\n",
" q[i] = pos[i] * p_miss\n",
" normalize(q)\n",
" return q\n",
"\n",
"Let's rewrite this using our newly learned terminology.\n",
"\n",
" def update(prior_probability, measure, prob_hit, prob_miss):\n",
" posterior_probability = np.array(prior_probability, dtype=float)\n",
" for i in range(len(hallway)):\n",
" if hallway[i] == measure:\n",
"            posterior_probability[i] = prior_probability[i] * prob_hit\n",
"        else:\n",
"            posterior_probability[i] = prior_probability[i] * prob_miss\n",
"\n",
" return posterior_probability / sum(posterior_probability)\n",
"\n",
" \n",
"So what is this doing? It's multiplying the old belief that the dog is at position *i* (prior probability) with the probability that the measurement is correct for that position, and then dividing by the total probability for that new event.\n",
"\n",
"Now let's look at Bayes' equation again.\n",
"\n",
"$$P(A|B) = \\frac{P(B | A)\\, P(A)}{P(B)}\\cdot$$\n",
"\n",
"It's the same thing being calculated by the code. Multiply the prior ($P(A)$) by the probability of the measurement at each position ($P(B|A)$) and divide by the total probability for the event ($P(B)$).\n",
"\n",
"In other words the first half of the Discrete Bayes chapter developed Bayes' equation from a thought experiment. I could have just presented Bayes' equation and then given you the Python routine above to implement it, but chances are you would not have understood *why* Bayes' equation works. Presenting the equation first is the normal approach of Kalman filtering texts, and I always found it extremely nonintuitive. \n"
]
},
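To make the terminology concrete, here is a self-contained version of the rewritten update applied to a hypothetical five-position hallway; the layout and hit/miss probabilities are made up for illustration:

```python
import numpy as np

hallway = np.array([1, 0, 1, 0, 0])  # 1 = door, 0 = wall (hypothetical layout)

def update(prior_probability, measure, prob_hit, prob_miss):
    # Multiply the prior by the measurement likelihood at each position,
    # then normalize so the posterior sums to 1 - Bayes' equation in code.
    posterior_probability = np.array(prior_probability, dtype=float)
    for i in range(len(hallway)):
        if hallway[i] == measure:
            posterior_probability[i] = prior_probability[i] * prob_hit
        else:
            posterior_probability[i] = prior_probability[i] * prob_miss
    return posterior_probability / sum(posterior_probability)

# start with no knowledge (uniform prior), then sense a door
prior = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
posterior = update(prior, 1, prob_hit=0.8, prob_miss=0.2)
```

The posterior places most of the probability on the two door positions, exactly as the Discrete Bayes chapter showed.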
{
@@ -295,13 +402,13 @@
"source": [
"We need to start by understanding the underlying equations and assumptions that the Kalman filter uses. We are trying to model real world phenomena, so what do we have to consider?\n",
"\n",
"First, each physical system has a process. For example, a car traveling at a certain velocity goes so far in a fixed amount of time, and it's velocity varies as a function of its acceleration. We describe that behavior with the well known Newtonian equations we learned in high school.\n",
"First, each physical system has a process. For example, a car traveling at a certain velocity goes so far in a fixed amount of time, and its velocity varies as a function of its acceleration. We describe that behavior with the well known Newtonian equations we learned in high school.\n",
"\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"v&=at\\\\\n",
"x &= \\frac{1}{2}at_2 + v_0t + d_0\n",
"x &= \\frac{1}{2}at^2 + v_0t + d_0\n",
"\\end{aligned}\n",
"$$\n",
"\n",
@@ -342,9 +449,9 @@
"\n",
"$$ f(\\mathbf{x}) = \\mathbf{Fx} + \\mathbf{Bu} + \\mathbf{w}$$\n",
"\n",
"And that's it. That is the equation that Kalman set out to solve, and he found a way to compute an optimal solution if we assume certain properties of $w$.\n",
"And that's it. That is one of the equations that Kalman set out to solve, and he found a way to compute an optimal solution if we assume certain properties of $w$.\n",
"\n",
"However, we took advantage of something I left mostly unstated in the last chapter. We were able to provide a definition for $\\mathbf{F}$ because we were able to take advantage of the exact solution that Newtonian equations provide us. However, if you have an engineering background you will realize what a small class of problems that covers.If you don't, I will explain it next, and provide you with several ways to compute $\\mathbf{F}$ for arbitrary systems."
"However, we took advantage of something I left mostly unstated in the last chapter. We were able to provide a definition for $\\mathbf{F}$ because we were able to take advantage of the exact solution that Newtonian equations provide us. However, if you have an engineering background you will realize what a small class of problems that covers. If you don't, I will explain it next, and provide you with several ways to compute $\\mathbf{F}$ for arbitrary systems."
]
},
{
@@ -359,9 +466,27 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Modelling dynamic systems is properly the topic of at least one undergraduate course in mathmatics; I took several. However, I can present enough of the theory to allow us to create the system equations for many different Kalman filters, and give you enough background to at least follow the mathematics in the literature. My goal is to get you to the stage where you can read Brown, Zarchan, Bar-Shalom, or any other book and understand them well enough to implement the algorithms. Even without reading those books, you should be able to design many common forms of Filters by following the examples. \n",
"Modelling dynamic systems is properly the topic of at least one undergraduate course in mathematics. To an extent there is no substitute for a few semesters of ordinary and partial differential equations. If you are a hobbyist, or trying to solve one very specific filtering problem at work you probably do not have the time and/or inclination to devote a year or more to that education.\n",
"\n",
"We model dynamic systems with a set of differential equations. For example, we already presented the Newtonian equation\n",
"However, I can present enough of the theory to allow us to create the system equations for many different Kalman filters, and give you enough background to at least follow the mathematics in the literature. My goal is to get you to the stage where you can read a Kalman filtering book or paper and understand it well enough to implement the algorithms. The background math is deep, but we end up using a few simple techniques over and over again in practice. \n",
"\n",
"Let's lay out the problem and discuss what the solution will be. We model *dynamic systems* with a set of first order *differential equations*. This should not be a surprise as calculus is the math of things that vary. For example, we say that velocity is the derivative of distance with respect to time\n",
"\n",
"$$\\mathbf{v}= \\frac{d \\mathbf{x}}{d t} = \\dot{\\mathbf{x}}$$\n",
"\n",
"where $\\dot{\\mathbf{x}}$ is the notation for the derivative of $\\mathbf{x}$ with respect to $t$.\n",
"\n",
"\n",
"We need to use these equations for the predict step of the Kalman filter. Given the state of the system at time $t$ we want to predict its state at time $t + \\Delta t$. The Kalman filter matrices do not accept differential equations, so we need a mathematical technique that will find the solution to those equations at each time step. In general it is extremely difficult to find analytic solutions to systems of differential equations, so we will normally use *numerical* techniques to find accurate approximations for these equations."
]
},
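As a tiny preview of what such a numerical prediction looks like, here is Euler integration applied to $\dot{x} = v$; for this constant-velocity case the approximation happens to be exact, and the starting values are arbitrary:

```python
# Euler integration of dx/dt = v for a constant-velocity target:
# repeatedly advance the state by its rate of change times a small step.
x, v, dt = 7.0, 2.0, 0.01
for _ in range(100):      # integrate from t = 0 to t = 1
    x = x + v * dt        # one Euler step
```

Compare with the exact solution $x = x_0 + vt = 7 + 2 \cdot 1 = 9$; for constant velocity the numerical and analytic answers agree.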
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why This is Hard\n",
"\n",
"We model dynamic systems with a set of first order differential equations. For example, we already presented the Newtonian equation\n",
"\n",
"$$\\mathbf{v}=\\dot{\\mathbf{x}}$$\n",
"\n",
@@ -374,14 +499,13 @@
"if the behaviour of the system depends on time. However, if the system is *time invariant* the equations are of the form\n",
"$$ f(x) = \\dot{x}$$\n",
"\n",
"What does *time invariant* mean? Consider a home stereo. If you input a signal $x$ into it at time $t$, it will output some signal $f(x)$ a moment later. If you instead make the input at a later time the output signal will still be exactly the same, just shifted in time. This is different from, say, an aircraft. If you make a control input to the aircraft at a later time it's behavior will be different because it will have burned additonal fuel (and thus lost weight), drag may be different due to being at a different altitude, and so on.\n",
"What does *time invariant* mean? Consider a home stereo. If you input a signal $x$ into it at time $t$, it will output some signal $f(x)$. If you instead make the input at a later time $t + \\Delta t$ the output signal will still be exactly the same, just shifted in time. This is different from, say, an aircraft. If you make a control input to the aircraft at a later time it's behavior will be different because it will have burned additonal fuel (and thus lost weight), drag may be different due to being at a different altitude, and so on.\n",
"\n",
"We can solve these equations by integrating each side. The time variant equation is very straightforward. We essentially solved this problem with the Newtonian equations above, but let's be explicit and write it out. Starting with $$\\dot{\\mathbf{x}}=\\mathbf{v}$$ we get the expected\n",
"\n",
"$$ \\int \\dot{\\mathbf{x}}\\mathrm{d}t = \\int \\mathbf{v} \\mathrm{d}t\\\\\n",
"x = vt + x_0$$\n",
"\n",
"\n",
"However, integrating the time invariant equation is not so straightforward. \n",
"\n",
"$$ \\dot{x} = f(x) \\\\\n",
@@ -405,7 +529,7 @@
"\n",
"In other words, we need to find the inverse of $F$. This is not at all trivial, and a significant amount of course work in a STEM education is devoted to finding tricky, analytic solutions to this problem, backed by several centuries of research. \n",
"\n",
"In the end, however, they are tricks, and many simple forms of $f(x)$ either have no closed form solution, or pose extreme difficulties. Instead, the practicing engineer turns to numerical methods to find a solution to her problems. I would suggest that students would be better served by learning fewer analytic mathematical tricks and instead focusing on learning numerical methods."
"In the end, however, they are tricks, and many simple forms of $f(x)$ either have no closed form solution, or pose extreme difficulties. Instead, the practicing engineer turns to numerical methods to find a solution to her problems. I would suggest that students would be better served by learning fewer analytic mathematical tricks and instead focusing on learning numerical methods, but that is the topic for another book."
]
},
{
@@ -420,7 +544,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So let me leap over quite a bit of mathematics and present the typical numerical techniques used in Kalman filter design. \n",
"If you already have the mathematical training in solving partial differential equations you may be able to put it to use; I am not assuming that sort of background. So let me skip over quite a bit of mathematics and present the typical numerical techniques used in Kalman filter design. \n",
"\n",
"First, we express the system equations in state-space form (i.e. using linear algebra equations) with\n",
"\n",
@@ -430,20 +554,19 @@
"\n",
"$$x(t) = \\Phi(t-t_0)x(t_0)$$\n",
"\n",
"In other words, we just want to compute the value of $x$ at time $t$ by multipying its previous value by some matrix $\\Phi$.\n",
"In other words, we just want to compute the value of $x$ at time $t$ by multiplying its previous value by some matrix $\\Phi$. This is not trivial to do because the original equations do not include time. \n",
"\n",
"Broadly speaking there are three ways to find $\\Phi$. *Linear Time Invarient Theory*, known as LTI System Theory, gives us a way to find $\\Phi$ using the inverse Laplace transform. You are either nodding your head now, or completely lost. Don't worry, I will not be using the Lapace transform in this book except in this paragraph, as the computation is quite difficult to perform in practice, and there are better techniques we can avail ourselves to. LTI system theory tells us that \n",
"\n",
"$$ \\Phi(t) = \\mathcal{L}^{-1}[(s\\mathbf{I} - \\mathbf{F})^{-1}]$$\n",
"\n",
"I have no intention of going into this other than to say that the inverse Lapace transform converts a signal into the frequency (time) domain, but finding a solution to the equation above is non-trivial. If you are interested, the Wikipedia article on LTI system theory provides an introduction [1].\n",
"\n",
"\n",
"The second technique is to use a Taylor-series expansion: \n",
"Broadly speaking there are three ways to find $\\Phi$. The technique most often used with Kalman filters is to use a Taylor-series expansion: \n",
"\n",
"$$ \\Phi(t) = e^{\\mathbf{F}t} = \\mathbf{I} + \\mathbf{F}t + \\frac{(\\mathbf{F}t)^2}{2!} + \\frac{(\\mathbf{F}t)^3}{3!} + ... $$\n",
"\n",
"This is much easier to compute, and is the typical approach used in Kalman filter design when the filter is reasonably small. If you are wondering where $e$ came from, I again point to wikipedia - this time the matrix exponential article [2]. Here the important point is to recognize the very simple and regular form this equation takes. We will put this form into use in the next chapter, so I will not belabor its use here. \n",
"This is much easier to compute, and thus is the typical approach used in Kalman filter design when the filter is reasonably small. If you are wondering where $e$ came from, I point you to the Wikipedia article on the matrix exponential [1]. Here the important point is to recognize the very simple and regular form this equation takes. We will put this form into use in the next chapter, so I will not belabor its use here. \n",
"\n",
"*Linear Time Invariant Theory*, also known as LTI System Theory, gives us a way to find $\\Phi$ using the inverse Laplace transform. You are either nodding your head now, or completely lost. Don't worry, I will not be using the Laplace transform in this book except in this paragraph, as the computation is quite difficult to perform in practice. LTI system theory tells us that \n",
"\n",
"$$ \\Phi(t) = \\mathcal{L}^{-1}[(s\\mathbf{I} - \\mathbf{F})^{-1}]$$\n",
"\n",
"I have no intention of going into this other than to say that the inverse Laplace transform converts a function in the frequency domain back into a function of time, but finding a solution to the equation above is non-trivial. If you are interested, the Wikipedia article on LTI system theory provides an introduction [2].\n",
"\n",
"Finally, there are numerical techniques to find $\\Phi$. As filters get larger finding analytical solutions becomes very tedious (though packages like Sympy make it easier). C. F. van Loan [3] has developed a technique that finds both $\\Phi$ and $Q$ numerically.\n",
"\n",
@@ -523,6 +646,45 @@
],
"prompt_number": 2
},
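The Taylor-series form of $\Phi$ is easy to try out. For Newton's constant-velocity system the state is (position, velocity) and $\mathbf{F} = \begin{bmatrix}0&1\\0&0\end{bmatrix}$; since $\mathbf{F}^2 = \mathbf{0}$ the series terminates after two terms, giving exactly the familiar transition matrix $\begin{bmatrix}1&\Delta t\\0&1\end{bmatrix}$. A minimal sketch (the time step is arbitrary):

```python
import numpy as np

def phi_taylor(F, dt, terms=10):
    # Approximate Phi = exp(F*dt) with a truncated Taylor series:
    # I + F*dt + (F*dt)^2/2! + (F*dt)^3/3! + ...
    Ft = F * dt
    phi = np.eye(F.shape[0])
    term = np.eye(F.shape[0])
    for k in range(1, terms):
        term = term @ Ft / k      # build (F*dt)^k / k! incrementally
        phi = phi + term
    return phi

F = np.array([[0.0, 1.0],
              [0.0, 0.0]])        # constant velocity model: x' = v, v' = 0
phi = phi_taylor(F, dt=0.5)
```

Because this $\mathbf{F}$ is nilpotent the truncation introduces no error at all; for general matrices you would keep enough terms for the desired accuracy (or use a library routine for the matrix exponential).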
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Forming First Order Equations from Higher Order Equations\n",
"\n",
"In the sections above I spoke of *first order* differential equations; these are equations with only first derivatives. However, physical systems often require second or higher order equations. Any higher order system of equations can be converted to a first order set of equations by defining extra variables for the lower order terms and then solving. Let's do an example. \n",
"\n",
"Given the system $\\ddot{x} - 6\\dot{x} + 9x = t$ find the first order equations.\n",
"\n",
"\n",
"The first step is to isolate the highest order term onto one side of the equation.\n",
"\n",
"$$\\ddot{x} = 6\\dot{x} - 9x + t$$\n",
"\n",
"We define two new variables:\n",
"\n",
"$$ x_1(t) = x \\\\\n",
"x_2(t) = \\dot{x}\n",
"$$\n",
"\n",
"Now we will substitute these into the original equation and solve, giving us a set of first order equations in terms of these new variables.\n",
"\n",
"First, we know that $\\dot{x}_1 = x_2$ and that $\\dot{x}_2 = \\ddot{x}$. Therefore\n",
"\n",
"$$\\begin{aligned}\n",
"\\dot{x}_2 &= \\ddot{x} \\\\\n",
" &= 6\\dot{x} - 9x + t\\\\\n",
" &= 6x_2-9x_1 + t\n",
"\\end{aligned}$$\n",
"\n",
"Therefore our first order system of equations is\n",
"\n",
"$$\\begin{aligned}\\dot{x}_1 &= x_2 \\\\\n",
"\\dot{x}_2 &= 6x_2-9x_1 + t\\end{aligned}$$\n",
"\n",
"If you practice this a bit you will become adept at it. Just isolate the highest term, define a new variable and its derivatives, and then substitute."
]
},
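We can sanity-check the conversion numerically. With the (hypothetically chosen) initial conditions $x(0) = 2/27$, $\dot{x}(0) = 1/9$ the solution of $\ddot{x} - 6\dot{x} + 9x = t$ reduces to the particular solution $x(t) = t/9 + 2/27$, so integrating the first order system should reproduce it:

```python
# Euler-integrate the first order system derived above:
#   x1' = x2
#   x2' = 6*x2 - 9*x1 + t
x1, x2 = 2.0 / 27.0, 1.0 / 9.0   # initial conditions chosen to kill the homogeneous part
dt = 0.001
t = 0.0
for _ in range(1000):            # integrate out to t = 1
    x1, x2 = x1 + dt * x2, x2 + dt * (6.0 * x2 - 9.0 * x1 + t)
    t += dt
```

At $t = 1$ the analytic solution gives $x = 1/9 + 2/27 = 5/27$ and $\dot{x} = 1/9$, and the integrated first order system lands on the same values.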
{
"cell_type": "heading",
"level": 2,
@@ -553,7 +715,7 @@
"\\end{aligned}\n",
"$$\n",
"\n",
"I will start with the update step, as that is what we started with in the one dimensional Kalman filter case. Our first equation is\n",
"I will start with the update step, as that is what we started with in the one dimensional Kalman filter case. The first equation is\n",
"\n",
"$$\n",
"\\mathbf{y} = \\mathbf{z} - \\mathbf{H x}\\tag{3}\n",
@@ -599,25 +761,9 @@
"$$\n",
"Unfortunately it is a fair amount of linear algebra to derive this. The derivation can be quite elegant, and I urge you to look it up if you have the mathematical education to follow it. But $\\mathbf{K}$ is just the *Kalman gain* - the ratio of how much measurement vs prediction we should use to create the new estimate. $\\mathbf{R}$ is the *measurement noise*, and $\\mathbf{P}$ is our *uncertainty covariance matrix* from the prediction step.\n",
"\n",
"** author's note: the following aside probably belongs elsewhere in the book**\n",
"\n",
"> As an aside, most textbooks are more exact with the notation, in Gelb[1] for example, *Pk(+)* is used to denote the uncertainty covariance for the prediction step, and *Pk(-)* for the uncertainty covariance for the update step. Other texts use subscripts with 'k|k-1', superscipt $^-$, and many other variations. As a programmer I find all of that fairly unreadable; I am used to thinking about variables changing state as a program runs, and do not use a different variable name for each new computation. There is no agreed upon format, so each author makes different choices. I find it challenging to switch quickly between books an papers, and so have adopted this admittedly less precise notation. Mathematicians will write scathing emails to me, but I hope the programmers and students will rejoice.\n",
"\n",
"> If you are a programmer trying to understand a paper's math equations, I strongly recommend just removing all of the superscripts, subscripts, and diacriticals, replacing them with a single letter. If you work with equations like this every day this is superflous advice, but when I read I am usually trying to understand the flow of computation. To me it is far more understandable to remember that $P$ in this step represents the updated value of $P$ computed in the last step, as opposed to trying to remember what $P_{k-1}(+)$ denotes, and what its relation to $P_k(-)$ is, if any.\n",
"\n",
"> For example, for the equation of $\\mathbf{S}$ above, Wikipedia uses\n",
"\n",
"> $$\\textbf{S}_k = \\textbf{H}_k \\textbf{P}_{k\\mid k-1} \\textbf{H}_k^\\mathsf{T} + \\textbf{R}_k\n",
"$$\n",
"\n",
"> Is that more exact? Absolutely. Is it easier or harder to read? You'll need to answer that for yourself.\n",
"\n",
"> For reference, the Appendix **Symbols and Notations** lists the symbology used by the major authors in the field.\n",
"\n",
"\n",
"So let's work through this expression by expression. Start with $\\mathbf{HPH}^\\mathsf{T}$. The linear equation $\\mathbf{ABA}^T$ can be thought of as changing the basis of $\\mathbf{B}$ to $\\mathbf{A}$. So $\\mathbf{HPH}^\\mathsf{T}$ is taking the covariance $\\mathbf{P}$ and putting it in measurement ($\\mathbf{H}$) space. \n",
"\n",
"In English, consider the problem of reading a temperature with a thermometer that provices readings in volts. Our state is in terms of temperature, but we are now doing calculations in *measurement space* - volts. So we need to convert $\\mathbf{P}$ from applying to temperatures to volts. The linear algebra form $\\textbf{H}\\textbf{P}\\textbf{H}$ takes $\\mathbf{P}$ to the basis used by $\\mathbf{H}$, namely volts. \n",
"In English, consider the problem of reading a temperature with a thermometer that provides readings in volts. Our state is in terms of temperature, but we are now doing calculations in *measurement space* - volts. So we need to convert $\\mathbf{P}$ from applying to temperatures to volts. The linear algebra form $\\textbf{H}\\textbf{P}\\textbf{H}^\\mathsf{T}$ takes $\\mathbf{P}$ to the basis used by $\\mathbf{H}$, namely volts. \n",
"\n",
"Then, once in measurement space, we can add the measurement noise $\\mathbf{R}$ to it. Hence, the expression for the uncertainty once we include the measurement is:\n",
"\n",
@@ -640,13 +786,13 @@
"\n",
"\n",
"$$\n",
"\\textbf{K} = \\frac{uncertainty_{prediction}}{uncertainty_{measurement}}\n",
"\\textbf{K} = \\frac{uncertainty_{prediction}}{uncertainty_{measurement}}\\textbf{H}^\\mathsf{T}\n",
"$$\n",
"\n",
"\n",
"In other words, the *Kalman gain* equation is doing nothing more than computing a ratio based on how much we trust the prediction vs the measurement. If we are confident in our measurements and unconfident in our predictions $\\mathbf{K}$ will favor the measurement, and vice versa. The equation is complicated because we are doing this in multiple dimensions via matrices, but the concept is simple - scale by a ratio.\n",
"\n",
"Without going into the derivation of $\\mathbf{K}$, I'll say that this equation is the result of finding a value of $\\mathbf{K}$ that optimizes the *mean-square estimation error*. It does this by finding the minimal values for $\\mathbf{P}$ along it's diagonal. Recall that the diagonal of $\\mathbf{P}$ is just the variance for each state variable. So, this equation for $\\mathbf{K}$ ensures that the Kalman filter output is optimal. To put this in concrete terms, for our dog tracking problem this means that the estimates for both position and velocity will be optimal - a value of $\\mathbf{K}$ that made the position extremely accurate but the velocity very inaccurate would be rejected in favor of a $\\mathbf{K}$ that made both position and velocity just somewhat accurate."
"Without going into the derivation of $\\mathbf{K}$, I'll say that this equation is the result of finding a value of $\\mathbf{K}$ that optimizes the *mean-square estimation error*. It does this by finding the minimal values for $\\mathbf{P}$ along its diagonal. Recall that the diagonal of $\\mathbf{P}$ is just the variance for each state variable. So, this equation for $\\mathbf{K}$ ensures that the Kalman filter output is optimal. To put this in concrete terms, for our dog tracking problem this means that the estimates for both position and velocity will be optimal - a value of $\\mathbf{K}$ that made the position extremely accurate but the velocity very inaccurate would be rejected in favor of a $\\mathbf{K}$ that made both position and velocity just somewhat accurate."
]
},
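In the scalar case ($\mathbf{H} = 1$) the gain really is just this ratio of uncertainties, which makes a short sketch possible (the numbers are arbitrary):

```python
def scalar_update(x, P, z, R):
    # Scalar Kalman update with H = 1: the gain is the prediction uncertainty
    # divided by the total (prediction + measurement) uncertainty.
    K = P / (P + R)
    x = x + K * (z - x)       # blend prediction and measurement by the ratio K
    P = (1 - K) * P           # uncertainty shrinks after incorporating a measurement
    return x, P, K

# very uncertain prediction (P=4) vs an accurate measurement (R=1)
x, P, K = scalar_update(10.0, 4.0, 12.0, 1.0)
```

With $P$ much larger than $R$ the gain is near 1 and the estimate moves most of the way toward the measurement; swap the uncertainties and it would barely move at all.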
{
@@ -1172,6 +1318,27 @@
"while not correct, is often a useful approximation. If you do this you will have to perform quite a few studies to guarantee that your filter works in a variety of situations. Given the availability of functions to compute the correct values of $\\mathbf{Q}$ for you I would strongly recommend not using approximations. Perhaps it is justified for quick-and-dirty filters, or on embedded devices where you need to wring out every last bit of performance, and seek to minimize the number of matrix operations required. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"** author's note: the following aside probably belongs elsewhere in the book**\n",
"\n",
"> As an aside, most textbooks are more exact with the notation, in Gelb[1] for example, *Pk(-)* is used to denote the uncertainty covariance for the prediction step, and *Pk(+)* for the uncertainty covariance for the update step. Other texts use subscripts with 'k|k-1', superscript $^-$, and many other variations. As a programmer I find all of that fairly unreadable; I am used to thinking about variables changing state as a program runs, and do not use a different variable name for each new computation. There is no agreed upon format, so each author makes different choices. I find it challenging to switch quickly between books and papers, and so have adopted this admittedly less precise notation. Mathematicians will write scathing emails to me, but I hope the programmers and students will rejoice.\n",
"\n",
"> If you are a programmer trying to understand a paper's math equations, I strongly recommend just removing all of the superscripts, subscripts, and diacriticals, replacing them with a single letter. If you work with equations like this every day this is superflous advice, but when I read I am usually trying to understand the flow of computation. To me it is far more understandable to remember that $P$ in this step represents the updated value of $P$ computed in the last step, as opposed to trying to remember what $P_{k-1}(+)$ denotes, and what its relation to $P_k(-)$ is, if any.\n",
"\n",
"> For example, for the equation of $\\mathbf{S}$ above, Wikipedia uses\n",
"\n",
"> $$\\textbf{S}_k = \\textbf{H}_k \\textbf{P}_{k\\mid k-1} \\textbf{H}_k^\\mathsf{T} + \\textbf{R}_k\n",
"$$\n",
"\n",
"> Is that more exact? Absolutely. Is it easier or harder to read? You'll need to answer that for yourself.\n",
"\n",
"> For reference, the Appendix **Symbols and Notations** lists the symbology used by the major authors in the field."
]
},
{
"cell_type": "heading",
"level": 2,
@@ -1184,9 +1351,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
" * [1] *LTI System Theory* http://en.wikipedia.org/wiki/LTI_system_theory\n",
" \n",
" * [2] *Matrix Exponential* http://en.wikipedia.org/wiki/Matrix_exponential \n",
" * [1] *Matrix Exponential* http://en.wikipedia.org/wiki/Matrix_exponential \n",
"\n",
" * [2] *LTI System Theory* http://en.wikipedia.org/wiki/LTI_system_theory\n",
" \n",
" * [3] C.F. van Loan, \"Computing Integrals Involving the Matrix Exponential,\" IEEE Transactions Automatic Control, June 1978."
]

View File

@@ -320,12 +320,18 @@
],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"The Algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Algorithm\n",
"\n",
"As I already stated, when the filter is initialized a large number of sigma points are drawn from the initial state ($\\mathbf{x}$) and covariance ($\\mathbf{P}$). From there the algorithm proceeds very similarly to the UKF. During the prediction step the sigma points are passed through the state transition function, and then perturbed by adding a bit of noise to account for the process noise. During the update step the sigma points are translated into measurement space by passing them through the measurement function, they are perturbed by a small amount to account for the measurement noise. The Kalman gain is computed from the \n",
"\n",
"We already mentioned the main difference between the UKF and EnKF - the UKF choses the sigma points deterministically. There is another difference, implied by the algorithm above. With the UKF we generate new sigma points during each predict step, and after passing the points through the nonlinear function we reconstitute them into a mean and covariance by using the *unscented transform*. The EnKF just keeps propagating the originally created sigma points; we only need to compute a mean and covariance as outputs for the filter! \n",
@@ -491,12 +497,18 @@
" P = self.P - dot3(K, P_zz, K.T)"
]
},
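As a rough sketch of the cycle just described, here is one predict/update step of a stochastic EnKF on a linear problem. This is not the FilterPy implementation; the ensemble size and noise values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 500                       # ensemble size (assumed)
dt = 1.0
F = np.array([[1.0, dt],
              [0.0, 1.0]])    # state transition (constant velocity)
H = np.array([[1.0, 0.0]])    # we measure position only
Q = np.diag([0.02, 0.02])     # process noise (assumed values)
R = np.array([[1.0]])         # measurement noise

# draw the initial ensemble of sigma points from x and P
x0 = np.array([0.0, 1.0])
P0 = np.eye(2) * 5.0
sigmas = rng.multivariate_normal(x0, P0, size=N)

def predict(sigmas):
    # pass each point through the state transition, then perturb with process noise
    return sigmas @ F.T + rng.multivariate_normal(np.zeros(2), Q, size=N)

def update(sigmas, z):
    # translate the points into measurement space, perturbed with measurement noise
    sigmas_h = sigmas @ H.T + rng.multivariate_normal(np.zeros(1), R, size=N)
    x_mean = sigmas.mean(axis=0)
    z_mean = sigmas_h.mean(axis=0)
    # Kalman gain from the sample cross covariance and measurement covariance
    P_xz = (sigmas - x_mean).T @ (sigmas_h - z_mean) / (N - 1)
    P_zz = (sigmas_h - z_mean).T @ (sigmas_h - z_mean) / (N - 1)
    K = P_xz @ np.linalg.inv(P_zz)
    return sigmas + (z - sigmas_h) @ K.T

z = np.array([1.1])
sigmas = update(predict(sigmas), z)
x_est = sigmas.mean(axis=0)   # the filter output is just the ensemble mean
```

Note that the mean and covariance are only computed as *outputs*; the ensemble itself is what gets carried from step to step.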
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Implementation and Example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation and Example\n",
"\n",
"I have implemented an EnKF in the `FilterPy` library. It is in many ways a toy. Filtering with a large number of sigma points gives us very slow performance. Furthermore, there are many minor variations on the algorithm in the literature. I wrote this mostly because I was interested in learning a bit about the filter. I have not used it for a real world problem, and I can give no advice on using the filter for the large problems for which it is suited. Therefore I will refine my comments to implementing a very simple filter. I will use it to track an object in one dimension, and compare the output to a linear Kalman filter. This is a filter we have designed many times already in this book, so I will design it with little comment. Our state vector will be\n",
"\n",
"$$\\mathbf{x} = \\begin{bmatrix}x\\\\ \\dot{x}\\end{bmatrix}$$\n",
@@ -613,12 +625,18 @@
"It can be a bit difficult to see, but the KF and EnKF start off slightly different, but soon converge to producing nearly the same values. The EnKF is a suboptimal filter, so it will not produce the optimal solution that the KF produces. However, I deliberately chose $N$ to be quite small (20) to guarantee that the EnKF output is quite suboptimal. If I chose a more reasonable number such as 2000 you would be unable to see the difference between the two filter outputs on this graph."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Outstanding Questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outstanding Questions\n",
"\n",
"All of this should be considered as *my* questions, not lingering questions in the literature. However, I am copying equations directly from well known sources in the field, and they do not address the discrepencies.\n",
"\n",
"First, in Brown [2] we have all sums multipied by $\\frac{1}{N}$, as in \n",
@@ -654,9 +672,21 @@
"\n",
"To me the Crassidis equation seems logical, and it produces a filter that performs like the linear KF for linear problems, so that is the formulation that I have chosen. \n",
"\n",
"I am not comfortable saying either book is wrong; it is quite possible that I missed some point that makes each set of equations work. I can say that when I implemented them as written I did not get a filter that worked. I define \"work\" as performs essentially the same as the linear KF for linear problems. Between reading implementation notes on the web and reasoning about various issues I have chosen the implementation in this chapter, which does in fact seem to work correctly. I have yet to explore the significant amount of original literature that will likely definitively explain the discrepencies. I would like to leave this here in some form even if I do find an explanation that reconciles the various differences, as if I got confused by these books than probably others will as well.\n",
"\n",
"## References\n",
"I am not comfortable saying either book is wrong; it is quite possible that I missed some point that makes each set of equations work. I can say that when I implemented them as written I did not get a filter that worked. I define \"work\" as performs essentially the same as the linear KF for linear problems. Between reading implementation notes on the web and reasoning about various issues I have chosen the implementation in this chapter, which does in fact seem to work correctly. I have yet to explore the significant amount of original literature that will likely definitively explain the discrepencies. I would like to leave this here in some form even if I do find an explanation that reconciles the various differences, as if I got confused by these books than probably others will as well."
]
},
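On the $\frac{1}{N}$ versus $\frac{1}{N-1}$ question above, a quick numerical check shows how much the normalization matters for a small ensemble. For what it is worth, NumPy's sample covariance defaults to the unbiased $1/(N-1)$ form:

```python
import numpy as np

rng = np.random.default_rng(7)
samples = rng.normal(0.0, 2.0, size=20)   # a small ensemble, like N=20 in the text
d = samples - samples.mean()

var_n   = (d @ d) / len(samples)          # Brown-style 1/N normalization
var_nm1 = (d @ d) / (len(samples) - 1)    # Crassidis-style 1/(N-1)

# numpy's sample covariance defaults to the unbiased 1/(N-1) form
assert np.isclose(var_nm1, np.cov(samples))
print(var_n, var_nm1)
```

The gap between the two estimates shrinks as $N$ grows, which is presumably why both formulations appear in the literature without comment.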
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [1] Mackenzie, Dana. *Ensemble Kalman Filters Bring Weather Models Up to Date* Siam News, Volume 36, Number 8, October 2003. http://www.siam.org/pdf/news/362.pdf\n",
"\n",
"- [2] Brown, Robert Grover, and Patrick Y.C. Hwang. *Introduction to Random Signals and Applied Kalman Filtering, With MATLAB\u00ae excercises and solutions.* Wiley, 2012.\n",

View File

@@ -267,12 +267,18 @@
],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"It has probably struck you by now that the performance of the Kalman filter is not optimal when you consider future data. For example, suppose we are tracking an aircraft, and the latest measurement is far from the current track, like so (I'll only consider 1 dimension for simplicity):\n",
"\n",
" 10.1 10.2 9.8 10.1 10.2 10.3 10.1 9.9 10.2 10.0 9.9 12.4"
@@ -381,10 +387,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this type of smoothing requires knowing data from \"the future\", there are some applications for the Kalman filter where these observations are not helpful. For example, if we are using a Kalman filter as our navigation filter for an aircraft we have little interest in where we have been. While we could use a smoother to create a smooth history for the plane, we probably are not interested in it. However if we can afford a bit of latency some smoothers only require a few measurements into the future prode better results. And, of course any problem where we can batch collect the data and then run the Kalman filter on the data will be able to take maximum advantage of this type of algorithm.\n",
"\n",
"## Types of Smoothers\n",
"\n",
"Since this type of smoothing requires knowing data from \"the future\", there are some applications for the Kalman filter where these observations are not helpful. For example, if we are using a Kalman filter as our navigation filter for an aircraft we have little interest in where we have been. While we could use a smoother to create a smooth history for the plane, we probably are not interested in it. However if we can afford a bit of latency some smoothers only require a few measurements into the future prode better results. And, of course any problem where we can batch collect the data and then run the Kalman filter on the data will be able to take maximum advantage of this type of algorithm."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Types of Smoothers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are three broad classes of Kalman smoothers that produce better tracking in these situations.\n",
"\n",
"* Fixed Point Smoothing\n",
@@ -405,23 +422,51 @@
"\n",
"Fixed lag smoothing only requires you to store a window of data, and processing requirements are modest because only that window is processed for each new measurement. The drawback is that the filter's output always lags the input, and the smoothing is not as pronounced as is possible with fixed interval smoothing.\n",
"\n",
"Fixed interval smoothing produces the most smoothed output at the cost of having to be batch processed. Most algorithms use some sort of forwards/backwards algorithm that is only twice as slow as a recursive Kalamn filter. \n",
"\n",
"## Fixed Point Smoothing\n",
"\n",
"not done\n",
"\n",
"## Fixed Lag Smoothing\n",
"\n",
"not done"
"Fixed interval smoothing produces the most smoothed output at the cost of having to be batch processed. Most algorithms use some sort of forwards/backwards algorithm that is only twice as slow as a recursive Kalamn filter. "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Fixed Point Smoothing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"not done"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Fixed Lag Smoothing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"not done"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Fixed Interval Smoothing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fixed Interval Smoothing\n",
"\n",
"There are several fixed lag smoothers available in the literature. I have chosen to implement the smoother invented by Rauch, Tung, and Striebel because of its ease of implementation and efficiency of computation. This smoother is commonly known as an RTS smoother, and that is what we will call it\n",
"\n",
"Derivation of the RTS smoother runs to several pages of densely packed math, and to be honest I have never read it through. I'm certainly not going to inflict it on you. I don't think anyone but thesis writers really need to understand the derivation. Instead I will briefly present the algorithm, equations, and then move directly to implementation and demonstration of the smoother.\n",
@@ -548,15 +593,21 @@
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[1] H. Rauch, F. Tung, and C. Striebel. \"Maximum likelihood estimates of linear dynamic systems,\" *AIAA Journal*, **3**(8), pp. 1445-1450 (August 1965).\n",
"\n",
"http://arc.aiaa.org/doi/abs/10.2514/3.3166\n"
"http://arc.aiaa.org/doi/abs/10.2514/3.3166"
]
}
],

View File

@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:5ca766a2f349f8f9ea63fc2504c164a60dfe8bddd92c84cf7387db12cd8f1c4a"
"signature": "sha256:1fb959fd02dabdaeb6a9cb01639ad09df217d9b45cc7ce9a1f51d00310274c26"
},
"nbformat": 3,
"nbformat_minor": 0,
@@ -282,10 +282,11 @@
]
},
{
"cell_type": "markdown",
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"## SymPy"
"SymPy"
]
},
{