Added links to source material.

Linked (mostly to Wikipedia) to web material for terms that are newly defined.
2016-01-27 21:19:16 -08:00 · 2016-01-27 21:19:16 -08:00 · e5fa61bbc5
commit e5fa61bbc5
parent aae7bbb368
4 changed files with 133 additions and 282 deletions
--- a/01-g-h-filter.ipynb
+++ b/01-g-h-filter.ipynb
@ -346,7 +346,7 @@
    "\n",
    "The first two choices are plausible, but we have no reason to favor one scale over the other. Why would we choose to believe A instead of B? We have no reason for such a belief. The third and fourth choices are irrational. The scales are admittedly not very accurate, but there is no reason at all to choose a number outside of the range of what they both measured. The final choice is the only reasonable one. If both scales are inaccurate, and as likely to give a result above my actual weight as below it, more often than not probably the answer is somewhere between A and B. \n",
    "\n",
-    "In mathematics this concept is formalized as *expected value*, and we will cover it in depth later. For now ask yourself what would be the 'usual' thing to happen if we took one million readings. Some of the times both scales will read too low, sometimes both will read too high, and the rest of the time they will straddle the actual weight. If they straddle the actual weight then certainly we should choose a number between A and B. If they don't straddle then we don't know if they are both too high or low, but by choosing a number between A and B we at least mitigate the effect of the worst measurement. For example, suppose our actual weight is 180 lbs. 160 lbs is a big error. But if we choose a weight between 160 lbs and 170 lbs our estimate will be better than 160 lbs. The same argument holds if both scales returned a value greater than the actual weight.\n",
+    "In mathematics this concept is formalized as [*expected value*](https://en.wikipedia.org/wiki/Expected_value), and we will cover it in depth later. For now ask yourself what would be the 'usual' thing to happen if we took one million readings. Some of the times both scales will read too low, sometimes both will read too high, and the rest of the time they will straddle the actual weight. If they straddle the actual weight then certainly we should choose a number between A and B. If they don't straddle then we don't know if they are both too high or low, but by choosing a number between A and B we at least mitigate the effect of the worst measurement. For example, suppose our actual weight is 180 lbs. 160 lbs is a big error. But if we choose a weight between 160 lbs and 170 lbs our estimate will be better than 160 lbs. The same argument holds if both scales returned a value greater than the actual weight.\n",
    "\n",
    "We will deal with this more formally later, but for now I hope it is clear that our best estimate is the average of A and B. $\\frac{160+170}{2} = 165$.\n",
    "\n",
@ -984,7 +984,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "This algorithm is known as the g-h filter. $g$ and $h$ refer to the two scaling factors that we used in our example. $g$ is the scaling we used for the measurement (weight in our example), and $h$ is the scaling for the change in measurement over time (lbs/day in our example)."
+    "This algorithm is known as the [g-h filter](https://en.wikipedia.org/wiki/Alpha_beta_filter). $g$ and $h$ refer to the two scaling factors that we used in our example. $g$ is the scaling we used for the measurement (weight in our example), and $h$ is the scaling for the change in measurement over time (lbs/day in our example)."
   ]
  },
  {
@ -2506,7 +2506,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "FilterPy is an open source filtering library that I wrote. It has all of the filters in this book, along with others. It is rather easy to program your own g-h filter, but as we progress we will rely on FilterPy more. As a quick introduction, let's look at the g-h filter in FilterPy.\n",
+    "[FilterPy](https://github.com/rlabbe/filterpy) is an open source filtering library that I wrote. It has all of the filters in this book, along with others. It is rather easy to program your own g-h filter, but as we progress we will rely on FilterPy more. As a quick introduction, let's look at the g-h filter in FilterPy.\n",
    "\n",
    "If you do not have FilterPy installed just issue the following command from the command line.\n",
    "\n",
--- a/02-Discrete-Bayes.ipynb
+++ b/02-Discrete-Bayes.ipynb
@ -323,13 +323,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Tracking a Dog"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Tracking a Dog\n",
+    "\n",
    "Let's begin with a simple problem. We have a dog friendly workspace, and so people bring their dogs to work. Occasionally the dogs wander out of offices and down the halls. We want to be able to track them. So during a hackathon somebody invented a sonar sensor to attach to the dog's collar. It emits a signal, listens for the echo, and based on how quickly an echo comes back we can tell whether the dog is in front of an open doorway or not. It also senses when the dog walks, and reports in which direction the dog has moved. It connects to the network via wifi and sends an update once a second.\n",
    "\n",
    "I want to track my dog Simon, so I attach the device to his collar and then fire up Python, ready to write code to track him through the building. At first blush this may appear impossible. If I start listening to the sensor of Simon's collar I might read **door**, **hall**, **hall**, and so on. How can I use that information to determine where Simon is?\n",
@ -366,9 +361,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "In Bayesian statistics this is called a *prior*. It is the probability prior to incorporating measurements or other information. More completely, this is called the *prior probability distribution*. A *probability distribution* is a collection of all possible probabilities for an event. Probability distributions always to sum to 1 because something had to happen; the distribution lists all possible events and the probability of each.\n",
+    "In [Bayesian statistics](https://en.wikipedia.org/wiki/Bayesian_probability) this is called a [*prior*](https://en.wikipedia.org/wiki/Prior_probability). It is the probability prior to incorporating measurements or other information. More completely, this is called the *prior probability distribution*. A [*probability distribution*](https://en.wikipedia.org/wiki/Probability_distribution) is a collection of all possible probabilities for an event. Probability distributions always to sum to 1 because something had to happen; the distribution lists all possible events and the probability of each.\n",
    "\n",
-    "I'm sure you've used probabilities before - as in \"the probability of rain today is 30%\". The last paragraph sounds like more of that. But Bayesian statistics was a revolution in probability because it treats probability as a belief about a single event. Let's take an example. I know that if I flip a fair coin infinitely many times I will get 50% heads and 50% tails. This is called *frequentist statistics* to distinguish it from Bayesian statistics. Computations are based on the frequency in which events occur.\n",
+    "I'm sure you've used probabilities before - as in \"the probability of rain today is 30%\". The last paragraph sounds like more of that. But Bayesian statistics was a revolution in probability because it treats probability as a belief about a single event. Let's take an example. I know that if I flip a fair coin infinitely many times I will get 50% heads and 50% tails. This is called [*frequentist statistics*](https://en.wikipedia.org/wiki/Frequentist_inference) to distinguish it from Bayesian statistics. Computations are based on the frequency in which events occur.\n",
    "\n",
    "I flip the coin one more time and let it land. Which way do I believe it landed? Frequentist probability has nothing to say about that; it will merely state that 50% of coin flips land as heads. In some ways it is meaningless to to assign a probability to the current state of the coin. It is either heads or tails, we just don't know which. Bayes treats this as a belief about a single event - the strength of my belief or knowledge that this specific coin flip is heads is 50%.\n",
    "\n",
@ -429,14 +424,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "This distribution is called a *categorical distribution*, which is a discrete distribution describing the probability of observing $n$ outcomes. It is a *multimodal distribution* because we have multiple beliefs about the position of our dog. Of course we are not saying that we think he is simultaneously in three different locations, merely that we have narrowed down our knowledge to one of these three locations. My (Bayesian) belief is that there is a 33.3% chance of being at door 0, 33.3% at door 1, and a 33.3% chance of being at door 8.\n",
+    "This distribution is called a [*categorical distribution*](https://en.wikipedia.org/wiki/Categorical_distribution), which is a discrete distribution describing the probability of observing $n$ outcomes. It is a [*multimodal distribution*](https://en.wikipedia.org/wiki/Multimodal_distribution) because we have multiple beliefs about the position of our dog. Of course we are not saying that we think he is simultaneously in three different locations, merely that we have narrowed down our knowledge to one of these three locations. My (Bayesian) belief is that there is a 33.3% chance of being at door 0, 33.3% at door 1, and a 33.3% chance of being at door 8.\n",
    "\n",
    "This is an improvement in two ways. I've rejected a number of hallway positions as impossible, and the strength of my belief in the remaining positions has increased from 10% to 33%. This will always happen. As our knowledge improves the probabilities will get closer to 100%.\n",
    "\n",
-    "A few words about the *mode* of a distribution. Given a set of numbers, such as {1, 2, 2, 2, 3, 3, 4}, the *mode* is the number that occurs most often. For this set the mode is 2. A set can contain more than one mode. The set {1, 2, 2, 2, 3, 3, 4, 4, 4} contains the modes 2 and 4, because both occur three times. We say the former set is *unimodal*, and the latter is *multimodal*.\n",
-    "\n",
-    "Another term used for this distribution is a *histogram*. Histograms graphically depict the distribution of a set of numbers. The bar chart above is a histogram.\n",
+    "A few words about the [*mode*](https://en.wikipedia.org/wiki/Mode_(statistics&#41;)\n",
+    "of a distribution. Given a set of numbers, such as {1, 2, 2, 2, 3, 3, 4}, the *mode* is the number that occurs most often. For this set the mode is 2. A set can contain more than one mode. The set {1, 2, 2, 2, 3, 3, 4, 4, 4} contains the modes 2 and 4, because both occur three times. We say the former set is [*unimodal*](https://en.wikipedia.org/wiki/Unimodality), and the latter is *multimodal*.\n",
    "\n",
+    "Another term used for this distribution is a [*histogram*](https://en.wikipedia.org/wiki/Histogram). Histograms graphically depict the distribution of a set of numbers. The bar chart above is a histogram.\n",
    "\n",
    "I hand coded the `belief` array in the code above. How would we implement this in code? We represent doors with 1, and walls as 0, so we will multiply the hallway variable by the percentage, like so;"
   ]
@ -465,13 +460,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Extracting Information from Sensor Readings"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Extracting Information from Sensor Readings\n",
+    "\n",
    "Let's put Python aside and think about the problem a bit. Suppose we were to read the following from Simon's sensor:\n",
    "\n",
    "  * door\n",
@ -528,7 +518,7 @@
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
-    "scrolled": false
+    "scrolled": true
   },
   "outputs": [
    {
@ -693,7 +683,7 @@
   "source": [
    " We can see from the output that the sum is now 1.0, and that the probability of a door vs wall is still three times larger. The result also fits our intuition that the probability of a door must be less than 0.333, and that the probability of a wall must be greater than 0.0. Finally, it should fit our intuition that we have not yet been given any information that would allow us to distinguish between any given door or wall position, so all door positions should have the same value, and the same should be true for wall positions.\n",
    " \n",
-    "This result is called the *posterior*, which is short for *posterior probability distribution*. All this means is a probability distribution *after* incorporating the measurement information (posterior means 'after' in this context). To review, the *prior* is the probability distribution before including the measurement's information. Another term is the *likelihood* - this is the probability for the *evidence* (the measurement in this case) being true. That gives us this equation:\n",
+    "This result is called the [*posterior*](https://en.wikipedia.org/wiki/Posterior_probability), which is short for *posterior probability distribution*. All this means is a probability distribution *after* incorporating the measurement information (posterior means 'after' in this context). To review, the *prior* is the probability distribution before including the measurement's information. Another term is the [*likelihood*](https://en.wikipedia.org/wiki/Likelihood_function) - this is the probability for the [*evidence*](https://en.wikipedia.org/wiki/Marginal_likelihood) (the measurement in this case) being true. That gives us this equation:\n",
    "\n",
    "$$\\mathtt{posterior} = \\frac{\\mathtt{prior}\\times \\mathtt{likelihood}}{\\mathtt{normalization}}$$ \n",
    "\n",
@ -775,13 +765,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Incorporating Movement"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Incorporating Movement\n",
+    "\n",
    "Recall how quickly we were able to find an exact solution when we incorporated a series of measurements and movement updates. However, that occurred in a fictional world of perfect sensors. Might we be able to find an exact solution with noisy sensors?\n",
    "\n",
    "Unfortunately, the answer is no. Even if the sensor readings perfectly match an extremely complicated hallway map, we cannot be 100% certain that the dog is in a specific position - there is, after all, the possibility that every sensor reading was wrong! Naturally, in a more typical situation most sensor readings will be correct, and we might be close to 100% sure of our answer, but never 100% sure. This may seem complicated, but lets go ahead and program the math.\n",
@ -957,13 +942,8 @@
   "source": [
    "Here the results are more complicated, but you should still be able to work it out in your head. The 0.04 is due to the possibility that the 0.4 belief undershot by 1. The 0.38 is due to the following: the 80% chance that we moved 2 positions (0.4 $\\times$ 0.8) and the 10% chance that we undershot (0.6 $\\times$ 0.1). Overshooting plays no role here because if we overshot both 0.4 and 0.6 would be past this position. **I strongly suggest working some examples until all of this is very clear, as so much of what follows depends on understanding this step.**\n",
    "\n",
-    "If you look at the probabilities after performing the update you probably feel dismay. In the example above we started with probabilities of 0.4 and 0.6 in two positions; after performing the update the probabilities are not only lowered, but they are strewn out across the map."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "If you look at the probabilities after performing the update you probably feel dismay. In the example above we started with probabilities of 0.4 and 0.6 in two positions; after performing the update the probabilities are not only lowered, but they are strewn out across the map.\n",
+    "\n",
    "This is not a coincidence, or the result of a carefully chosen example - it is always true of the evolution (predict step). If the sensor is noisy we lose some information on every prediction. Suppose we were to perform the prediction an infinite number of times - what would the result be? If we lose information on every step, we must eventually end up with no information at all, and our probabilities will be equally distributed across the `belief` array. Let's try this with 500 iterations."
   ]
  },
@ -1015,7 +995,7 @@
    "\n",
    "We made the assumption that the movement error is at most one position. But it is possible for the error to be two, three, or more positions. \n",
    "\n",
-    "This is easily solved with *convolution*. Convolution modifies one function with another function. In our case we are modifying a probability distribution with the error function of the sensor. The implementation of `predict_move()` is a convolution, though we did not call it that. Formally, convolution is defined as\n",
+    "This is easily solved with [*convolution*](https://en.wikipedia.org/wiki/Convolution). Convolution modifies one function with another function. In our case we are modifying a probability distribution with the error function of the sensor. The implementation of `predict_move()` is a convolution, though we did not call it that. Formally, convolution is defined as\n",
    "\n",
    "$$ (f \\ast g) (t) = \\int_0^t \\!f(\\tau) \\, g(t-\\tau) \\, \\mathrm{d}\\tau$$\n",
    "\n",
@ -1027,7 +1007,7 @@
    "\n",
    "Comparison shows that `predict_move()` is computing this equation.\n",
    "\n",
-    "Khan Academy [4] has a good introduction to convolution, and Wikipedia has some excellent animations of convolutions [5]. But the general idea is already clear. You slide an array called the *kernel* across another array, multiplying the neighbors of the current cell with the values of the second array. In our example above we used 0.8 for the probability of moving to the correct location, 0.1 for undershooting, and 0.1 for overshooting. We make a kernel of this with the array `[0.1, 0.8, 0.1]`. All we need to do is write a loop that goes over each element of our array, multiplying by the kernel, and summing the results. To emphasize that the belief is a probability distribution I have named it `pdf`."
+    "[Khan Academy](https://www.khanacademy.org/math/differential-equations/laplace-transform/convolution-integral/v/introduction-to-the-convolution) [4] has a good introduction to convolution, and Wikipedia has some excellent animations of convolutions [5]. But the general idea is already clear. You slide an array called the *kernel* across another array, multiplying the neighbors of the current cell with the values of the second array. In our example above we used 0.8 for the probability of moving to the correct location, 0.1 for undershooting, and 0.1 for overshooting. We make a kernel of this with the array `[0.1, 0.8, 0.1]`. All we need to do is write a loop that goes over each element of our array, multiplying by the kernel, and summing the results. To emphasize that the belief is a probability distribution I have named it `pdf`."
   ]
  },
  {
@ -1352,9 +1332,9 @@
    "The filter equations are:\n",
    "\n",
    "$$\\begin{aligned} \\bar {\\mathbf x} &= \\mathbf x \\ast f_{\\mathbf x}(\\bullet)\\, \\, &\\text{Predict Step} \\\\\n",
-    "\\mathbf x &= \\|L \\cdot \\bar{\\mathbf x}\\|\\, \\, &\\text{Update Step}\\end{aligned}$$\n",
+    "\\mathbf x &= \\|\\mathcal L \\cdot \\bar{\\mathbf x}\\|\\, \\, &\\text{Update Step}\\end{aligned}$$\n",
    "\n",
-    "$L$ is the usual way to write the Likelihood function, so I use that. The $\\|\\bullet\\|$ notation denotes the norm.\n",
+    "$\\mathcal L$ is the usual way to write the likelihood function, so I use that. The $\\|\\bullet\\|$ notation denotes the norm.\n",
    "\n",
    "We can express this in pseudocode.\n",
    "\n",
@ -1376,13 +1356,8 @@
    "\n",
    "When we cover the Kalman filter we will use this exact same algorithm; only the details of the computation will differ. \n",
    "\n",
-    "Algorithms in this form are sometimes called *predictor correctors*. We make a prediction, then correct them."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "Algorithms in this form are sometimes called *predictor correctors*. We make a prediction, then correct them.\n",
+    "\n",
    "Finally, for those viewing this in a Jupyter Notebook or on the web, here is an animation of that algorithm.\n",
    "<img src=\"animations/02_simulate.gif\">"
   ]
@ -1508,13 +1483,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Drawbacks and Limitations"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Drawbacks and Limitations\n",
+    "\n",
    "Do not be mislead by the simplicity of the examples I chose. This is a robust and complete filter, and you may use the code in real world solutions. If you need a multimodal, discrete filter, this filter works.\n",
    "\n",
    "With that said, this filter it is not used often because it has several limitations. Getting around those limitations is the motivation behind the chapters in the rest of this book.\n",
@ -1805,7 +1775,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We developed the math in this chapter merely by reasoning about the information we have at each moment. In the process we discovered *Bayes Theorem*. Bayes theorem tells us how to compute the probability of an event given previous information. That is exactly what we have been doing in this chapter. With luck our code should match the Bayes Theorem equation! \n",
+    "We developed the math in this chapter merely by reasoning about the information we have at each moment. In the process we discovered [*Bayes Theorem*](https://en.wikipedia.org/wiki/Bayes%27_theorem). Bayes theorem tells us how to compute the probability of an event given previous information. That is exactly what we have been doing in this chapter. With luck our code should match the Bayes Theorem equation! \n",
    "\n",
    "We implemented the `update()` function with this probability calculation:\n",
    "\n",
@ -1819,7 +1789,7 @@
    "\n",
    "If you are not familiar with this notation, let's review. $P(A)$ means the probability of event $A$. If $A$ is the event of a fair coin landing heads, then $P(A) = 0.5$.\n",
    "\n",
-    "$P(A \\mid B)$ is called a *conditional probability*. That is, it represents the probability of $A$ happening *if* $B$ happened. For example, it is more likely to rain today if it also rained yesterday because rain systems usually last more than one day. We'd write the probability of it raining today given that it rained yesterday as $P(\\mathtt{rain\\_today} \\mid \\mathtt{rain\\_yesterday})$.\n",
+    "$P(A \\mid B)$ is called a [*conditional probability*](https://en.wikipedia.org/wiki/Conditional_probability). That is, it represents the probability of $A$ happening *if* $B$ happened. For example, it is more likely to rain today if it also rained yesterday because rain systems usually last more than one day. We'd write the probability of it raining today given that it rained yesterday as $P(\\mathtt{rain\\_today} \\mid \\mathtt{rain\\_yesterday})$.\n",
    "\n",
    "In the Bayes theorem equation above $B$ is the *evidence*, $P(A)$ is the *prior*, $P(B \\mid A)$ is the *likelihood*, and $P(A \\mid B)$ is the *posterior*. By substituting the mathematical terms with the corresponding words  you can see that Bayes theorem matches out update equation. Let's rewrite the equation in terms of our problem. We will use $x_i$ for the position at *i*, and $Z$ for the measurement. Hence, we want to know $P(x_i \\mid Z)$, that is, the probability of the dog being at $x_i$ given the measurement $Z$. \n",
    "\n",
@ -1841,7 +1811,7 @@
    "\n",
    "$$P(A \\mid B) = \\frac{P(B \\mid A)\\, P(A)}{\\int P(B \\mid A) P(B) \\mathtt{d}y}\\cdot$$\n",
    "\n",
-    "In practice the denominator can be fiendishly difficult to solve analytically (a recent opinion piece for the Royal Statistical Society called it a \"dog's breakfast\" [8].  Filtering textbooks are filled with integral laden equations which you cannot be expected to solve. We will learn more techniques to handle this in the **Particle Filters** chapter. Until then, recognize that in practice it is just a normalization term over which we can sum. What I'm trying to say is that when you are faced with a page of integrals, just think of them of sums, and relate them back to this chapter, and often the difficulties will fade. Ask yourself \"why are we summing these values\", and \"why am I dividing by this term\". Surprisingly often the answer is readily apparent."
+    "In practice the denominator can be fiendishly difficult to solve analytically (a recent opinion piece for the Royal Statistical Society [called it](http://www.statslife.org.uk/opinion/2405-we-need-to-rethink-how-we-teach-statistics-from-the-ground-up) a \"dog's breakfast\" [8].  Filtering textbooks are filled with integral laden equations which you cannot be expected to solve. We will learn more techniques to handle this in the **Particle Filters** chapter. Until then, recognize that in practice it is just a normalization term over which we can sum. What I'm trying to say is that when you are faced with a page of integrals, just think of them of sums, and relate them back to this chapter, and often the difficulties will fade. Ask yourself \"why are we summing these values\", and \"why am I dividing by this term\". Surprisingly often the answer is readily apparent."
   ]
  },
  {
@ -1852,7 +1822,7 @@
   "source": [
    "## Total Probability Theorem\n",
    "\n",
-    "We know now the formal mathematics behind the `update()` function; what about the `predict()` function? `predict()` implements the *total probability theorem*. Let's recall what `predict()` computed. It computed the probability of being at any given position given the probability of all the possible movement events. Let's express that as an equation. The probability of being at any position $i$ at time $t$ can be written as $P(X_i^t)$. We computed that as the sum of the prior at time $t-1$ $P(X_j^{t-1})$ multiplied by the probability of moving from cell $x_j$ to $x_i$. That is\n",
+    "We know now the formal mathematics behind the `update()` function; what about the `predict()` function? `predict()` implements the [*total probability theorem*](https://en.wikipedia.org/wiki/Law_of_total_probability). Let's recall what `predict()` computed. It computed the probability of being at any given position given the probability of all the possible movement events. Let's express that as an equation. The probability of being at any position $i$ at time $t$ can be written as $P(X_i^t)$. We computed that as the sum of the prior at time $t-1$ $P(X_j^{t-1})$ multiplied by the probability of moving from cell $x_j$ to $x_i$. That is\n",
    "\n",
    "$$P(X_i^t) = \\sum_j P(X_j^{t-1})  P(x_i | x_j)$$\n",
    "\n",
@ -1870,13 +1840,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Summary"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Summary\n",
+    "\n",
    "The code is very short, but the result is impressive! We have implemented a form of a Bayesian filter. We have learned how to start with no information and derive information from noisy sensors. Even though the sensors in this chapter are very noisy (most sensors are more than 80% accurate, for example) we quickly converge on the most likely position for our dog. We have learned how the predict step always degrades our knowledge, but the addition of another measurement, even when it might have noise in it, improves our knowledge, allowing us to converge on the most likely result.\n",
    "\n",
    "This book is mostly about the Kalman filter. The math it uses is different, but the logic is exactly the same as used in this chapter. It uses Bayesian reasoning to form estimates from a combination of measurements and process models. \n",
@ -1888,13 +1853,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## References"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## References\n",
+    "\n",
    " * [1] D. Fox, W. Burgard, and S. Thrun. \"Monte carlo localization: Efficient position estimation for mobile robots.\" In *Journal of Artifical Intelligence Research*, 1999.\n",
    " \n",
    " http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume11/fox99a-html/jair-localize.html\n",
--- a/03-Gaussians.ipynb
+++ b/03-Gaussians.ipynb
@ -310,13 +310,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Introduction"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Introduction\n",
+    "\n",
    "The last chapter ended by discussing some of the drawbacks of the Discrete Bayesian filter. For many tracking and filtering problems our desire is to have a filter that is *unimodal* and *continuous*. That is, we want to model our system using floating point math (continuous) and to have only one belief represented (unimodal). For example, we want to say an aircraft is at (12.34, -95.54, 2389.5) where that is latitude, longitude, and altitude. We do not want our filter to tell us \"it might be at (1.65, -78.01, 2100.45) or it might be at (34.36, -98.23, 2543.79).\" That doesn't match our physical intuition of how the world works, and as we discussed, it can be prohibitively expensive to compute the multimodal case. And, of course, multiple position estimates makes navigating impossible.\n",
    "\n",
    "We desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is computationally efficient to calculate. As you might guess from the chapter name, Gaussian distributions provide all of these features."
@ -333,9 +328,9 @@
    "\n",
    "Each time you roll a die the *outcome* will be between 1 and 6. If we rolled a fair die a million times we'd expect to get 1 1/6 of the time. Thus we say the *probability*, or *odds* of the outcome 1 is 1/6. Likewise, if I asked you the chance of 1 being the result of the next roll you'd reply 1/6.  \n",
    "\n",
-    "This combination of values and associated probabilities is called a *random variable*. *Random* does not mean the process is nondeterministic, only that we lack information. The result of a die toss is deterministic, but we lack enough information to compute the result. We don't know what will happen, except probabilistically.\n",
+    "This combination of values and associated probabilities is called a [*random variable*](https://en.wikipedia.org/wiki/Random_variable). *Random* does not mean the process is nondeterministic, only that we lack information. The result of a die toss is deterministic, but we lack enough information to compute the result. We don't know what will happen, except probabilistically.\n",
    "\n",
-    "While we are defining things, the range of values is called the *sample space*. For a die the sample space is {1, 2, 3, 4, 5, 6}. For a coin the sample space is {H, T}. *Space* is a mathematical term which means a set with structure. The sample space for the die is a subset of the natural numbers in the range of 1 to 6.\n",
+    "While we are defining things, the range of values is called the [*sample space*](https://en.wikipedia.org/wiki/Sample_space). For a die the sample space is {1, 2, 3, 4, 5, 6}. For a coin the sample space is {H, T}. *Space* is a mathematical term which means a set with structure. The sample space for the die is a subset of the natural numbers in the range of 1 to 6.\n",
    "\n",
    "Another example of a random variable is the heights of students in a university. Here the sample space is a range of values in the real numbers between two limits defined by biology.\n",
    "\n",
@ -353,7 +348,7 @@
    "## Probability Distribution\n",
    "\n",
    "\n",
-    "The *probability distribution* gives the probability for the random variable to take any value in a sample space. For example, for a fair six sided die we might say:\n",
+    "The [*probability distribution*](https://en.wikipedia.org/wiki/Probability_distribution) gives the probability for the random variable to take any value in a sample space. For example, for a fair six sided die we might say:\n",
    "\n",
    "|Value|Probability|\n",
    "|-----|-----------|\n",
@ -396,7 +391,7 @@
   "source": [
    "### The Mean, Median, and Mode of a Random Variable\n",
    "\n",
-    "Given a set of data we often want to know a representative or average value for that set. There are many measures for this, and the concept is called a *measure of central tendency*. For example we will want to know the *average* height of the students. We all know how to find the average, but let me belabor the point so I can introduce more formal notation and terminology. Another word for average is the *mean*. We compute the mean by summing the values and dividing by the number of values. If the heights of the students in meters is \n",
+    "Given a set of data we often want to know a representative or average value for that set. There are many measures for this, and the concept is called a [*measure of central tendency*](https://en.wikipedia.org/wiki/Central_tendency). For example we will want to know the *average* height of the students. We all know how to find the average, but let me belabor the point so I can introduce more formal notation and terminology. Another word for average is the *mean*. We compute the mean by summing the values and dividing by the number of values. If the heights of the students in meters is \n",
    "\n",
    "$$X = \\{1.8, 2.0, 1.7, 1.9, 1.6\\}$$\n",
    "\n",
@ -470,7 +465,7 @@
   "source": [
    "## Expected Value of a Random Variable\n",
    "\n",
-    "The *expected value* of a random variable is the average value it would have if we took an infinite number of samples of it and then averaged those samples together. Let's say we have $x=[1,3,5]$ and each value is equally probable. What would we *expect* $x$ to have, on average?\n",
+    "The [*expected value*](https://en.wikipedia.org/wiki/Expected_value) of a random variable is the average value it would have if we took an infinite number of samples of it and then averaged those samples together. Let's say we have $x=[1,3,5]$ and each value is equally probable. What would we *expect* $x$ to have, on average?\n",
    "\n",
    "It would be the average of 1, 3, and 5, of course, which is 3. That should make sense; we would expect equal numbers of 1, 3, and 5 to occur, so $(1+3+5)/3=3$ is clearly the average of that infinite series of samples. In other words, here the expected value is the *mean* of the sample space.\n",
    "\n",
@ -551,16 +546,11 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The mean of each class is 1.8 meters, but notice that there is a much greater amount of variation in the heights in the second class than in the first class, and that there is no variation at all in the third class."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "The mean of each class is 1.8 meters, but notice that there is a much greater amount of variation in the heights in the second class than in the first class, and that there is no variation at all in the third class.\n",
+    "\n",
    "The mean tells us something about the data, but not the whole story. We want to be able to specify how much *variation* there is between the heights of the students. You can imagine a number of reasons for this. Perhaps a school district needs to order 5,000 desks, and they want to be sure they buy sizes that accommodate the range of heights of the students. \n",
    "\n",
-    "Statistics has formalized this concept of measuring variation into the notion of *standard deviation* and *variance*. The equation for computing the *variance* is\n",
+    "Statistics has formalized this concept of measuring variation into the notion of [*standard deviation*](https://en.wikipedia.org/wiki/Standard_deviation) and [*variance*](https://en.wikipedia.org/wiki/Variance). The equation for computing the variance is\n",
    "\n",
    "$$\\mathit{VAR}(X) = E[(X - \\mu)^2]$$\n",
    "\n",
@ -648,13 +638,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "And, of course, $0.1414^2 = 0.02$, which agrees with our earlier computation of the variance."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "And, of course, $0.1414^2 = 0.02$, which agrees with our earlier computation of the variance.\n",
+    "\n",
    "What does the standard deviation signify? It tells us how much the heights vary amongst themselves. \"How much\" is not a mathematical term. We will be able to define it much more precisely once we introduce the concept of a Gaussian in the next section. For now I'll say that for many things 68% of all values lie within one standard deviation of the mean. In other words we can conclude that for a random class 68% of the students will have heights between 1.66 (1.8-0.1414) meters and 1.94 (1.8+0.1414) meters. \n",
    "\n",
    "We can view this in a plot:"
@ -737,13 +722,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can see by eye that roughly 68% of the heights lie within $\\pm1\\sigma$ of the mean 1.8."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "We can see by eye that roughly 68% of the heights lie within $\\pm1\\sigma$ of the mean 1.8.\n",
+    "\n",
    "We'll discuss this in greater depth soon. For now let's compute the standard deviation for \n",
    "\n",
    "$$Y = [2.2, 1.5, 2.3, 1.7, 1.3]$$\n",
@ -897,22 +877,12 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Gaussians"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We are now ready to learn about Gaussians. Let's remind ourselves of the motivation for this chapter.\n",
+    "## Gaussians\n",
+    "\n",
+    "We are now ready to learn about [Gaussians](https://en.wikipedia.org/wiki/Gaussian_function). Let's remind ourselves of the motivation for this chapter.\n",
+    "\n",
+    "> We desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is computationally efficient to calculate.\n",
    "\n",
-    "> We desire a unimodal, continuous way to represent probabilities that models how the real world works, and that is computationally efficient to calculate."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
    "Let's look at a graph of a Gaussian distribution to get a sense of what we are talking about."
   ]
  },
@ -944,7 +914,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "This curves is a *probability density function* or *pdf* for short. It shows the relative likelihood  for the random variable to take on a value. In the chart above, a student is somewhat more likely to have a height near 1.8 m than 1.7 m, and far more likely to have a height of 1.9 m vs 1.1 m.\n",
+    "This curve is a [*probability density function*](https://en.wikipedia.org/wiki/Probability_density_function) or *pdf* for short. It shows the relative likelihood  for the random variable to take on a value. In the chart above, a student is somewhat more likely to have a height near 1.8 m than 1.7 m, and far more likely to have a height of 1.9 m vs 1.1 m.\n",
    "\n",
    "> I explain how to plot Gaussians, and much more, in the Notebook *Computing_and_Plotting_PDFs* in the \n",
    "Supporting_Notebooks folder. You can read it online [here](https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/Supporting_Notebooks/Computing_and_plotting_PDFs.ipynb) [1].\n",
@ -985,13 +955,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Nomenclature"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Nomenclature\n",
+    "\n",
    "A bit of nomenclature before we continue - this chart depicts the *probability density* of a *random variable* having any value between ($-\\infty..\\infty)$. What does that mean? Imagine we take an infinite number of infinitely precise measurements of the speed of automobiles on a section of highway. We could then plot the results by showing the relative number of cars going past at any given speed. If the average was 120 kph, it might look like this:"
   ]
  },
@ -1033,26 +998,16 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Gaussian Distributions"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Gaussian Distributions\n",
+    "\n",
    "Let's explore how Gaussians work. A Gaussian is a *continuous probability distribution* that is completely described with two parameters, the mean ($\\mu$) and the variance ($\\sigma^2$). It is defined as:\n",
    "\n",
    "$$ \n",
    "f(x, \\mu, \\sigma) = \\frac{1}{\\sigma\\sqrt{2\\pi}} \\exp\\big [{-\\frac{(x-\\mu)^2}{2\\sigma^2} }\\big ]\n",
    "$$\n",
    "\n",
-    "$\\exp[x]$ is notation for $e^x$."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "$\\exp[x]$ is notation for $e^x$.\n",
+    "\n",
    "<p> Don't be dissuaded by the equation if you haven't seen it before; you will not need to memorize or manipulate it. The computation of this function is stored in `stats.py` with the function `gaussian(x, mean, var)`.\n",
    "\n",
    "> **Optional:** Let's remind ourselves how to look at a function stored in a file by using the *%load* magic. If you type *%load -s gaussian stats.py* into a code cell and then press CTRL-Enter, the notebook will create a new input cell and load the function into it.\n",
@ -1067,13 +1022,8 @@
    "    return (np.exp((-0.5*(np.asarray(x)-mean)**2)/var) /\n",
    "            math.sqrt(2*math.pi*var))\n",
    "\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "```\n",
+    "\n",
    "<p><p><p><p>We will plot a Gaussian with a mean of 22 $(\\mu=22)$, with a variance of 4 $(\\sigma^2=4)$, and then discuss what this means. "
   ]
  },
@ -1105,19 +1055,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "What does this curve *mean*? Assume we have a thermometer which reads 22°C. No thermometer is perfectly accurate, and so we expect that each reading will be slightly off the actual value. However, a theorem called  *Central Limit Theorem* states that if we make many measurements that the measurements will be normally distributed. When we look at this chart we can \"sort of\" think of it as representing the probability of the thermometer reading a particular value given the actual temperature of 22°C. \n",
+    "What does this curve *mean*? Assume we have a thermometer which reads 22°C. No thermometer is perfectly accurate, and so we expect that each reading will be slightly off the actual value. However, a theorem called  [*Central Limit Theorem*](https://en.wikipedia.org/wiki/Central_limit_theorem) states that if we make many measurements that the measurements will be normally distributed. When we look at this chart we can \"sort of\" think of it as representing the probability of the thermometer reading a particular value given the actual temperature of 22°C. \n",
    "\n",
    "Recall that a Gaussian distribution is *continuous*. Think of an infinitely long straight line - what is the probability that a point you pick randomly is at 2. Clearly 0%, as there is an infinite number of choices to choose from. The same is true for normal distributions; in the graph above the probability of being *exactly* 2°C is 0% because there are an infinite number of values the reading can take.\n",
    "\n",
    "What is this curve? It is something we call the *probability density function.* The area under the curve at any region gives you the probability of those values. So, for example, if you compute the area under the curve between 20 and 22 the resulting area will be the probability of the temperature reading being between those two temperatures. \n",
    "\n",
-    "We can think of this in Bayesian terms or frequentist terms. As a Bayesian, if the thermometer reads exactly 22°C, then our belief is described by the curve - our belief that the actual (system) temperature is near 22 is very high, and our belief that the actual temperature is near 18 is very low. As a frequentist we would say that if we took 1 billion temperature measurements of a system at exactly 22°C, then a histogram of the measurements would look like this curve. "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "We can think of this in Bayesian terms or frequentist terms. As a Bayesian, if the thermometer reads exactly 22°C, then our belief is described by the curve - our belief that the actual (system) temperature is near 22 is very high, and our belief that the actual temperature is near 18 is very low. As a frequentist we would say that if we took 1 billion temperature measurements of a system at exactly 22°C, then a histogram of the measurements would look like this curve. \n",
+    "\n",
    "How do you compute the probability, or area under the curve? You integrate the equation for the Gaussian \n",
    "\n",
    "$$ \\int^{x_1}_{x_0}  \\frac{1}{\\sigma\\sqrt{2\\pi}} e^{-\\frac{1}{2}{(x-\\mu)^2}/\\sigma^2 } dx$$\n",
@ -1165,13 +1110,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## The Variance and Belief"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## The Variance and Belief\n",
+    "\n",
    "Since this is a probability density distribution it is required that the area under the curve always equals one. This should be intuitively clear — the area under the curve represents all possible outcomes, *something* happened, and the probability of *something happening* is one, so the density must sum to one. We can prove this ourselves with a bit of code. (If you are mathematically inclined, integrate the Gaussian equation from $-\\infty$ to $\\infty$)"
   ]
  },
@ -1243,19 +1183,19 @@
    "\n",
    "An equivalent formation for a Gaussian is $\\mathcal{N}(\\mu,1/\\tau)$ where $\\mu$ is the *mean* and $\\tau$ the *precision*. $1/\\tau = \\sigma^2$; it is the reciprocal of the variance. While we do not use this formulation in this book, it underscores that the variance is a measure of how precise our data is. A small variance yields large precision — our measurement is very precise. Conversely, a large variance yields low precision — our belief is spread out across a large area. You should become comfortable with thinking about Gaussians in these equivalent forms. In Bayesian terms Gaussians reflect our *belief* about a measurement, they express the *precision* of the measurement, and they express how much *variance* there is in the measurements. These are all different ways of stating the same fact.\n",
    "\n",
-    "I'm getting ahead of myself, but in the next chapters we will use Gaussians to express our belief in things like the estimated position of the object we are tracking, or the accuracy of the sensors we are using.\n",
-    "\n",
-    "## The  68-95-99.7 Rule\n",
-    "\n",
-    "It is worth spending a few words on standard deviation now. The standard deviation is a measure of how much variation from the mean exists. For Gaussian distributions, 68% of all the data falls within one standard deviation ($\\pm1\\sigma$) of the mean, 95% falls within two standard deviations ($\\pm2\\sigma$), and 99.7% within three ($\\pm3\\sigma$). This is often called the 68-95-99.7 rule. If you were told that the average test score in a class was 71 with a standard deviation of 9.4, you could conclude that 95% of the students received a score between 52.2 and 89.8 if the distribution is normal (that is calculated with $71 \\pm (2 * 9.4)$). \n",
-    "\n",
-    "Finally, these are not arbitrary numbers. If the Gaussian for our position is $\\mu=22$ meters, then the standard deviation also has units meters. Thus $\\sigma=0.2$ implies that 68% of the measurements range from 21.8 to 22.2 meters. Variance is the standard deviation squared, thus $\\sigma^2 = .04$ meters$^2$."
+    "I'm getting ahead of myself, but in the next chapters we will use Gaussians to express our belief in things like the estimated position of the object we are tracking, or the accuracy of the sensors we are using."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
+    "## The  68-95-99.7 Rule\n",
+    "\n",
+    "It is worth spending a few words on standard deviation now. The standard deviation is a measure of how much variation from the mean exists. For Gaussian distributions, 68% of all the data falls within one standard deviation ($\\pm1\\sigma$) of the mean, 95% falls within two standard deviations ($\\pm2\\sigma$), and 99.7% within three ($\\pm3\\sigma$). This is often called the [68-95-99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule). If you were told that the average test score in a class was 71 with a standard deviation of 9.4, you could conclude that 95% of the students received a score between 52.2 and 89.8 if the distribution is normal (that is calculated with $71 \\pm (2 * 9.4)$). \n",
+    "\n",
+    "Finally, these are not arbitrary numbers. If the Gaussian for our position is $\\mu=22$ meters, then the standard deviation also has units meters. Thus $\\sigma=0.2$ implies that 68% of the measurements range from 21.8 to 22.2 meters. Variance is the standard deviation squared, thus $\\sigma^2 = .04$ meters$^2$.\n",
+    "\n",
    "The following graph depicts the relationship between the standard deviation and the normal distribution. "
   ]
  },
@ -1287,13 +1227,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Interactive Gaussians"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Interactive Gaussians\n",
+    "\n",
    "For those that are reading this in a Jupyter Notebook, here is an interactive version of the Gaussian plots. Use the sliders to modify $\\mu$ and $\\sigma^2$. Adjusting $\\mu$ will move the graph to the left and right because you are adjusting the mean, and adjusting $\\sigma^2$ will make the bell curve thicker and thinner."
   ]
  },
@ -1342,20 +1277,15 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Computational Properties of Gaussians"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Computational Properties of Gaussians\n",
+    "\n",
    "A remarkable property of Gaussians is that the product of two independent Gaussians is another Gaussian! The sum is not Gaussian, but proportional to a Gaussian.\n",
    "\n",
    "The discrete Bayes filter works by multiplying and adding probabilities. I'm getting ahead of myself, but the Kalman filter uses Gaussians instead of probabilities, but the rest of the algorithm remains the same. This means we will need to multiply and add Gaussians. \n",
    "\n",
    "The Gaussian is a nonlinear function, and typically if you multiply a nonlinear equation with itself you end up with a different equation. For example, the shape of `sin(x)sin(x)` is very different from `sin(x)`. But the result of multiplying two Gaussians is yet another Gaussian. This is a fundamental property, and a key reason why Kalman filters are computationally feasible. Said another way, Kalman filters use Gaussians *because* they are computationally nice. \n",
    "\n",
-    "The remainder of this section is optiona. I will derive the equations for the sum and product of two Gaussians. You will not need to understand this material to understand the rest of the book, so long as you accept the results. "
+    "The remainder of this section is optional. I will derive the equations for the sum and product of two Gaussians. You will not need to understand this material to understand the rest of the book, so long as you accept the results. "
   ]
  },
  {
@ -1482,14 +1412,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Computing Probabilities with scipy.stats"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this chapter I used code from FilterPy to compute and plot Gaussians. I did that to give you a chance to look at the code and see how these functions are implemented.  However, Python comes with \"batteries included\" as the saying goes, and it comes with a wide range of statistics functions in the module `scipy.stats`. So let's walk through how to use scipy.stats to compute statistics and probabilities.\n",
+    "## Computing Probabilities with scipy.stats\n",
+    "\n",
+    "In this chapter I used code from [FilterPy](https://github.com/rlabbe/filterpy) to compute and plot Gaussians. I did that to give you a chance to look at the code and see how these functions are implemented.  However, Python comes with \"batteries included\" as the saying goes, and it comes with a wide range of statistics functions in the module `scipy.stats`. So let's walk through how to use scipy.stats to compute statistics and probabilities.\n",
    "\n",
    "The `scipy.stats` module contains a number of objects which you can use to compute attributes of various probability distributions. The full documentation for this module is here: http://http://docs.scipy.org/doc/scipy/reference/stats.html. We will focus on the  norm variable, which implements the normal distribution. Let's look at some code that uses `scipy.stats.norm` to compute a Gaussian, and compare its value to the value returned by the `gaussian()` function from FilterPy."
   ]
@ -1582,7 +1507,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can get the *cumulative distribution function (CDF)*, which is the probability that a randomly drawn value from the distribution is less than or equal to $x$."
+    "We can get the [*cumulative distribution function (CDF)*](https://en.wikipedia.org/wiki/Cumulative_distribution_function), which is the probability that a randomly drawn value from the distribution is less than or equal to $x$."
   ]
  },
  {
@ -1639,13 +1564,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Fat Tails"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Fat Tails\n",
+    "\n",
    "Earlier I mentioned the *central limit theorem*, which states that under certain conditions the arithmetic sum of any independent random variable will be normally distributed, regardless of how the random variables are distributed. This is important to us because nature is full of distributions which are not normal, but when we apply the central limit theorem over large populations we end up with normal distributions. \n",
    "\n",
    "However, a key part of the proof is “under certain conditions”. These conditions often do not hold for the physical world. The resulting distributions are called *fat tailed*. Tails is a colloquial term for the far left and right side parts of the curve where the probability density is close to zero.\n",
@ -1685,9 +1605,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The area under the curve cannot equal 1, so it is not a probability distribution. What actually happens is that more students than predicted by a normal distribution get scores nearer the upper end of the range (for example), and that tail becomes “fat”. Also, the test is probably not able to perfectly distinguish incredibly minute differences in skill in the students, so the distribution to the left of the mean is also probably a bit bunched up in places. The resulting distribution is called a *fat tail distribution*. \n",
+    "The area under the curve cannot equal 1, so it is not a probability distribution. What actually happens is that more students than predicted by a normal distribution get scores nearer the upper end of the range (for example), and that tail becomes “fat”. Also, the test is probably not able to perfectly distinguish incredibly minute differences in skill in the students, so the distribution to the left of the mean is also probably a bit bunched up in places. The resulting distribution is called a [*fat tail distribution*](https://en.wikipedia.org/wiki/Fat-tailed_distribution). \n",
    "\n",
-    "Kalman filters use sensors to measure the world. The errors in a sensor's measurements are rarely truly Gaussian. It is far too early to be talking about the difficulties that this presents to the Kalman filter designer. It is worth keeping in the back of your mind the fact that the Kalman filter math is based on an idealized model of the world.  For now I will present a bit of code that I will be using later in the book to form fat tail distributions to simulate various processes and sensors. This distribution is called the *Student's $t$-distribution*. \n",
+    "Kalman filters use sensors to measure the world. The errors in a sensor's measurements are rarely truly Gaussian. It is far too early to be talking about the difficulties that this presents to the Kalman filter designer. It is worth keeping in the back of your mind the fact that the Kalman filter math is based on an idealized model of the world.  For now I will present a bit of code that I will be using later in the book to form fat tail distributions to simulate various processes and sensors. This distribution is called the [*Student's $t$-distribution*](https://en.wikipedia.org/wiki/Student%27s_t-distribution). \n",
    "\n",
    "Let's say I want to model a sensor that has some white noise in the output. For simplicity, let's say the signal is a constant 10, and the standard deviation of the noise is 2. We can use the function `numpy.random.randn()` to get a random number with a mean of 0 and a standard deviation of 1. I can simulate this with:"
   ]
@ -1812,13 +1732,8 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Summary and Key Points"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "## Summary and Key Points\n",
+    "\n",
    "This chapter is a poor introduction to statistics in general. I've only covered the concepts that  needed to use Gaussians in the remainder of the book, no more. What I've covered will not get you very far if you intend to read the Kalman filter literature. If this is a new topic to you I suggest reading a statistics textbook. I've always liked the Schaum series for self study, and Alan Downey's *Think Stats* [5] is also very good. \n",
    "\n",
    "The following points **must** be understood by you before we continue:\n",
--- a/04-One-Dimensional-Kalman-Filters.ipynb
+++ b/04-One-Dimensional-Kalman-Filters.ipynb