Removed overuse of word 'just'.
I just use it just all the time, and it is just unnecessary.
@@ -281,7 +281,7 @@
 "\n",
 "Another coworker hears the commotion and comes over to find out what has you so excited. You explain the invention and once again step onto the scale, and proudly proclaim the result: \"161 lbs.\" And then you hesitate, confused.\n",
 "\n",
-"\"It read 172 lbs just a few seconds ago\" you complain to your coworker. \n",
+"\"It read 172 lbs a few seconds ago,\" you complain to your coworker. \n",
 "\n",
 "\"I never said it was accurate,\" she replies.\n",
 "\n",
@@ -308,7 +308,7 @@
 "\n",
 "In mathematics this concept is formalized as *expected value*, and we will cover it in depth later. For now ask yourself what would be the 'usual' thing to happen if we made one million separate readings. Sometimes both scales will read too low, sometimes both will read too high, and the rest of the time they will straddle the actual weight. If they straddle the actual weight then certainly we should choose a number between A and B. If they don't straddle then we don't know if they are both too high or low, but by choosing a number between A and B we at least mitigate the effect of the worst measurement. For example, suppose our actual weight is 180 lbs. 160 lbs is a big error. But if we choose a weight between 160 lbs and 170 lbs our estimate will be better than 160 lbs. The same argument holds if both scales returned a value greater than the actual weight.\n",
 "\n",
-"We will deal with this more formally later, but for now I hope it is clear that our best estimate is just the average of A and B. $\\frac{160+170}{2} = 165$.\n",
+"We will deal with this more formally later, but for now I hope it is clear that our best estimate is the average of A and B. $\\frac{160+170}{2} = 165$.\n",
 "\n",
 "We can look at this graphically. I have plotted the measurements of A and B with an assumed error of $\\pm$ 8 lbs. The overlap falls between 160 and 170 so the only weight that makes sense must lie within 160 and 170 pounds."
 ]
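The reasoning in this hunk can be checked with a few lines of plain Python (my own sketch, not a cell from the notebook):

```python
# Two scale readings with an assumed maximum error of +/- 8 lbs each.
a, b = 160, 170
error = 8

# Best single estimate with no other information: the midpoint.
estimate = (a + b) / 2  # 165.0

# The true weight must lie where the two error intervals overlap.
low = max(a - error, b - error)   # 162
high = min(a + error, b + error)  # 168
print(estimate, (low, high))
```

This reproduces both claims in the text: the average is 165 lbs, and the only possible weights lie between 162 and 168 lbs.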
@@ -343,7 +343,7 @@
 "source": [
 "So 165 lbs looks like a reasonable estimate, but there is more information here that we might be able to take advantage of. The only weights that are possible lie in the intersection between the error bars of A and B. For example, a weight of 161 lbs is impossible because scale B could not give a reading of 170 lbs with a maximum error of 8 pounds. Likewise a weight of 171 lbs is impossible because scale A could not give a reading of 160 lbs with a maximum error of 8 lbs. In this example the only possible weights lie in the range of 162 to 168 lbs.\n",
 "\n",
-"That doesn't yet allow us to find a better weight estimate, but let's play 'what if' some more. What if we are now told that A is three times more accurate than B? Consider the 5 options we listed above. It still makes no sense to choose a number outside the range of A and B, so we will not consider those. It perhaps seems more compelling to choose A as our estimate - after all, we know it is more accurate, why not just use it instead of B? Can B possibly improve our knowledge over A alone?\n",
+"That doesn't yet allow us to find a better weight estimate, but let's play 'what if' some more. What if we are now told that A is three times more accurate than B? Consider the 5 options we listed above. It still makes no sense to choose a number outside the range of A and B, so we will not consider those. It perhaps seems more compelling to choose A as our estimate - after all, we know it is more accurate, why not use it instead of B? Can B possibly improve our knowledge over A alone?\n",
 "\n",
 "The answer, perhaps counterintuitively, is yes, it can. First, let's look at the same measurements of A=160 and B=170, but with the error of A being $\\pm$ 3 lbs and the error of B 3 times as much, $\\pm$ 9 lbs."
 ]
@@ -375,7 +375,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The overlap of the error bars of A and B are the only possible true weight. This overlap is smaller than the error in A alone. More importantly, in this case we can see that the overlap doesn't include 160 lbs or 165 lbs. If we only used the measurement from A because it is more accurate than B we would give an estimate of 160 lbs. If we just averaged A and B together we would get 165 lbs. Neither of those weights are possible given our knowledge of the accuracy of the scales. By including the measurement of B we would give an estimate somewhere between 161 lbs and 163 lbs, the limits of the intersections of the two error bars.\n",
+"The overlap of the error bars of A and B is the only possible true weight. This overlap is smaller than the error in A alone. More importantly, in this case we can see that the overlap doesn't include 160 lbs or 165 lbs. If we only used the measurement from A because it is more accurate than B we would give an estimate of 160 lbs. If we average A and B we get 165 lbs. Neither of those weights is possible given our knowledge of the accuracy of the scales. By including the measurement of B we would give an estimate somewhere between 161 lbs and 163 lbs, the limits of the intersections of the two error bars.\n",
 "\n",
 "Let's take this to the extreme limits. Assume we know scale A is accurate to 1 lb. In other words, if we truly weigh 170 lbs, it could report 169, 170, or 171 lbs. We also know that scale B is accurate to 9 lbs. We do a weighing on each scale, and get A=160, and B=170. What should we estimate our weight to be? Let's look at that graphically."
 ]
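The unequal-accuracy case above is also easy to verify numerically (a minimal sketch, not notebook code):

```python
# Scale A reads 160 with +/- 3 lbs error; scale B reads 170 with +/- 9 lbs error.
a, a_err = 160, 3
b, b_err = 170, 9

# Intersection of the two error intervals.
low = max(a - a_err, b - b_err)   # max(157, 161) = 161
high = min(a + a_err, b + b_err)  # min(163, 179) = 163

# Neither A's reading (160) nor the plain average (165) falls in the overlap.
print(low, high)
```

This confirms the hunk's claim: even though B is much less accurate, it narrows the possible weights to 161-163 lbs and rules out both 160 and 165.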
@@ -413,7 +413,7 @@
 "\n",
 "However, we have strayed from our problem. No customer is going to want to buy multiple scales, and besides, we initially started with an assumption that all scales were equally (in)accurate. This insight of using all measurements regardless of accuracy will play a large role later, so don't forget it.\n",
 "\n",
-"What if I have one scale, but I weigh myself many times? We concluded that if we had two scales of equal accuracy we should average the results of their measurements. What if I weigh myself 10,000 times with one scale? We have already stated that the scale is equally likely to return a number too large as it is to return one that is too small. It is not that hard to prove that the average of a large number of weights will be very close to the actual weight, but let's just write a simulation for now."
+"What if I have one scale, but I weigh myself many times? We concluded that if we had two scales of equal accuracy we should average the results of their measurements. What if I weigh myself 10,000 times with one scale? We have already stated that the scale is equally likely to return a number too large as it is to return one that is too small. It is not that hard to prove that the average of a large number of weights will be very close to the actual weight, but let's write a simulation for now."
 ]
 },
 {
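The simulation cell itself is not shown in this diff; a minimal stand-in (my own sketch, using uniform noise as the text's "equally likely too high or too low" assumption) might look like:

```python
import random

random.seed(4)  # fixed seed so the result is reproducible
true_weight = 165

# Each reading is equally likely to be too high or too low, within +/- 8 lbs.
readings = [true_weight + random.uniform(-8, 8) for _ in range(10000)]
average = sum(readings) / len(readings)
print(average)  # very close to 165
```

With 10,000 samples the average lands well within a fraction of a pound of the true weight.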
@@ -444,7 +444,7 @@
 "source": [
 "The exact number printed depends on your random number generator, but it should be very close to 165.\n",
 "\n",
-"This code makes one assumption that probably isn't true - that the scale is just as likely to read 160 as 165 for a true weight of 165 lbs. This is almost never true. Real sensors are more likely to get readings nearer the true value, and are less and less likely to get readings the further away from the true value it gets. We will cover this in detail in the Gaussian chapter. For now, I will use without further explanation the `numpy.random.normal()` function, which will produce more values nearer 165 lbs, and fewer further away. Take it on faith for now that this will produce noisy measurements very similar to how a real scale would."
+"This code makes one assumption that probably isn't true - that the scale is as likely to read 160 as 165 for a true weight of 165 lbs. This is almost never true. Real sensors are more likely to get readings nearer the true value, and are less and less likely to get readings the further they are from the true value. We will cover this in detail in the Gaussian chapter. For now, I will use without further explanation the `numpy.random.normal()` function, which will produce more values nearer 165 lbs, and fewer further away. Take it on faith for now that this will produce noisy measurements very similar to how a real scale would."
 ]
 },
 {
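For context, a typical use of the `numpy.random.normal()` function named in this hunk (the standard deviation of 5 lbs here is my assumption, not from the text):

```python
import numpy as np

np.random.seed(3)  # fixed seed for reproducibility
# 10,000 noisy readings around a true weight of 165 lbs. With normal noise
# most readings land near 165, and readings far away become increasingly rare.
measurements = np.random.normal(165, 5, size=10000)
print(measurements.mean())
```

The mean of the readings is again very close to the true weight, just as with uniform noise.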
@@ -515,7 +515,7 @@
 "source": [
 "As we can see there is an extreme range of weight changes that could be explained by these three measurements. Shall we give up? No. Recall that we are talking about measuring a human's weight. There is no way for a human to weigh 180 lbs on day 1 and 160 lbs on day 3, or to lose 30 lbs in one day only to gain it back the next (we will assume no amputations or other trauma has happened to the person). The behavior of the physical system we are measuring should influence how we interpret the measurements. \n",
 " \n",
-"Suppose I take a different scale, and I get the following measurements: 169, 170, 169, 171, 170, 171, 169, 170, 169, 170. What does your intuition tell you? It is possible, for example, that you gained 1 lb each day, and the noisy measurements just happens to look like you stayed the same weight. Equally, you could have lost 1 lb a day and gotten the same readings. But is that likely? How likely is it to flip a coin and get 10 heads in a row? Not very likely. We can't prove it based solely on these readings, but it seems pretty likely that my weight held steady. In the chart below I've plotted the measurements with error bars, and a likely true weight in dashed green. This dashed line is not meant to be the 'correct' answer to this problem, just one that is reasonable and could be explained by the measurement."
+"Suppose I take a different scale, and I get the following measurements: 169, 170, 169, 171, 170, 171, 169, 170, 169, 170. What does your intuition tell you? It is possible, for example, that you gained 1 lb each day, and the noisy measurements just happen to look like you stayed the same weight. Equally, you could have lost 1 lb a day and gotten the same readings. But is that likely? How likely is it to flip a coin and get 10 heads in a row? Not very likely. We can't prove it based solely on these readings, but it seems pretty likely that my weight held steady. In the chart below I've plotted the measurements with error bars, and a likely true weight in dashed green. This dashed line is not meant to be the 'correct' answer to this problem, merely one that is reasonable and could be explained by the measurements."
 ]
 },
 {
@@ -575,7 +575,7 @@
 "source": [
 "Does it 'seem' likely that I lost weight and this is just really noisy data? Not really. Does it seem likely that I held the same weight? Again, no. This data trends upwards over time; not evenly, but definitely upwards. We can't be sure, but that surely looks like a weight gain, and a significant weight gain at that. Let's test this assumption with some more plots. It is often easier to 'eyeball' data in a chart versus a table.\n",
 "\n",
-"So let's look at two hypotheses. First, let's assume our weight did not change. To get that number we agreed that we should just average all the measurements. Let's look at that."
+"So let's look at two hypotheses. First, let's assume our weight did not change. To get that number we agreed that we should average the measurements. Let's look at that."
 ]
 },
 {
@@ -606,7 +606,7 @@
 "source": [
 "That doesn't look very convincing. In fact, we can see that there is no horizontal line that we could draw that is inside all of the error bars.\n",
 "\n",
-"Now, let's assume we we gained weight. How much? I don't know, but numpy does! We just want to draw a line through the measurements that looks 'about' right. numpy has functions that will do this according to a rule called \"least squares fit\". Let's not worry about the details of that computation, or why we are writing our own filter if numpy provides one, and just plot the results."
+"Now, let's assume we gained weight. How much? I don't know, but numpy does! We want to draw a line through the measurements that looks 'about' right. numpy has functions that will do this according to a rule called \"least squares fit\". Let's not worry about the details of that computation, or why we are writing our own filter if numpy provides one, and plot the results."
 ]
 },
 {
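The least-squares cell referred to here is not part of the diff; a sketch of the numpy call it describes might look like this (the measurement series is assumed for illustration, anchored only to the 158 and 164.2 values mentioned later in the text):

```python
import numpy as np

# Ten daily weights (assumed data; the book's series is not shown in this diff).
weights = [158.0, 164.2, 160.3, 159.9, 162.1, 164.6,
           169.6, 167.4, 166.4, 171.0]
days = np.arange(len(weights))

# Fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(days, weights, 1)
fitted = slope * days + intercept
print(slope)  # roughly 1.2 lbs/day of gain for this assumed data
```

`np.polyfit` returns the coefficients highest degree first, which is why the slope comes before the intercept.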
@@ -640,9 +640,9 @@
 "\n",
 "\"But is it impossible?\" pipes up a coworker.\n",
 "\n",
-"Let's try something crazy. Let's just assume that I know I am gaining about one lb a day. It doesn't matter how I know that right now, just assume I know it is approximately correct. Maybe I am eating a 6000 calorie a day diet, which would result in such a weight gain. Or maybe there is another way to estimate the weight gain. Let's just see if we can make use of such information if it was available without worrying about the source of that information just yet.\n",
+"Let's try something crazy. Let's assume that I know I am gaining about one lb a day. It doesn't matter how I know that right now, assume I know it is approximately correct. Maybe I am eating a 6000 calorie a day diet, which would result in such a weight gain. Or maybe there is another way to estimate the weight gain. Let's see if we can make use of such information if it was available without worrying about the source of that information yet.\n",
 "\n",
-"The first measurement was 158. We have no way of knowing any different, so let's just accept that as our estimate. If our weight today is 158, what will it be tomorrow? Well, we think we are gaining weight at 1 lb/day, so our prediction is 159, like so:"
+"The first measurement was 158. We have no way of knowing any different, so let's accept that as our estimate. If our weight today is 158, what will it be tomorrow? Well, we think we are gaining weight at 1 lb/day, so our prediction is 159, like so:"
 ]
 },
 {
@@ -671,7 +671,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Okay, but what good is this? Sure, we could just assume the 1 lb/day is accurate, and just predict our weight for 10 days, but then why use a scale at all if we don't incorporate its readings? So let's look at the next measurement. We step on the scale again and it displays 164.2 lbs."
+"Okay, but what good is this? Sure, we could assume the 1 lb/day is accurate, and predict our weight for the next 10 days, but then why use a scale at all if we don't incorporate its readings? So let's look at the next measurement. We step on the scale again and it displays 164.2 lbs."
 ]
 },
 {
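One predict-then-update cycle with these numbers can be sketched as follows (the blending factor of 4/10 is an assumption here; the text chooses its own scaling later):

```python
estimate = 158.0   # accept the first measurement as our estimate
gain_rate = 1.0    # assumed weight gain, lbs/day

prediction = estimate + gain_rate        # predict tomorrow's weight: 159.0
measurement = 164.2                      # the new scale reading
residual = measurement - prediction      # 5.2 lbs disagreement

# New estimate lies part way between the prediction and the measurement.
estimate = prediction + (4 / 10) * residual
print(estimate)  # 161.08
```

Choosing any factor strictly between 0 and 1 keeps the new estimate between the prediction and the measurement, which is the whole idea being developed in this section.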
@@ -749,11 +749,11 @@
 "\n",
 "The difference between the measurement and prediction is called the *residual*, which is depicted by the black vertical line in the plot above. This will become an important value to use later on, as it is an exact computation of the difference between measurements and the filter's output. Smaller residuals imply better performance.\n",
 "\n",
-"Let's just code that up and see the result when we test it against the series of weights from above. We have to take into account one other factor. Weight gain has units of lbs/time, so to be general we will need to add a time step $t$, which we will set to 1 (day). \n",
+"Let's code that and see the results when we test it against the series of weights from above. We have to take into account one other factor. Weight gain has units of lbs/time, so to be general we will need to add a time step $t$, which we will set to 1 (day). \n",
 "\n",
 "I hand-generated the weight data to correspond to a true starting weight of 160 lbs, and a weight gain of 1 lb per day. In other words on the first day (day zero) the true weight is 160 lbs, on the second day (day one, the first day of weighing) the true weight is 161 lbs, and so on. \n",
 "\n",
-"We need to make a guess for the initial weight to feed into the filter. It is too early to talk about strategies for making that initial guess/estimate, so for now I will just assume 159 lbs."
+"We need to make a guess for the initial weight to feed into the filter. It is too early to talk about strategies for making that initial guess/estimate, so for now I will assume 159 lbs."
 ]
 },
 {
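The fixed-rate filter described in this hunk can be sketched as below. The measurement series is assumed (noisy readings around a true weight of 160 + day lbs, matching the description); the book's hand-generated data is not shown in this diff, and the 4/10 blending factor is likewise an assumption:

```python
# Assumed noisy measurements of a weight starting at 160 lbs, gaining 1 lb/day.
weights = [158.0, 164.2, 160.3, 159.9, 162.1, 164.6,
           169.6, 167.4, 166.4, 171.0, 171.2, 172.6]

time_step = 1.0    # day
gain_rate = 1.0    # assumed: we "know" we gain about 1 lb/day
estimate = 159.0   # initial guess, as in the text

estimates = []
for z in weights:
    # predict: project the estimate forward by the assumed gain
    prediction = estimate + gain_rate * time_step
    # update: move part way from the prediction toward the measurement
    estimate = prediction + (4 / 10) * (z - prediction)
    estimates.append(estimate)

print(estimates[-1])
```

Because the gain rate is fixed at 1 lb/day, the filter tracks this data reasonably well, but it has no way to adapt if the true rate of change shifts, which is exactly the weakness the text goes on to address.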
@@ -878,7 +878,7 @@
 "source": [
 "That is not so impressive. Clearly a filter that requires us to correctly guess a rate of change is not very useful. Even if our initial guess was correct, the filter will fail as soon as that rate of change changes. If I stop overeating the filter will have extreme difficulty in adjusting to that change. \n",
 "\n",
-"But, 'what if'? What if instead of just leaving the weight gain at the initial guess of 1 lb (or whatever), we compute it from the existing measurements and estimates. On day one our estimate for the weight is:\n",
+"But, 'what if'? What if instead of leaving the weight gain at the initial guess of 1 lb (or whatever), we compute it from the existing measurements and estimates. On day one our estimate for the weight is:\n",
 "\n",
 "$$\n",
 "(160 + 1) + \\frac{4}{10}(158-161) = 159.8\n",
@@ -886,7 +886,7 @@
 "\n",
 "On the next day we measure 164.2, which implies a weight gain of 4.4 lbs (since 164.2 - 159.8 = 4.4), not 1. Can we use this information somehow? It seems plausible. After all, the weight measurement itself is based on a real world measurement of our weight, so there is useful information. Our estimate of our weight gain may not be perfect, but it is surely better than just guessing our gain is 1 lb. Data is better than a guess, even if it is noisy.\n",
 "\n",
-"So, should we just set the new gain/day to 4.4 lbs? Yesterday we though the weight gain was 1 lb, today we think it is 4.4 lbs. We have two numbers, and want to combine them somehow. Hmm, sounds like our same problem again. Let's use our same tool, and the only tool we have so far - pick a value part way between the two. This time I will use another arbitrarily chosen number, $\\frac{1}{3}$. The equation is identical as for the weight estimate except we have to incorporate time because this is a rate (gain/day):\n",
+"So, should we set the new gain/day to 4.4 lbs? Yesterday we thought the weight gain was 1 lb, today we think it is 4.4 lbs. We have two numbers, and want to combine them somehow. Hmm, sounds like our same problem again. Let's use our same tool, and the only tool we have so far - pick a value part way between the two. This time I will use another arbitrarily chosen number, $\\frac{1}{3}$. The equation is identical to the one for the weight estimate except we have to incorporate time because this is a rate (gain/day):\n",
 "\n",
 "$$new gain = old gain + \\frac{1}{3}\\frac{measurement - predicted~weight}{1~ day}\n",
 "$$"
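Plugging the hunk's numbers into the gain-update equation (my own worked sketch; yesterday's estimate was 159.8 lbs with an assumed gain of 1 lb/day, and today's measurement is 164.2 lbs):

```python
old_gain = 1.0   # yesterday's believed gain, lbs/day
time_step = 1.0  # day

predicted_weight = 159.8 + old_gain * time_step  # 160.8
measurement = 164.2
residual = measurement - predicted_weight        # 3.4

# new gain = old gain + (1/3) * residual / time step
new_gain = old_gain + (1 / 3) * residual / time_step
print(new_gain)  # about 2.13 lbs/day
```

The factor of 1/3 pulls the gain only part of the way toward the value implied by the single noisy measurement, rather than trusting it outright.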
@@ -980,7 +980,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This filter is the basis for a huge number of filters, including the Kalman filter. In other words, the Kalman filter is a form of the g-h filter, which I will prove later in the book. So is the Least Squares filter, which you may have heard of, and so is the Benedict-Bordner filter, which you probably have not. Each filter has a different way of assigning values to *g* and *h*, but otherwise the algorithms are identical. For example, the $\\alpha$-$\\beta$ filter just assigns a constant to *g* and *h*, constrained to a certain range of values. Other filters such as the Kalman will vary *g* and *h* dynamically at each time step.\n",
+"This filter is the basis for a huge number of filters, including the Kalman filter. In other words, the Kalman filter is a form of the g-h filter, which I will prove later in the book. So is the Least Squares filter, which you may have heard of, and so is the Benedict-Bordner filter, which you probably have not. Each filter has a different way of assigning values to *g* and *h*, but otherwise the algorithms are identical. For example, the $\\alpha$-$\\beta$ filter assigns a constant to *g* and *h*, constrained to a certain range of values. Other filters such as the Kalman will vary *g* and *h* dynamically at each time step.\n",
 "\n",
 "**Let me repeat the key points as they are so important**. If you do not understand these you will not understand the rest of the book. If you do understand them, then the rest of the book will unfold naturally for you as mathematical elaborations of various 'what if' questions we will ask about *g* and *h*. The math may look profoundly different, but the algorithm will be exactly the same.\n",
 "\n",
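The shared algorithm this hunk describes can be written once with *g* and *h* as parameters. This is a sketch consistent with the description, not necessarily the book's exact implementation:

```python
def g_h_filter(data, x0, dx, g, h, dt=1.0):
    """Run a g-h filter over the measurements in `data`.

    x0: initial state estimate, dx: initial rate of change,
    g: scale factor applied to the state residual,
    h: scale factor applied to the rate residual.
    """
    x, estimates = x0, []
    for z in data:
        # predict: project the state forward using the current rate
        x_pred = x + dx * dt
        # update: blend the prediction with the measurement
        residual = z - x_pred
        dx = dx + h * residual / dt
        x = x_pred + g * residual
        estimates.append(x)
    return estimates

# Example run with assumed parameters and the first three weights from the text.
print(g_h_filter([158.0, 164.2, 160.3], x0=160.0, dx=1.0, g=6/10, h=2/3))
```

Every filter named above fits this skeleton; they differ only in how `g` and `h` are chosen (constants for the α-β filter, dynamically computed for the Kalman filter).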
@@ -1025,7 +1025,7 @@
 "\n",
 "Now consider trying to track a child's balloon in a hurricane. We have no legitimate model that would allow us to predict the balloon's behavior except over very brief time scales (we know the balloon cannot go 10 miles in 1 second, for example). In this case we would design a filter that emphasized the measurements over the predictions.\n",
 "\n",
-"Most of this book is devoted to expressing the concerns in the last three paragraphs mathematically, which then allows us to find an optimal solution. In this chapter we will merely be assigning different values to *g* and *h* in a more intuitive, and thus less optimal way. But the fundamental idea is just to blend somewhat inaccurate measurements with somewhat inaccurate models of how the systems behaves to get a filtered estimate that is better than either information source by itself.\n",
+"Most of this book is devoted to expressing the concerns in the last three paragraphs mathematically, which then allows us to find an optimal solution. In this chapter we will merely be assigning different values to *g* and *h* in a more intuitive, and thus less optimal way. But the fundamental idea is to blend somewhat inaccurate measurements with somewhat inaccurate models of how the system behaves to get a filtered estimate that is better than either information source by itself.\n",
 "\n",
 "We can express this as an algorithm:\n",
 "\n",
@@ -1078,7 +1078,7 @@
 "source": [
 "## NumPy arrays\n",
 "\n",
-"We will find NumPy's array data structure to be indispensible through the rest of the book, so let's take the time to learn about them now. I will teach you just enough to get started; refer to NumPy's documentation if you want to become an expert with them.\n",
+"We will find NumPy's array data structure to be indispensable through the rest of the book, so let's take the time to learn about it now. I will teach you enough to get started; refer to NumPy's documentation if you want to become an expert with it.\n",
 "\n",
 "`numpy.array` implements a one or more dimensional array. Its type is `numpy.ndarray`, and we will refer to this as an ndarray for short. You can construct it with any list-like object. The following constructs a 1-D array from a list:"
 ]
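The construction cell itself is omitted from the diff; the standard pattern is:

```python
import numpy as np

# Construct a 1-D ndarray from a Python list.
x = np.array([1.0, 2.0, 3.0])
print(type(x))   # <class 'numpy.ndarray'>
print(x.shape)   # (3,)
print(x * 2)     # element-wise arithmetic: [2. 4. 6.]
```

Unlike Python lists, arithmetic on an ndarray is applied element-wise, which is what makes it so useful for the filtering code in the rest of the book.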
@@ -1955,11 +1955,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now let's try a practical example. Earlier in the chapter we talked about tracking a train. Trains are heavy and slow, thus they cannot change speed quickly. They are on a track, so they cannot change direction except by slowing to a stop and then reversing course. Hence, we can conclude that if we already know the train's approximate position and velocity then we can predict its position in the near future with a great deal of accuracy. A train just cannot change its velocity much in a second or two. \n",
+"Now let's try a practical example. Earlier in the chapter we talked about tracking a train. Trains are heavy and slow, thus they cannot change speed quickly. They are on a track, so they cannot change direction except by slowing to a stop and then reversing course. Hence, we can conclude that if we already know the train's approximate position and velocity then we can predict its position in the near future with a great deal of accuracy. A train cannot change its velocity much in a second or two. \n",
 "\n",
 "So let's write a filter for a train. Its position is expressed as its position on the track in relation to some fixed point which we say is 0 km. I.e., a position of 1 means that the train is 1 km away from the fixed point. Velocity is expressed as meters per second. We perform measurement of position once per second, and the error is $\\pm$ 500 meters. How should we implement our filter?\n",
 "\n",
-"First, let's just simulate the situation without a filter. We will assume that the train is currently at kilometer 23, and moving at 15 m/s. We can code this as \n",
+"First, let's simulate the situation without a filter. We will assume that the train is currently at kilometer 23, and moving at 15 m/s. We can code this as \n",
 "\n",
 "    pos = 23*1000\n",
 "    vel = 15\n",
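A minimal unfiltered simulation along the lines this hunk describes might look like the following (my sketch; uniform noise within the stated ±500 m error is an assumption):

```python
import numpy as np

np.random.seed(1)
pos = 23 * 1000   # 23 km from the reference point, in meters
vel = 15          # meters per second

# One position measurement per second for 100 seconds, each corrupted
# by up to +/- 500 m of sensor noise.
zs = []
for t in range(100):
    pos += vel  # the train actually advances 15 m each second
    zs.append(pos + np.random.uniform(-500, 500))

# The true motion over 100 s is only 1500 m, so individual measurements
# can be off by a third of the total distance traveled.
print(zs[-1])
```

The raw measurements jump around by hundreds of meters even though the train's true motion is smooth, which motivates applying the g-h filter to this data.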
@@ -328,7 +328,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In Bayesian statistics this is called our **prior**. It basically means the probability prior to incorporating measurements or other information. More completely, this is called the **prior probability distribution**, but that is a mouthful and so it is normally shorted to *prior*. A **probability distribution** is just a collection of all possible probabilities for an event. Probability distributions always to sum to 1 because *something* had to happen; the distribution just lists all the different *somethings* and the probability of each.\n",
+"In Bayesian statistics this is called our **prior**. It basically means the probability prior to incorporating measurements or other information. More completely, this is called the **prior probability distribution**, but that is a mouthful and so it is normally shortened to *prior*. A **probability distribution** is a collection of all possible probabilities for an event. Probability distributions always sum to 1 because *something* had to happen; the distribution lists all the different *somethings* and the probability of each.\n",
 "\n",
 "I'm sure you've used probabilities before - as in \"the probability of rain today is 30%\". The last paragraph sounds like more of that. But Bayesian statistics was a revolution in probability because it treats the probability as a *belief* about a single event. Let's take an example. I know if I flip a fair coin infinitely many times 50% of the flips will be heads and 50% tails. That is standard probability, not Bayesian, and is called **frequentist statistics** to distinguish it from Bayes. Now, let's say I flip the coin one more time. Which way do I believe it landed? Frequentist probability has nothing to say about that; it will merely state that 50% of coin flips land as heads. Bayes treats this as a belief about a single event - the strength of my belief that *this specific* coin flip is heads is 50%.\n",
 "\n",
@@ -479,7 +479,7 @@
 "\n",
 "At first this may seem like an insurmountable problem. If the sensor is noisy it casts doubt on every piece of data. How can we conclude anything if we are always unsure?\n",
 "\n",
-"The answer, as with the problem above, is probabilities. We are already comfortable assigning a probabilistic belief about the location of the dog; now we just have to incorporate the additional uncertainty caused by the sensor noise. Lets say we get a reading of 'door'. We already said that the sensor is three times as likely to be correct as incorrect, so we should scale the probability distribution by 3 where there is a door. If we do that the result will no longer be a probability distribution, but we will learn how to correct that in a moment.\n",
+"The answer, as with the problem above, is probabilities. We are already comfortable assigning a probabilistic belief about the location of the dog; now we have to incorporate the additional uncertainty caused by the sensor noise. Let's say we get a reading of 'door'. We already said that the sensor is three times as likely to be correct as incorrect, so we should scale the probability distribution by 3 where there is a door. If we do that the result will no longer be a probability distribution, but we will learn how to correct that in a moment.\n",
 "\n",
 "Let's look at that in Python code. Here I use the variable `z` to denote the measurement as that is the customary choice in the literature (`y` is also commonly used). "
 ]
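The scale-then-normalize step described in this hunk can be sketched as follows. The hallway layout (10 positions with doors at 0, 1, and 8) is an assumption for illustration; the map is not shown in this excerpt:

```python
import numpy as np

# Assumed hallway map: 1 marks a door, 0 a wall.
hallway = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

belief = np.full(10, 0.1)  # uniform prior: dog equally likely anywhere

z = 1       # sensor reads 'door'
scale = 3.0 # sensor is 3x as likely to be correct as incorrect
belief[hallway == z] *= scale

# The scaled values no longer sum to 1, so normalize to restore a
# proper probability distribution (the correction the text promises).
belief /= belief.sum()
print(belief)
```

After normalizing, each door position carries probability 0.1875 and each wall position 0.0625, so the 'door' reading has shifted belief toward the doors without ruling anything out.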
@@ -613,11 +613,11 @@
 "source": [
 "Recall how quickly we were able to find an exact solution to our dog's position when we incorporated a series of measurements and movement updates. However, that occurred in a fictional world of perfect sensors. Might we be able to find an exact solution even in the presence of noisy sensors?\n",
 "\n",
-"Unfortunately, the answer is no. Even if the sensor readings perfectly match an extremely complicated hallway map we could not say that we are 100% sure that the dog is in a specific position - there is, after all, the possibility that every sensor reading was wrong! Naturally, in a more typical situation most sensor readings will be correct, and we might be close to 100% sure of our answer, but never 100% sure. This may seem complicated, but lets just go ahead and program the math, which as we have seen is quite simple.\n",
+"Unfortunately, the answer is no. Even if the sensor readings perfectly match an extremely complicated hallway map we could not say that we are 100% sure that the dog is in a specific position - there is, after all, the possibility that every sensor reading was wrong! Naturally, in a more typical situation most sensor readings will be correct, and we might be close to 100% sure of our answer, but never 100% sure. This may seem complicated, but let's go ahead and program the math, which as we have seen is quite simple.\n",
 "\n",
 "First let's deal with the simple case - assume the movement sensor is perfect, and it reports that the dog has moved one space to the right. How would we alter our `belief` array?\n",
 "\n",
-"I hope after a moment's thought it is clear that we should just shift all the values one space to the right. If we previously thought there was a 50% chance of Simon being at position 3, then after the move to the right we should believe that there is a 50% chance he is at position 4. So let's implement that. Recall that the hallway is circular, so we will use modulo arithmetic to perform the shift correctly."
+"I hope after a moment's thought it is clear that we should shift all the values one space to the right. If we previously thought there was a 50% chance of Simon being at position 3, then after the move to the right we should believe that there is a 50% chance he is at position 4. So let's implement that. Recall that the hallway is circular, so we will use modulo arithmetic to perform the shift correctly."
 ]
 },
 {
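The modulo shift described here can be sketched as follows (my own minimal version with an assumed belief array; the notebook's implementation cell is not part of this diff):

```python
import numpy as np

# Assumed belief over a 10-position circular hallway (sums to 1).
belief = np.array([0.0, 0.0, 0.0, 0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0])

# A perfect move of one space to the right: each probability moves to
# index (i + 1) mod n, wrapping from the end back to the beginning.
n = len(belief)
shifted = np.zeros(n)
for i in range(n):
    shifted[(i + 1) % n] = belief[i]

# np.roll performs the same wrap-around shift in one call.
assert np.allclose(shifted, np.roll(belief, 1))
print(shifted)
```

Note that a perfect shift only moves probability around; the distribution still sums to 1, and the 50% belief at position 3 now sits at position 4.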
@@ -672,7 +672,7 @@
 "source": [
 "We can see that we correctly shifted all values one position to the right, wrapping from the end of the array back to the beginning.\n",
 "\n",
-"You will often see this referred to as the state **evolution**, which is short for **time evolution** [7]. This term refers to how the state of the system changes over time. For filters, this time is usually discrete. For our dog tracking, the system state is the position of the dog, so the state evolution is just the position after a discrete amount of time has passed. I haven't defined 'state' or 'system' yet, but I hope it is clear in context. The **system** is what we are trying to model or filter, and the state is it's current configuration or value. Except in simulations we rarely know the actual state, so we say our filters produce the **estimated state** of the system. In practice this often just gets referred to as the state, so be careful to understand if the reference is to the state of the filter (which is the estimated state), or the state of the system (which is the actual state).\n",
+"You will often see this referred to as the state **evolution**, which is short for **time evolution** [7]. This term refers to how the state of the system changes over time. For filters, this time is usually discrete. For our dog tracking, the system state is the position of the dog, so the state evolution is the position after a discrete amount of time has passed. I haven't defined 'state' or 'system' yet, but I hope it is clear in context. The **system** is what we are trying to model or filter, and the state is its current configuration or value. Except in simulations we rarely know the actual state, so we say our filters produce the **estimated state** of the system. In practice this often gets called the state, so be careful to understand if the reference is to the state of the filter (which is the estimated state), or the state of the system (which is the actual state).\n",
 "\n",
 "More terminology - this prediction becomes our new *prior*. Time has moved forward and we made a prediction without benefit of knowing the measurements. "
|
||||
]
|
||||
@@ -688,7 +688,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We want to solve real world problems, and we have already stated that all sensors have noise. Therefore the code above must be wrong since it assumes we have perfect measurements. What if the sensor reported that our dog moved one space, but he actually moved two spaces, or zero? Once again this may initially sound like an insurmountable problem, but let's just model it in math. Since this is just an example, we will create a pretty simple noise model for the sensor - later in the book we will handle far more sophisticated errors.\n",
|
||||
"We want to solve real world problems, and we have already stated that all sensors have noise. Therefore the code above must be wrong since it assumes we have perfect measurements. What if the sensor reported that our dog moved one space, but he actually moved two spaces, or zero? Once again this may initially sound like an insurmountable problem, but let's model it and see what happens. Since this is an example we will create a simple noise model for the sensor - later in the book we will handle more difficult errors.\n",
|
||||
"\n",
|
||||
"We will say that when the sensor sends a movement update, it is 80% likely to be right, and it is 10% likely to overshoot one position to the right, and 10% likely to undershoot to the left. That is, if we say the movement was 4 (meaning 4 spaces to the right), the dog is 80% likely to have moved 4 spaces to the right, 10% to have moved 3 spaces, and 10% to have moved 5 spaces.\n",
|
||||
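One way to sketch this noise model in code (a hypothetical helper; the chapter's actual implementation may differ):

```python
import numpy as np

def predict_move(belief, move, p_under, p_correct, p_over):
    """Apply a noisy move: weight each destination by the chance of
    moving exactly, overshooting, or undershooting by one space."""
    n = len(belief)
    prior = np.zeros(n)
    for i in range(n):
        prior[i] = (belief[(i - move) % n] * p_correct +
                    belief[(i - move - 1) % n] * p_over +    # overshot by one space
                    belief[(i - move + 1) % n] * p_under)    # undershot by one space
    return prior

belief = np.zeros(10)
belief[3] = 1.0                                # certain the dog starts at position 3
print(predict_move(belief, 4, .1, .8, .1))     # 80% at 7, 10% at 6, 10% at 8
```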
"\n",
|
||||
@@ -878,7 +878,7 @@
|
||||
"\n",
|
||||
"where $f\\ast g$ is the notation for convolving f by g. It does not mean multiply.\n",
|
||||
"\n",
|
||||
"Integrals are for continuous functions, but we are using discrete functions, so we just replace the integral with a summation, and the parenthesis with array brackets.\n",
|
||||
"Integrals are for continuous functions, but we are using discrete functions, so we replace the integral with a summation, and the parenthesis with array brackets.\n",
|
||||
"\n",
|
||||
"$$ (f \\ast g) [t] = \\sum\\limits_{\\tau=0}^t \\!f[\\tau] \\, g[t-\\tau]$$\n",
|
||||
"\n",
|
||||
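Numerically, this summation is what `np.convolve` computes. A quick check with the same 80/10/10 kernel used above (this sketch ignores wrap-around at the array edges, which is fine when the probability mass sits in the interior):

```python
import numpy as np

belief = np.zeros(10)
belief[3] = 1.0
kernel = np.array([.1, .8, .1])      # undershoot, exact, overshoot

# shift by the reported move of 4, then convolve with the error kernel
prior = np.convolve(np.roll(belief, 4), kernel, mode='same')
print(prior)                          # mass lands at 6, 7, 8 as .1, .8, .1
```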
@@ -1196,7 +1196,7 @@
|
||||
"source": [
|
||||
"Here things have degraded a bit due to the long string of wall positions in the map. We cannot be as sure where we are when there is an undifferentiated line of wall positions, so naturally our probabilities spread out a bit.\n",
|
||||
"\n",
|
||||
"I spread the computation across several cells, but I am just iteratively calling `predict()` and `update()`. This chart from the **g-h Filter** chapter illustrates the algorithm."
|
||||
"I spread the computation across several cells, iteratively calling `predict()` and `update()`. This chart from the **g-h Filter** chapter illustrates the algorithm."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1226,7 +1226,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This filter is just a form of the g-h filter. Here we are using the percentages for the errors to implicitly compute the *g* and *h* parameters. We could express the discrete Bayes algorithm as a g-h filter, but that would obscure the logic of this filter.\n",
|
||||
"This filter is a form of the g-h filter. Here we are using the percentages for the errors to implicitly compute the *g* and *h* parameters. We could express the discrete Bayes algorithm as a g-h filter, but that would obscure the logic of this filter.\n",
|
||||
"\n",
|
||||
"We can express this in pseudocode. I am going to u that won't immediately strike you as being identical to what we have done in this chapter.\n",
|
||||
"\n",
|
||||
@@ -1246,7 +1246,7 @@
|
||||
" 3. Determine whether whether the measurement matches each state\n",
|
||||
" 4. Update state belief if it matches the measurement\n",
|
||||
"\n",
|
||||
"When we cover the Kalman filter we will use this exact same algorithm; just the details of the computation will differ. "
|
||||
"When we cover the Kalman filter we will use this exact same algorithm; only the details of the computation will differ. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1517,7 +1517,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With that we are ready to write the simulation. We will put it in a function so that we can run it with different assumptions. I will assume that the robot always starts at the beginning of the track. The track is implemented as being only 10 units long, but this could just represent a track of length, say 10,000, with the magnet pattern repeated every 10 units. I set the magnet pattern to first have a gap of 1 unit, then 2 units, then 3 units to help the algorithm distinguish where the robot is. "
|
||||
"With that we are ready to write the simulation. We will put it in a function so that we can run it with different assumptions. I will assume that the robot always starts at the beginning of the track. The track is implemented as being 10 units long, but think of it as a track of length, say 10,000, with the magnet pattern repeated every 10 units. 10 makes it easier to plot and inspect. I set the magnet pattern to first have a gap of 1 unit, then 2 units, then 3 units to help the algorithm distinguish where the robot is. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1753,7 +1753,7 @@
|
||||
"\n",
|
||||
"$$P(x_i|Z) = \\frac{P(Z|x_i) P(x_i)}{P(Z)}$$\n",
|
||||
"\n",
|
||||
"That looks ugly, but it is actually quite simple. Let's just figure out what each term on the right means. First is $P(Z|x_i)$. This is the probability for the measurement at every cell $x_i$. $P(x_i)$ is the *prior* - our belief before incorporating the measurements. We multiply those together. This is just the unnormalized multiplication in the `update()` function, where `belief` is our prior $P(x_i)$.\n",
|
||||
"That looks ugly, but it is actually quite simple. Let's figure out what each term on the right means. First is $P(Z|x_i)$. This is the probability for the measurement at every cell $x_i$. $P(x_i)$ is the *prior* - our belief before incorporating the measurements. We multiply those together. This is just the unnormalized multiplication in the `update()` function, where `belief` is our prior $P(x_i)$.\n",
|
||||
" \n",
|
||||
"\n",
|
||||
" for i, val in enumerate(map_):\n",
|
||||
@@ -1768,7 +1768,7 @@
|
||||
"\n",
|
||||
"$$ \\mathtt{posterior} = \\frac{\\mathtt{prior}\\times \\mathtt{evidence}}{\\mathtt{normalization}}$$ \n",
|
||||
"\n",
|
||||
"That is just the Bayes theorem written in words instead of mathematical symbols. I could have just given you Bayes theorem and then written a function, but I doubt that would have been illuminating unless you already know Bayesian statistics. Instead, we figured out what to do just by reasoning about the situation, and so of course the resulting code ended up implementing Bayes theorem. Students spend a lot of time struggling to understand this theorem; I hope you found it relatively straightforward."
|
||||
"That is the Bayes theorem written in words instead of mathematical symbols. I could have given you Bayes theorem and then written a function, but I doubt that would have been illuminating unless you already know Bayesian statistics. Instead, we figured out what to do just by reasoning about the situation, and so of course the resulting code ended up implementing Bayes theorem. Students spend a lot of time struggling to understand this theorem; I hope you found it relatively straightforward."
|
||||
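The words translate almost directly into code. A sketch, using a made-up hallway map and an assumed likelihood ratio for the sensor:

```python
import numpy as np

def update(likelihood, prior):
    """Bayes: posterior is prior times evidence, normalized to sum to one."""
    posterior = likelihood * prior
    return posterior / sum(posterior)

hallway = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])   # 1 = door (illustrative map)
prior = np.full(10, 0.1)                              # no idea where the dog is
# sensor reads 'door'; assume it is 3x more likely to be correct than not
likelihood = np.where(hallway == 1, 3.0, 1.0)
print(update(likelihood, prior))
```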
]
|
||||
},
|
||||
{
|
||||
@@ -1783,7 +1783,7 @@
|
||||
"\n",
|
||||
"$$P(X_i^t) = \\sum_j P(X_i^{t-1}) P(x_i | x_j)$$\n",
|
||||
"\n",
|
||||
"That equation is called the *total probability theorem*. Quoting from Wikipedia [6] \"It expresses the total probability of an outcome which can be realized via several distinct events\". Again, I could have just given you that equation and implemented `predict()`, but your chances of understanding why the equation works would be slim. As a reminder, here is the code that computes this equation\n",
|
||||
"That equation is called the *total probability theorem*. Quoting from Wikipedia [6] \"It expresses the total probability of an outcome which can be realized via several distinct events\". I could have given you that equation and implemented `predict()`, but your chances of understanding why the equation works would be slim. As a reminder, here is the code that computes this equation\n",
|
||||
"\n",
|
||||
" for i in range(N):\n",
|
||||
" for k in range (kN):\n",
|
||||
|
||||
@@ -404,7 +404,7 @@
|
||||
"\n",
|
||||
"where $\\sigma$ is the notation for the standard deviation and $\\mu$ is the mean.\n",
|
||||
"\n",
|
||||
"If this is the first time you have seen this it may not have a lot of meaning for you. But let's just work through that with the data from the three classes to be sure we understand the formula. We just subtract the mean of x from each value of x, square it, take the average of those, and then take the square root of the result. The mean of $[1.8, 2.0, 1.7, 1.9, 1.6]$ is 1.8, so we compute the standard deviation as\n",
|
||||
"If this is the first time you have seen this it may not have a lot of meaning for you. But let's work through that with the data from the three classes to be sure we understand the formula. We subtract the mean of x from each value of x, square it, take the average of those, and then take the square root of the result. The mean of $[1.8, 2.0, 1.7, 1.9, 1.6]$ is 1.8, so we compute the standard deviation as\n",
|
||||
"\n",
|
||||
"$$ \n",
|
||||
"\\begin{aligned}\n",
|
||||
@@ -657,7 +657,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Many texts alternatively use *VAR(x)* to denote the variance of x, and I opted to use that here just to expose you to that convention."
|
||||
"Many texts alternatively use *VAR(x)* to denote the variance of x."
|
||||
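As a quick numerical check of these definitions, using the class data from earlier in the chapter:

```python
import numpy as np

x = [1.8, 2.0, 1.7, 1.9, 1.6]
mean = sum(x) / len(x)                              # 1.8
var = sum((xi - mean)**2 for xi in x) / len(x)      # ~0.02
std = var ** 0.5                                    # ~0.1414
print(mean, var, std)
print(np.std(x))                                    # NumPy agrees (population std)
```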
]
|
||||
},
|
||||
{
|
||||
@@ -666,7 +666,7 @@
|
||||
"source": [
|
||||
"### Why the Square of the Differences\n",
|
||||
"\n",
|
||||
"As an aside, why are we taking the *square* of the difference? I could go into a lot of math, but let's just look at this in a simple way. Here is a chart of the values of x plotted against the mean for $x=[3,-3,3,-3]$\n"
|
||||
"As an aside, why are we taking the *square* of the difference? I could go into a lot of math, but let's look at this in a simple way. Here is a chart of the values of x plotted against the mean for $x=[3,-3,3,-3]$\n"
|
||||
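A few lines of arithmetic make the point: without the square, the positive and negative deviations cancel.

```python
x = [3, -3, 3, -3]
mean = sum(x) / len(x)                               # 0.0
avg_deviation = sum(xi - mean for xi in x) / len(x)  # 0.0 -- the errors cancel
variance = sum((xi - mean)**2 for xi in x) / len(x)  # 9.0 -- squaring keeps them
print(avg_deviation, variance)
```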
]
|
||||
},
|
||||
{
|
||||
@@ -869,7 +869,7 @@
|
||||
"\n",
|
||||
"You may object that human heights or automobile speeds cannot be less than zero, let alone $-\\infty$ or $-\\infty$. This is true, but this is a common limitation of mathematical modeling. \"The map is not the territory\" is a common expression, and it is true for Bayesian filtering and statistics. The Gaussian distribution above somewhat closely models the distribution of the measured automobile speeds, but being a model it is necessarily imperfect. The difference between model and reality will come up again and again in these filters. Gaussians are used in many branches of mathematics, not because they perfectly model reality, but because they are easier to use than any other choice. Even in this book Gaussians will fail to model reality, forcing us to computationally expensive alternative. \n",
|
||||
"\n",
|
||||
"You will see these distributions called *Gaussian distributions* or *normal distributions*. *Gaussian* and *normal* both mean the same thing in this context, and are used interchangeably. I will use both throughout this book as different sources will use either term, so I want you to be used to seeing both. Finally, as in this paragraph, it is typical to shorten the name and just talk about a *Gaussian* or *normal* - these are both typical shortcut names for the *Gaussian distribution*. "
|
||||
"You will see these distributions called *Gaussian distributions* or *normal distributions*. *Gaussian* and *normal* both mean the same thing in this context, and are used interchangeably. I will use both throughout this book as different sources will use either term, so I want you to be used to seeing both. Finally, as in this paragraph, it is typical to shorten the name and talk about a *Gaussian* or *normal* - these are both typical shortcut names for the *Gaussian distribution*. "
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -276,7 +276,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now that we understand the discrete Bayes filter and Gaussians we are prepared to implement a 1D Kalman filter. We will do this exactly as we did the discrete Bayes filter - rather than starting with equations we will just develop the code step by step based on reasoning about the problem."
|
||||
"Now that we understand the discrete Bayes filter and Gaussians we are prepared to implement a 1D Kalman filter. We will do this exactly as we did the discrete Bayes filter - rather than starting with equations we will develop the code step by step based on reasoning about the problem. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -290,7 +290,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As in the Discrete Bayes chapter we will be tracking a moving object in a long hallway at work, such as a dog or robot. However, assume that in our latest hackathon someone created an RFID tracker that provides a reasonably accurate position for our dog. Suppose the hallway is 100 meters long. The sensor returns the distance of the dog from the left end of the hallway in meters. So, 23.4 would mean the dog is 23.4 meters from the left end of the hallway.\n",
|
||||
"As in the *Discrete Bayes Filter* chapter we will be tracking a moving object in a long hallway at work, such as a dog or robot. However, assume that in our latest hackathon someone created an RFID tracker that provides a reasonably accurate position for our dog. Suppose the hallway is 100 meters long. The sensor returns the distance of the dog from the left end of the hallway in meters. So, 23.4 would mean the dog is 23.4 meters from the left end of the hallway.\n",
|
||||
"\n",
|
||||
"Naturally, the sensor is not perfect. A reading of 23.4 could correspond to a real position of 23.7, or 23.0. However, it is very unlikely to correspond to a real position of say 47.6. Testing during the hackathon confirmed this result - the sensor is 'reasonably' accurate, and while it had errors, the errors are small. Furthermore, the errors seemed to be evenly distributed on both sides of the measurement; a true position of 23 m would be equally likely to be measured as 22.9 as 23.1.\n",
|
||||
"\n",
|
||||
@@ -819,7 +819,7 @@
|
||||
"\n",
|
||||
"If we think of the Gaussians as two measurements, this makes sense. If I measure twice and get 23 meters each time, I should conclude that the length is close to 23 meters. Thus the mean should be 23. I am more confident with two measurements than with one, so the variance of the result should be smaller. \n",
|
||||
"\n",
|
||||
"\"Measure twice, cut once\" is a useful saying and practice due to this fact! The Gaussian is just a mathematical model of this physical fact, so we should expect the math to follow our physical process. \n",
|
||||
"\"Measure twice, cut once\" is a useful saying and practice due to this fact! The Gaussian is a mathematical model of this physical fact, so we should expect the math to follow our physical process. \n",
|
||||
"\n",
|
||||
"Now let's multiply two Gaussians (or equivalently, two measurements) that are partially separated. In other words, their means will be different, but their variances will be the same. What do you think the result will be? Think about it, and then look at the graph."
|
||||
]
|
||||
@@ -866,7 +866,7 @@
|
||||
"source": [
|
||||
"Another beautiful result! If I handed you a measuring tape and asked you to measure the distance from table to a wall, and you got 23 meters, and then a friend make the same measurement and got 25 meters, your best guess must be 24 meters if you trust the skills of your friends equally.\n",
|
||||
"\n",
|
||||
"That is fairly counter-intuitive in more complicated situations, so let's consider it further. Perhaps a more reasonable assumption would be that either you or your coworker just made a mistake, and the true distance is either 23 or 25, but certainly not 24. Surely that is possible. However, suppose the two measurements you reported as 24.01 and 23.99. In that case you would agree that in this case the best guess for the correct value is 24? Which interpretation we choose depends on the properties of the sensors we are using.\n",
|
||||
"That is fairly counter-intuitive in more complicated situations, so let's consider it further. Perhaps a more reasonable assumption would be that either you or your coworker made a mistake, and the true distance is either 23 or 25, but certainly not 24. Surely that is possible. However, suppose the two measurements you reported as 24.01 and 23.99. In that case you would agree that in this case the best guess for the correct value is 24? Which interpretation we choose depends on the properties of the sensors we are using.\n",
|
||||
"\n",
|
||||
"This topic is fairly deep, and I will explore it once we have completed our Kalman filter. For now I will merely say that the Kalman filter requires the interpretation that measurements are accurate, with Gaussian noise, and that a large error caused by misreading a measuring tape is not Gaussian noise.\n",
|
||||
"\n",
|
||||
@@ -923,7 +923,7 @@
|
||||
"\n",
|
||||
"Of course, this example is quite exaggerated. The width of the Gaussians is fairly narrow, so this combination of measurements is extremely unlikely. It is hard to eyeball this, but the measurements are well over $3\\sigma$ apart, so the probability of this happening is less than 1%. Still, it can happen, and the math is correct. In practice this is the sort of thing we use to decide if \n",
|
||||
"\n",
|
||||
"Let's look at the math again to convince ourselves that the physical interpretation of the Gaussian equations makes sense. I'm going to switch back to using priors and measurements. The math and reasoning is the same whether you are using a prior and incorporating a measurement, or just trying to compute the mean and variance of two measurements. For Kalman filters we will be doing a lot more of the former than the latter, so let's just get used to it.\n",
|
||||
"Let's look at the math again to convince ourselves that the physical interpretation of the Gaussian equations makes sense. I'm going to switch back to using priors and measurements. The math and reasoning is the same whether you are using a prior and incorporating a measurement, or just trying to compute the mean and variance of two measurements. For Kalman filters we will be doing a lot more of the former than the latter, so let's get used to it.\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"\\mu=\\frac{\\sigma_\\mathtt{prior}^2 \\mu_\\mathtt{z} + \\sigma_\\mathtt{z}^2 \\mu_\\mathtt{prior}} {\\sigma_\\mathtt{prior}^2 + \\sigma_\\mathtt{z}^2}\n",
|
||||
@@ -934,13 +934,13 @@
|
||||
"$$\\mu=\\frac{\\sigma_\\mathtt{z}^2(\\mu_\\mathtt{prior} + \\mu_\\mathtt{z})}{2\\sigma_\\mathtt{z}^2}\\\\\n",
|
||||
"= \\frac{1}{2}(\\mu_\\mathtt{prior} + \\mu_\\mathtt{z})$$\n",
|
||||
"\n",
|
||||
"which is just the average of the two means. If we look at the extreme cases, assume the first scale is very much more accurate than than the second one. At the limit, we can set \n",
|
||||
"which is the average of the two means. If we look at the extreme cases, assume the first scale is very much more accurate than than the second one. At the limit, we can set \n",
|
||||
"$\\sigma_\\mathtt{prior}^2=0$, yielding\n",
|
||||
"\n",
|
||||
"$$\n",
|
||||
"\\begin{aligned}\n",
|
||||
"\\mu&=\\frac{0*\\mu_\\mathtt{z} + \\sigma_\\mathtt{z}^2 \\mu_\\mathtt{prior}} { \\sigma_\\mathtt{z}^2}, \\\\\n",
|
||||
"\\text{or just}\\\\\n",
|
||||
"\\text{or}\\\\\n",
|
||||
"\\mu&=\\mu_\\mathtt{prior}\n",
|
||||
"\\end{aligned}\n",
|
||||
"$$\n",
|
||||
@@ -950,7 +950,7 @@
|
||||
"$$\n",
|
||||
"\\begin{aligned}\n",
|
||||
"\\mu&=\\frac{9 \\sigma_\\mathtt{z}^2 \\mu_2 + \\sigma_\\mathtt{z}^2 \\mu_\\mathtt{prior}} {9 \\sigma_\\mathtt{z}^2 + \\sigma_\\mathtt{z}^2} \\\\\n",
|
||||
"\\text{or just}\\\\\n",
|
||||
"\\text{or}\\\\\n",
|
||||
"\\mu&= \\frac{1}{10} \\mu_\\mathtt{prior} + \\frac{9}{10} \\mu_\\mathtt{z}\n",
|
||||
"\\end{aligned}\n",
|
||||
"$$\n",
|
||||
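These limiting cases are easy to verify numerically. A small sketch of the multiplication equations (function name is illustrative):

```python
def gaussian_multiply(mu1, var1, mu2, var2):
    """Multiply two Gaussians; the (unnormalized) product is again Gaussian."""
    mean = (var1 * mu2 + var2 * mu1) / (var1 + var2)
    variance = (var1 * var2) / (var1 + var2)
    return mean, variance

# equal variances: the mean is the average, and the variance shrinks
print(gaussian_multiply(23., 5., 25., 5.))        # (24.0, 2.5)

# one measurement 9x noisier: the result is pulled 9/10 toward the accurate one
print(gaussian_multiply(23., 1., 25., 9.))
```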
@@ -1152,7 +1152,7 @@
|
||||
" \n",
|
||||
"In a nutshell, we shift the probability vector by the amount we believe the animal moved, and adjust the probability. How do we do that with Gaussians?\n",
|
||||
"\n",
|
||||
"It turns out that we just add Gaussians. Think of the case without Gaussians. I think my dog is at 7.3 meters, and he moves 2.6 meters to right, where is he now? Obviously, 7.3 + 2.6 = 9.9. He is at 9.9 meters. Abstractly, the algorithm is `new_pos = old_pos + dist_moved`. It does not matter if we use floating point numbers or Gaussians for these values, the algorithm must be the same. \n",
|
||||
"It turns out that we add them. Think of the case without Gaussians. I think my dog is at 7.3 meters, and he moves 2.6 meters to right, where is he now? Obviously, 7.3 + 2.6 = 9.9. He is at 9.9 meters. Abstractly, the algorithm is `new_pos = old_pos + dist_moved`. It does not matter if we use floating point numbers or Gaussians for these values, the algorithm must be the same. \n",
|
||||
"\n",
|
||||
"How is addition for Gaussians performed? It turns out to be very simple:\n",
|
||||
"\n",
|
||||
@@ -1174,7 +1174,7 @@
|
||||
"All we do is add the means and the variance separately! Does that make sense? Think of the physical representation of this abstract equation.\n",
|
||||
"${\\mu}_\\mathtt{before}$ is the old position, and ${\\mu}_\\mathtt{movement}$ is how far we moved. Surely it makes sense that our new position is $\\mu_\\mathtt{before} + \\mu_\\mathtt{movement}$.\n",
|
||||
"\n",
|
||||
"What about the variance? It is perhaps harder to form an intuition about this. However, recall that with the `predict()` function for the discrete Bayes filter we always lost information - our confidence after the update was lower than our confidence before the update. We don't really know where the dog is moving, so the confidence should get smaller (variance gets larger). $\\sigma^2_\\mathtt{movement}$ is just the amount of uncertainty added to the system due to the imperfect prediction about the movement, and so we would just add that to the existing uncertainty. \n",
|
||||
"What about the variance? It is perhaps harder to form an intuition about this. However, recall that with the `predict()` function for the discrete Bayes filter we always lost information - our confidence after the update was lower than our confidence before the update. We don't really know where the dog is moving, so the confidence should get smaller (variance gets larger). $\\sigma^2_\\mathtt{movement}$ is the amount of uncertainty added to the system due to the imperfect prediction about the movement, and so we would add that to the existing uncertainty. \n",
|
||||
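In code the prediction step is one line of arithmetic. A sketch, with illustrative names and each Gaussian represented as a (mean, variance) tuple:

```python
def predict(pos, movement):
    """Add the movement Gaussian to the position Gaussian, component-wise."""
    return (pos[0] + movement[0], pos[1] + movement[1])

pos = (7.3, 0.5)                 # believe the dog is at 7.3 m, variance 0.5
movement = (2.6, 0.2)            # moved 2.6 m, with movement uncertainty 0.2
print(predict(pos, movement))    # mean ~9.9, variance ~0.7
```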
"\n",
|
||||
"I assure you that the equation for Gaussian addition is correct, and derived by basic algebra. Therefore it is reasonable to expect that if we are using Gaussians to model physical events, the results must correctly describe those events.\n",
|
||||
"\n",
|
||||
@@ -1320,7 +1320,7 @@
|
||||
"\n",
|
||||
" velocity = 1.\n",
|
||||
" \n",
|
||||
"So how do I know the velocity? Magic? Consider it a prediction, or perhaps we have a secondary velocity sensor. If this is a robot then this would be a control input to the robot. In the next chapter we will learn how to handle situations where you don't have a velocity sensor or input, so please just accept this simplification for now.\n",
|
||||
"So how do I know the velocity? Magic? Consider it a prediction, or perhaps we have a secondary velocity sensor. If this is a robot then this would be a control input to the robot. In the next chapter we will learn how to handle situations where you don't have a velocity sensor or input, so please accept this simplification for now.\n",
|
||||
"\n",
|
||||
" process_variance = 1.\n",
|
||||
" sensor_variance = 2.\n",
|
||||
@@ -1343,7 +1343,7 @@
|
||||
" \n",
|
||||
"It may seem very 'convenient' to set the simulator to the same position as our guess, and it is. Do not fret. In the next example we will see the effect of a wildly inaccurate guess for the dog's initial position. We move the dog for 10 steps forward and store the measurements in `zs` via a list comprehension.\n",
|
||||
"\n",
|
||||
"It is traditional to call the measurement $z$, and so I follow that convention here. When I was learning this topic I found the standard names very obscure. However, if you wish to read the literature you will have to become used to them, so I will not use a more readable variable name such as `m` or `measure`. It is just something you have to memorize if you want to work with Kalman filters. As we continue I will introduce more of the standard naming into the code. The `s` at the end of the name is just a plural - I'm saying there are many `z`s in the list.\n",
|
||||
"It is traditional to call the measurement $z$, and so I follow that convention here. When I was learning this topic I found the standard names very obscure. However, if you wish to read the literature you will have to become used to them, so I will not use a more readable variable name such as `m` or `measure`. It is something you have to memorize if you want to work with Kalman filters. As we continue I will introduce more of the standard naming into the code. The `s` at the end of the name is a plural - I'm saying there are many `z`s in the list.\n",
|
||||
"\n",
|
||||
"The next code allocates an array to store the output of the filtered positions. \n",
|
||||
"\n",
|
||||
@@ -1354,7 +1354,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we just enter our `predict() ... update()` loop.\n",
|
||||
"Now we enter our `predict() ... update()` loop.\n",
|
||||
"\n",
|
||||
" for i in range(N):\n",
|
||||
" pos = predict(pos=pos[0], variance=pos[1], \n",
|
||||
@@ -1379,7 +1379,7 @@
|
||||
"\n",
|
||||
"Now look at the variance: 1.9901 m$^2$. It has dropped tremendously from 401 m$^2$. Why? Well, the RFID has a reasonably small variance of 2.0 m$^2$, so we trust it far more than our previous belief.\n",
|
||||
"\n",
|
||||
"Now the software just loops, calling `predict()` and `update()` in turn. By the end the final estimated position is 15.0529 vs the actual position of 14.8383. The variance has converged to 1.0 m$^2$. \n",
|
||||
"Now the software loops, calling `predict()` and `update()` in turn. By the end the final estimated position is 15.0529 vs the actual position of 14.8383. The variance has converged to 1.0 m$^2$. \n",
|
||||
"\n",
|
||||
"Now look at the plot. The noisy measurements are plotted in with a dotted red line, and the filter results are in the solid blue line. Both are quite noisy, but notice how much noisier the measurements (red line) are. This is your first Kalman filter and it seems to work!\n",
|
||||
"\n",
|
||||
@@ -1564,7 +1564,7 @@
|
||||
"\n",
|
||||
"$$h = \\frac{COV (x,\\dot{x})}{\\sigma^2_\\mathtt{z}}$$\n",
|
||||
"\n",
|
||||
"We haven't discussed the covariance (denoted by $COV$), so you may not fully understand this equation. Just think of the numerator as the variance between the state and its velocity. Recall that $g$ is used to scale the state, and $h$ is used to scale the velocity of the state, and this equation should seem reasonable. In the next chapter it will become very clear.\n",
|
||||
"We haven't discussed the covariance (denoted by $COV$), so you may not fully understand this equation. Think of the numerator as the variance between the state and its velocity. Recall that $g$ is used to scale the state, and $h$ is used to scale the velocity of the state, and this equation should seem reasonable. In the next chapter it will become very clear.\n",
|
||||
"\n",
|
||||
"The takeaway point is that *g* and *h* are specified fully by the variance and covariances of the measurement and predictions at time *n*. That is all the Kalman filter is. It is easier to think of the noise in the measurement and the process model, so we don't normally use the g-h formulation, but we could. It's the same thing written differently. The Kalman filter is a g-h filter."
|
||||
]
|
||||
@@ -1607,7 +1607,7 @@
|
||||
" self.measurement_variance = measurement_variance\n",
|
||||
" self.movement_variance = movement_variance\n",
|
||||
"\n",
|
||||
"That works, but I am going to use different naming. The Kalman filter literature uses math, not code, so it uses single letter variables, so I will start exposing you to it now. At first it seems impossibly terse, but as you become familiar with the nomenclature you'll see that the math formulas in the textbooks will have an exact one-to-one correspondence with the code. Unfortunately there is not a lot of meaning behind the naming unless you have training in control theory; you will just have to memorize them. If you do not make this effort you will never be able to read the literature.\n",
|
||||
"That works, but I am going to use different naming. The Kalman filter literature uses math, not code, so it uses single letter variables, so I will start exposing you to it now. At first it seems impossibly terse, but as you become familiar with the nomenclature you'll see that the math formulas in the textbooks will have an exact one-to-one correspondence with the code. Unfortunately there is not a lot of meaning behind the naming unless you have training in control theory; you will have to memorize them. If you do not make this effort you will never be able to read the literature.\n",
"\n",
"So, we use `x` for the state (estimated value of the filter) and `P` for the variance of the state. `R` is the measurement error, and `Q` is the process (prediction) error. This gives us:\n",
"\n",
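The naming convention can be made concrete with a small sketch. This is not the notebook's class, just an illustration of the same letters applied to the scalar filter developed in this chapter:

```python
def update(x, P, z, R):
    """Blend the prior (x, P) with a measurement z of variance R."""
    x = (P * z + R * x) / (P + R)   # weighted average of estimate and measurement
    P = (P * R) / (P + R)           # combined variance is always smaller
    return x, P

def predict(x, P, u, Q):
    """Shift the state by the movement u; uncertainty grows by Q."""
    return x + u, P + Q

x, P = 0., 500.                     # poor initial guess, large variance
x, P = update(x, P, z=160., R=9.)   # estimate is pulled hard toward z
```

Because `P` starts so much larger than `R`, the first update moves the estimate almost all the way to the measurement.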
@@ -1731,7 +1731,7 @@
"\n",
"**This is extremely important to understand**: Every filter in this book implements the **same algorithm**, just with different mathematical details. We make a prediction, make a measurement, and then choose the new estimate to be somewhere between those two. The math can become challenging in later chapters, but the idea is easy to understand.\n",
"\n",
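That predict/measure/combine cycle can be sketched generically. The `predict` and `update` callables below are placeholders for whatever math a particular filter supplies; the toy instantiation uses a fixed blending gain of 0.5:

```python
def run_filter(measurements, x, predict, update):
    """The cycle shared by every filter in the book: predict, then blend."""
    estimates = []
    for z in measurements:
        x = predict(x)        # project the state forward
        x = update(x, z)      # choose a value between prediction and measurement
        estimates.append(x)
    return estimates

# toy instantiation: static prediction, blend halfway toward each measurement
ests = run_filter([1., 2., 3.], 0.,
                  predict=lambda x: x,
                  update=lambda x, z: x + 0.5 * (z - x))
```

Every filter named in the surrounding text swaps in its own `predict` and `update` math; the loop itself does not change.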
"It is important to see past the details of the equations of a specific filter and understand *what* the equations are calculating and *why*. This is important not just so you can understand the filters in this book, but so that you can understand filters that you will see in other situations. There are a tremendous number of filters that we do not describe - the SVD filter, the unscented particle filter, the H$_\\infty$ filter, the ensemble filter, and so many more. They all use different math to implement the same algorithm. The choice of math affects the quality of results, not the underlying ideas.\n",
"\n",
"If you understand why this algorithm works you will be able to understand most any filter you encounter. In other words, this is not just the algorithm for the Kalman filter, it is the algorithm used for any Bayesian filter, and for many non-Bayesian ones as well.\n",
"It is important to see past the details of the equations of a specific filter and understand *what* the equations are calculating and *why*. This is important not only so you can understand the filters in this book, but so that you can understand filters that you will see in other situations. There are a tremendous number of filters that we do not describe - the SVD filter, the unscented particle filter, the H$_\\infty$ filter, the ensemble filter, and so many more. They all use different math to implement the same algorithm. The choice of math affects the quality of results, not the underlying ideas.\n",
"\n",
"Here it is:\n",
"\n",
@@ -1812,7 +1810,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We generate white noise with a given variance using the equation `randn() * variance`. The specification gives us the standard deviation of the noise, not the variance, but recall that variance is just the square of the standard deviation. Hence we raise 0.13 to the second power."
"We generate white noise with a given standard deviation using the equation `randn() * std`. The filter, however, is parameterized with the variance, and variance is the square of the standard deviation. Hence we raise 0.13 to the second power."
]
},
{
@@ -1821,7 +1819,7 @@
"source": [
"Now we need to write the Kalman filter processing loop. As with our previous problem, we need to perform a cycle of predicting and updating. The sensing step probably seems clear - call `volt()` to get the measurement, pass the result to the `update()` function, but what about the predict step? We do not have a sensor to detect 'movement' in the voltage, and for any small duration we expect the voltage to remain constant. How shall we handle this?\n",
"\n",
"As always, we will trust in the math. We have no known movement, so we will set that to zero. However, that means that we are predicting that the temperature will never change. If that is true, then over time we should become extremely confident in our results. Once the filter has enough measurements it will become very confident that it can predict the subsequent temperatures, and this will lead it to ignoring measurements that result due to an actual temperature change. This is called a *smug* filter, and is something you want to avoid. So we will add a bit of error to our prediction step to tell the filter not to discount changes in voltage over time. In the code below I set `process_variance = .05**2`. This is just the expected variance in the change of voltage over each time step. I chose this value merely to be able to show how the variance changes through the update and predict steps. For an real sensor you would set this value for the actual amount of change you expect. For example, this would be an extremely small number if it is a thermometer for ambient air temperature in a house, and a high number if this is a thermocouple in a chemical reaction chamber. We will say more about selecting the actual value in the next chapter. \n",
"As always, we will trust in the math. We have no known movement, so we will set that to zero. However, that means that we are predicting that the voltage will never change. If that is true, then over time we should become extremely confident in our results. Once the filter has enough measurements it will become very confident that it can predict the subsequent voltages, and this will lead it to ignore measurements that result from an actual voltage change. This is called a *smug* filter, and is something you want to avoid. So we will add a bit of error to our prediction step to tell the filter not to discount changes in voltage over time. In the code below I set `process_variance = .05**2`. This is the expected variance in the change of voltage over each time step. I chose this value merely to be able to show how the variance changes through the update and predict steps. For a real sensor you would set this value for the actual amount of change you expect. For example, this would be an extremely small number if it is a thermometer for ambient air temperature in a house, and a high number if this is a thermocouple in a chemical reaction chamber. We will say more about selecting the actual value in the next chapter. \n",
"\n",
"Let's see what happens. "
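A sketch of the loop described above. `volt()`, `update()`, and `predict()` are stand-ins for the notebook's versions (names, signatures, and the 0.13 sensor standard deviation are assumptions for this sketch):

```python
import random
random.seed(4)

def volt():
    # hypothetical sensor: true voltage 16.4 plus Gaussian noise, std 0.13
    return 16.4 + random.gauss(0., 0.13)

def update(x, P, z, R):
    return (P * z + R * x) / (P + R), (P * R) / (P + R)

def predict(x, P, movement, Q):
    return x + movement, P + Q

process_variance = .05**2    # expected variance in voltage change per step
voltage_variance = .13**2    # sensor noise variance
x, P = 25., 1000.            # wild initial guess with huge uncertainty

for _ in range(100):
    x, P = predict(x, P, movement=0., Q=process_variance)
    x, P = update(x, P, z=volt(), R=voltage_variance)
```

The nonzero `process_variance` keeps `P` from collapsing to zero, so the filter never becomes smug.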
]
@@ -1924,9 +1922,9 @@
"source": [
"The first plot shows the individual sensor measurements vs the filter output. Despite a lot of noise in the sensor we quickly discover the approximate voltage of the sensor. In the run I just completed at the time of authorship, the last voltage output from the filter is $16.213$, which is quite close to the $16.4$ used by the `volt()` function. On other runs I have gotten larger and smaller results.\n",
"\n",
"Spec sheets are just what they sound like - specifications. Any individual sensor will exhibit different performance based on normal manufacturing variations. Values are often maximums - the spec is a guarantee that the performance will be at least that good. So, our sensor might have standard deviation of 1.8. If you buy an expensive piece of equipment it often comes with a sheet of paper displaying the test results of your specific item; this is usually very trustworthy. On the other hand, if this is a cheap sensor it is likely it received little to no testing prior to being sold. Manufacturers typically test a small subset of their output to verify that a sample falls within the desired performance range. If you have a critical application you will need to read the specification sheet carefully to figure out exactly what they mean by their ranges. Do they guarantee their number is a maximum, or is it, say, the $3\\sigma$ error rate? Is every item tested? Is the variance normal, or some other distribution? Finally, manufacturing is not perfect. Your part might be defective and not match the performance on the sheet.\n",
"Spec sheets are what they sound like - specifications. Any individual sensor will exhibit different performance based on normal manufacturing variations. Values are often maximums - the spec is a guarantee that the performance will be at least that good. So, our sensor might have a standard deviation of 1.8. If you buy an expensive piece of equipment it often comes with a sheet of paper displaying the test results of your specific item; this is usually very trustworthy. On the other hand, if this is a cheap sensor it is likely it received little to no testing prior to being sold. Manufacturers typically test a small subset of their output to verify that a sample falls within the desired performance range. If you have a critical application you will need to read the specification sheet carefully to figure out exactly what they mean by their ranges. Do they guarantee their number is a maximum, or is it, say, the $3\\sigma$ error rate? Is every item tested? Is the variance normal, or some other distribution? Finally, manufacturing is not perfect. Your part might be defective and not match the performance on the sheet.\n",
"\n",
"For example, I just randomly looked up a data sheet for an airflow sensor. There is a field *Repeatability*, with the value $\\pm 0.50\\%$. Is this a Gaussian? Is there a bias? For example, perhaps the repeatability is nearly 0.0% at low temperatures, and always nearly +0.50 at high temperatures. Data sheets for electrical components often contain a section of \"Typical Performance Characteristics\". These are used to capture information that cannot be easily conveyed in a table. For example, I am looking at a chart showing output voltage vs current for a LM555 timer. There are three curves showing the performance at different temperatures. The response is ideally linear, but all three lines are curved. This clarifies that errors in voltage outputs are probably not Gaussian - in this chip's case higher temperatures leads to lower voltage output, and the voltage output is quite nonlinear if the input current is very high. \n",
"For example, I am looking at a data sheet for an airflow sensor. There is a field *Repeatability*, with the value $\\pm 0.50\\%$. Is this a Gaussian? Is there a bias? For example, perhaps the repeatability is nearly 0.0% at low temperatures, and always nearly +0.50 at high temperatures. Data sheets for electrical components often contain a section of \"Typical Performance Characteristics\". These are used to capture information that cannot be easily conveyed in a table. For example, I am looking at a chart showing output voltage vs current for an LM555 timer. There are three curves showing the performance at different temperatures. The response is ideally linear, but all three lines are curved. This clarifies that errors in voltage outputs are probably not Gaussian - in this chip's case higher temperatures lead to lower voltage output, and the voltage output is quite nonlinear if the input current is very high. \n",
"\n",
"As you might guess, modeling the performance of your sensors is one of the harder parts of creating a Kalman filter that performs well. "
]
@@ -2224,7 +2222,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Again the answer is yes! Because we are relatively sure about our belief in the sensor ($\\sigma=30$) even after the first step we have changed our belief in the first position from 1000 to somewhere around 60.0 or so. After another 5-10 measurements we have converged to the correct value! So this is how we get around the chicken and egg problem of initial guesses. In practice we would probably just assign the first measurement from the sensor as the initial value, but you can see it doesn't matter much if we wildly guess at the initial conditions - the Kalman filter still converges so long as the filter variances are chosen to match the actual process and measurement variances."
"Again the answer is yes! Because we are relatively sure about our belief in the sensor ($\\sigma=30$), even after the first step we have changed our belief in the first position from 1000 to somewhere around 60.0. After another 5-10 measurements we have converged to the correct value! So this is how we get around the chicken and egg problem of initial guesses. In practice we would likely assign the first measurement from the sensor as the initial value, but you can see it doesn't matter much if we wildly guess at the initial conditions - the Kalman filter still converges so long as the filter variances are chosen to match the actual process and measurement variances."
]
},
{
@@ -2284,7 +2282,7 @@
"source": [
"This time the filter does struggle. Notice that the previous example only computed 100 updates, whereas this example uses 1000. By my eye it takes the filter 400 or so iterations to become reasonably accurate, but maybe over 600 before the results are good. Kalman filters are good, but we cannot expect miracles. If we have extremely noisy data and extremely bad initial conditions, this is as good as it gets.\n",
"\n",
"Finally, let's make the suggest change of making our initial position guess just be the first sensor measurement."
"Finally, let's make the suggested change of making our initial position guess be the first sensor measurement."
]
},
{
@@ -2346,7 +2344,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Implement the Kalman filter using IPython Notebook's animation features to allow you to modify the various constants in real time using sliders. Refer to the section **Interactive Gaussians** in the Gaussian chapter to see how to do this. You will use the `interact()` function to call a calculation and plotting function. Each parameter passed into `interact()` automatically gets a slider created for it. I have built the boilerplate for this; just fill in the required code."
"Implement the Kalman filter using IPython Notebook's animation features to allow you to modify the various constants in real time using sliders. Refer to the section **Interactive Gaussians** in the Gaussian chapter to see how to do this. You will use the `interact()` function to call a calculation and plotting function. Each parameter passed into `interact()` automatically gets a slider created for it. I have built the boilerplate for this; you fill in the required code."
]
},
{
@@ -2532,7 +2530,7 @@
"\n",
"If we recall the g-h filter chapter we can understand what is happening here. The structure of the g-h filter requires that the filter output chooses a value part way between the prediction and measurement. A varying signal like this one is always accelerating, whereas our process model assumes constant velocity, so the filter is mathematically guaranteed to always lag the input signal. \n",
"\n",
"Maybe we just didn't adjust things 'quite right'. After all, the output looks like a sin wave, it is just offset some. Let's test this assumption."
"Maybe we didn't adjust things 'quite right'. After all, the output looks like an offset sine wave. Let's test this assumption."
]
},
{
@@ -2677,7 +2675,7 @@
"$$P(x|z) \\propto \\exp \\Big[-\\frac{1}{2}\\frac{(x-\\frac{\\sigma_z^2\\mu_z + \\sigma_p^2z}{\\sigma_p^2+\\sigma_z^2})^2}{\\frac{\\sigma_z^2\\sigma_p^2}{\\sigma_p^2+\\sigma_z^2}}\\Big ]\n",
"$$\n",
"\n",
"A Gaussian is just\n",
"A Gaussian is\n",
"\n",
"$$N(m,\\, s^2) \\propto \\exp\\Big [-\\frac{1}{2}\\frac{(x - m)^2}{s^2}\\Big ]$$\n",
"\n",
@@ -299,13 +299,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The title of this book is *Kalman and Bayesian Filters in Python* but to date I have not touched on the Bayesian aspect much. There was enough going on in the earlier chapters that adding this form of reasoning about filters could be a distraction rather than a help. I now wish to take some time to explain what Bayesian probability is and how a Kalman filter is in fact a Bayesian filter. This is not just a diversion. First of all, a lot of the Kalman filter literature uses this formulation when talking about filters, so you will need to understand what they are talking about. Second, this math plays a strong role in filtering design once we move past the Kalman filter. \n",
"The title of this book is *Kalman and Bayesian Filters in Python* but to date I have not touched on the Bayesian aspect much. There was enough going on in the earlier chapters that adding this form of reasoning about filters could be a distraction rather than a help. I now wish to take some time to explain what Bayesian probability is and how a Kalman filter is in fact a Bayesian filter. This is not a diversion. First of all, a lot of the Kalman filter literature uses this formulation when talking about filters, so you will need to understand what they are talking about. Second, this math plays a strong role in filtering design once we move past the Kalman filter. \n",
"\n",
"To do so we will go back to our first tracking problem - tracking a dog in a hallway. Recall the update step - we believed with some degree of precision that the dog was at position 7 (for example), and then received a measurement that the dog is at position 7.5. We want to incorporate that measurement into our belief. In the *Discrete Bayes* chapter we used histograms to denote our estimates at each hallway position, and in the *One Dimensional Kalman Filters* chapter we used Gaussians. Both are methods of using *Bayesian* probability.\n",
"\n",
"Briefly, *Bayesian* probability is a branch of math that lets us evaluate a hypothesis or new data point given some uncertain information about the past. For example, suppose you are driving down your neighborhood street and see one of your neighbors at their door, using a key to let themselves in. Three doors down you see two people in masks breaking a window of another house. What might you conclude?\n",
"\n",
"It is likely that you would reason that in the first case your neighbors were getting home and unlocking their door to get inside. In the second case you at least strongly suspect a robbery is in progress. In the first case you would just proceed on, and in the second case you'd probably call the police.\n",
"It is likely that you would reason that in the first case your neighbors were getting home and unlocking their door to get inside. In the second case you at least strongly suspect a robbery is in progress. In the first case you would proceed on, and in the second case you'd probably call the police.\n",
"\n",
"Of course, things are not always what they appear. Perhaps unbeknownst to you your neighbor sold their house that morning, and they were now breaking in to steal the new owner's belongings. In the second case, perhaps the owners of the house were at a costume event at the next house; they had a medical emergency with their child, realized they had lost their keys, and were breaking into their own house to get the much needed medication. Those are both *unlikely* events, but possible. Adding a few additional pieces of information would allow you to determine the true state of affairs in all but the most complicated situations.\n",
"\n",
@@ -328,7 +328,7 @@
"$$P(A|B) = \\frac{P(B | A)\\, P(A)}{P(B)}\\cdot$$\n",
"\n",
"\n",
"Before we do some computations, let's review what the terms mean. P(A) is called the *prior probability* of the event A, and is often just shortened to the *prior*. What is the prior? It is just the probability of A being true *before* we incorporate new evidence. In our dog tracking problem above, the prior is the probability we assign to our belief that the dog is positioned at 7 before we make the measurement of 7.5. It is important to master this terminology if you expect to read a lot of the literature.\n",
"Before we do some computations, let's review what the terms mean. P(A) is called the *prior probability* of the event A, and is often shortened to the *prior*. What is the prior? It is the probability of A being true *before* we incorporate new evidence. In our dog tracking problem above, the prior is the probability we assign to our belief that the dog is positioned at 7 before we make the measurement of 7.5. It is important to master this terminology if you expect to read a lot of the literature.\n",
"\n",
"$P(A|B)$ is the *conditional probability* that A is true given that B is true. For example, if it is true that your neighbor still owns their house, then it will be very likely that they are not breaking into their house. In Bayesian probability this is called the *posterior*, and it denotes our new belief after incorporating the measurement/knowledge of B. For our dog tracking problem the posterior is the probability given to the estimated position after incorporating the measurement 7.5. For the neighbor problem the posterior would be the probability of a break in after you find out that your neighbor sold their home last week.\n",
"\n",
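A toy numeric check of Bayes' theorem helps anchor the terminology. All of the probabilities below are made up for illustration; none come from the text:

```python
# A = "dog is near position 7", B = "sensor reads 7.5"
p_a = 0.3          # prior: belief the dog is near 7 before the reading
p_b_given_a = 0.8  # likelihood of reading 7.5 if the dog really is near 7
p_b = 0.4          # total probability of reading 7.5 over all positions

# Bayes' theorem: posterior = likelihood * prior / evidence
posterior = p_b_given_a * p_a / p_b   # roughly 0.6
```

The measurement doubled our belief, from a prior of 0.3 to a posterior of about 0.6, because a reading of 7.5 is much more likely when the dog is actually near 7.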
@@ -383,7 +383,7 @@
"\n",
"It's the same thing being calculated by the code. Multiply the prior ($P(A)$) by the probability of the measurement at each position ($P(B|A)$) and divide by the total probability for the event ($P(B)$).\n",
"\n",
"In other words the first half of the Discrete Bayes chapter developed Bayes' equation from a thought experiment. I could have just presented Bayes' equation and then given you the Python routine above to implement it, but chances are you would not have understood *why* Bayes' equation works. Presenting the equation first is the normal approach of Kalman filtering texts, and I always found it extremely nonintuitive. "
"In other words the first half of the Discrete Bayes chapter developed Bayes' equation from a thought experiment. I could have presented Bayes' equation and then given you the Python routine above to implement it, but chances are you would not have understood *why* Bayes' equation works. Presenting the equation first is the normal approach of Kalman filtering texts, and I always found it extremely nonintuitive. "
]
},
{
@@ -409,7 +409,7 @@
"\n",
"$$\\mathbf{x}_0 = \\begin{bmatrix}1 \\\\8\\\\3\\end{bmatrix},\\ \\ \\mathbf{x}_1 = \\begin{bmatrix}3\\\\7\\\\7\\end{bmatrix}$$\n",
"\n",
"We compute the mean just as for the univariate case, which we presented in the *Gaussians* chapter. Sum the values and divide by the number of values. Formally\n",
"We compute the mean as for the univariate case - sum the values and divide by the number of values. Formally\n",
"\n",
"$$ \\mu_x = \\frac{1}{n}\\sum^n_{i=1} x_i$$\n",
"\n",
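The two samples above make a quick check easy (NumPy assumed, as elsewhere in the book):

```python
import numpy as np

x0 = np.array([1., 8., 3.])
x1 = np.array([3., 7., 7.])

# the mean of a set of vectors is taken element-wise
mu = np.mean([x0, x1], axis=0)
```

Each element of `mu` is the univariate mean of that component: (1+3)/2, (8+7)/2, and (3+7)/2.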
@@ -432,11 +432,11 @@
"source": [
"### Expected Value\n",
"\n",
"I could just give you the formula for the covariance but it will make more sense if you see how it is derived. To do that we first need to talk about **expected value** of a random variable. The expected value is just the value we expect, on average, for the variable. \n",
"I could just give you the formula for the covariance but it will make more sense if you see how it is derived. To do that we first need to talk about **expected value** of a random variable. The expected value is the value we expect, on average, for the variable. \n",
"\n",
"The expected value of a random variable is just the average value it would have if we took an infinite number of samples of it and then averaged those samples together. Let's say we have $x=[1,3,5]$ and each value is equally probable. What would we *expect* $x$ to have, on average?\n",
"The expected value of a random variable is the average value it would have if we took an infinite number of samples of it and then averaged those samples together. Let's say we have $x=[1,3,5]$ and each value is equally probable. What would we *expect* $x$ to have, on average?\n",
"\n",
"It would just be the average of 1, 3, and 5, of course, which is 3. That should make sense; we would expect equal numbers of 1, 3, and 5 to occur, so $(1+3+5)/3=3$ is clearly the average of that infinite series of samples.\n",
"It would be the average of 1, 3, and 5, of course, which is 3. That should make sense; we would expect equal numbers of 1, 3, and 5 to occur, so $(1+3+5)/3=3$ is clearly the average of that infinite series of samples.\n",
"\n",
"Now suppose that each value has a different probability of happening. Say 1 has an 80% chance of occurring, 3 has a 15% chance, and 5 has only a 5% chance. In this case we compute the expected value by multiplying each value of $x$ by the percent chance of it occurring, and summing the result. So for this case we could compute\n",
"\n",
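In code form, the computation being set up here is a probability-weighted sum:

```python
values = [1, 3, 5]
probs = [0.80, 0.15, 0.05]

# expected value: each outcome weighted by its probability of occurring
expected = sum(v * p for v, p in zip(values, probs))   # roughly 1.5
```

The heavy weight on 1 drags the expected value well below the unweighted average of 3.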
@@ -472,7 +472,7 @@
"\n",
"$$\\sigma^2 = \\frac{1}{N}\\sum_{i=1}^N(x_i - \\mu)^2$$\n",
"\n",
"And indeed, the covariance of a variable with itself is just the variance of the variable, which we can trivially prove with\n",
"And indeed, the covariance of a variable with itself is the variance of the variable, which we can trivially prove with\n",
"\n",
"$$\\begin{aligned}\n",
"COV(x,x) &= \\frac{1}{N}\\sum_{i=1}^N (x_i - \\mu_x)(x_i - \\mu_x) \\\\\n",
@@ -534,7 +534,7 @@
"\n",
"$$\\Sigma = \\begin{bmatrix}1 &-0.5 & \\\\ -0.5 & 0.25 & \\\\ &&4\\end{bmatrix}$$.\n",
"\n",
"The arithmetic is a bit tedious, so let's just use NumPy's `cov()` function to compute the entire covariance matrix. To compute the covariance in accordance with the equations above you will need to set the parameter `bias=1`. The meaning of that parameter is not important to this book. If you are interested, wikipedia has a good article on it here http://en.wikipedia.org/wiki/Bias_of_an_estimator."
"The arithmetic is a bit tedious, so let's use NumPy's `cov()` function to compute the entire covariance matrix. To compute the covariance in accordance with the equations above you will need to set the parameter `bias=1`. The meaning of that parameter is not important to this book. If you are interested, Wikipedia has a good article on it here http://en.wikipedia.org/wiki/Bias_of_an_estimator."
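A quick demonstration of the call, with hypothetical data (not the text's). `bias=1` divides by $N$ instead of $N-1$, matching the equations above:

```python
import numpy as np

x = np.array([1., 2., 3., 4.])
y = np.array([2., 4., 6., 8.])   # y = 2x, so x and y vary together

# diagonal holds the variances; off-diagonal holds the covariance of x with y
cov = np.cov(x, y, bias=1)
```

Because `y` is exactly twice `x`, the covariance entry (2.5) is twice the variance of `x` (1.25), and the variance of `y` (5.0) is four times it.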
]
},
{
@@ -712,7 +712,7 @@
"x(t) = x_{pred}(t) + noise(t)\n",
"$$\n",
"\n",
"This is not meant to imply that $noise(t)$ is a function that we can derive analytically or that it is well behaved. If there is a bump in the road at $t=10$ then the noise factor will just incorporate that effect. Again, this is not implying that we model, compute, or even know the value of *noise(t)*, it is merely a statement of fact - we can *always* describe the actual value as the predicted value from our idealized model plus some other value. \n",
"This is not meant to imply that $noise(t)$ is a function that we can derive analytically or that it is well behaved. If there is a bump in the road at $t=10$ then the noise factor will incorporate that effect. Again, this is not implying that we model, compute, or even know the value of *noise(t)*, it is merely a statement of fact - we can *always* describe the actual value as the predicted value from our idealized model plus some other value. \n",
"\n",
"Let's express this with linear algebra. Using the same notation from previous chapters, we can say that our model of the system (without noise) is:\n",
"\n",
@@ -721,11 +721,11 @@
"That is, we have a set of linear equations that describe our system. For our car, \n",
"$\\mathbf{F}$ will be the coefficients for Newton's equations of motion. \n",
"\n",
"Now we need to model the noise. We will just call that *w*, and add it to the equation.\n",
"Now we need to model the noise. We will call that *w*, and add it to the equation.\n",
"\n",
"$$ f(\\mathbf{x}) = \\mathbf{Fx} + \\mathbf{w}$$\n",
"\n",
"Finally, we need to consider inputs into the system. We are dealing with linear problems here, so we will assume that there is some input $u$ into the system, and that we have some linear model that defines how that input changes the system. For example, if you press down on the accelerator in your car the car will accelerate. We will need a matrix $\\mathbf{B}$ to convert $u$ into the effect on the system. We just add that into our equation:\n",
"Finally, we need to consider inputs into the system. We are dealing with linear problems here, so we will assume that there is some input $u$ into the system, and that we have some linear model that defines how that input changes the system. For example, if you press down on the accelerator in your car the car will accelerate. We will need a matrix $\\mathbf{B}$ to convert $u$ into the effect on the system. We add that into our equation:\n",
"\n",
"$$ f(\\mathbf{x}) = \\mathbf{Fx} + \\mathbf{Bu} + \\mathbf{w}$$\n",
"\n",
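As a concrete, illustrative instance of $f(\mathbf{x}) = \mathbf{Fx} + \mathbf{Bu} + \mathbf{w}$: a constant-velocity car with an acceleration input. The value of $dt$ and the shape of $\mathbf{B}$ are assumptions for this sketch, and $\mathbf{w}$ is omitted since by definition we do not know it:

```python
import numpy as np

dt = 1.
F = np.array([[1., dt],     # position += velocity * dt
              [0., 1.]])    # velocity unchanged by F

B = np.array([[0.],         # control input u changes the velocity,
              [dt]])        # not the position, in this simple model

x = np.array([[10.],        # position
              [1.]])        # velocity
u = np.array([[0.5]])       # commanded acceleration

x_next = F @ x + B @ u      # the noise term w is omitted
```

One step advances the position by the velocity and bumps the velocity by the commanded input.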
@@ -762,7 +762,7 @@
"\\end{aligned}\n",
"$$\n",
"\n",
"Just a reminder: the superscript $^-$ is used to denote that the value is a prediction, not an estimate. But $\\mathbf{x}$ and $\\mathbf{x}^-$ are the same thing, the *state* of our system, just at different times of the algorithm. I am not entirely pleased with this notation for several reasons. First, it clutters the equations up, making them harder to read. More importantly, that aren't always correct. For example, consider a situation wherre we have no measurement for 1 or more time periods. In that case we just compute the predict step several times in a row without an intervening update step. Thus, $\\mathbf{x}$ is never computed, and the predict step is actually $\\mathbf{x}^-_t = \\mathbf{Fx}^-_{t-1} + \\mathbf{Bu}$. Alternatively, if you have several measurements for one time epoch (from different sensors, say) it is possible to perform the update once for each measurement instead of trying to incorporate all of the measurements at once. Later in the book we will do this with a localization problem where we have bearing and distance measurements for several landmarks. In that case the subsequent updates are not performing the computations on the predicted state, but on the partially updated state. You can think of this as un update being performed with no movement or time having passed since the last update. The $^-$ notation does not capture any of these. For most of the book I will dispense with this notation unless I want to call out that I am computing a prediciton. \n",
"A reminder: the superscript $^-$ is used to denote that the value is a prediction, not an estimate. But $\\mathbf{x}$ and $\\mathbf{x}^-$ are the same thing, the *state* of our system, just at different times of the algorithm. I am not entirely pleased with this notation for several reasons. First, it clutters the equations up, making them harder to read. More importantly, it isn't always correct. For example, consider a situation where we have no measurement for one or more time periods. In that case we compute the predict step several times in a row without an intervening update step. Thus, $\\mathbf{x}$ is never computed, and the predict step is actually $\\mathbf{x}^-_t = \\mathbf{Fx}^-_{t-1} + \\mathbf{Bu}$. Alternatively, if you have several measurements for one time epoch (from different sensors, say) it is possible to perform the update once for each measurement instead of trying to incorporate all of the measurements at once. Later in the book we will do this with a localization problem where we have bearing and distance measurements for several landmarks. In that case the subsequent updates are not performing the computations on the predicted state, but on the partially updated state. You can think of this as an update being performed with no movement or time having passed since the last update. The $^-$ notation does not capture any of these. For most of the book I will dispense with this notation unless I want to call out that I am computing a prediction. \n",
|
||||
"\n",
"I will start with the update step, as that is what we started with in the one dimensional Kalman filter case. The first equation is\n",
"\n",
@@ -770,7 +770,7 @@
"\\mathbf{y} = \\mathbf{z} - \\mathbf{H x}\\tag{3}\n",
"$$\n",
"\n",
"On the right we have $\\mathbf{Hx}$. That should be recognizable as the measurement function. Multiplying $\\mathbf{H}$ with $\\mathbf{x}$ puts $\\mathbf{x}$ into *measurement space*; in other words, the same basis and units as the sensor's measurements. The variable $\\mathbf{z}$ is just the measurement; it is typical, but not universal to use $\\mathbf{z}$ to denote measurements in the literature ($\\mathbf{y}$ is also sometimes used). Do you remember this chart?"
"On the right we have $\\mathbf{Hx}$. That should be recognizable as the measurement function. Multiplying $\\mathbf{H}$ with $\\mathbf{x}$ puts $\\mathbf{x}$ into *measurement space*; in other words, the same basis and units as the sensor's measurements. The variable $\\mathbf{z}$ is the measurement; it is typical, but not universal to use $\\mathbf{z}$ to denote measurements in the literature ($\\mathbf{y}$ is also sometimes used). Do you remember this chart?"
]
},
{
@@ -811,7 +811,7 @@
"\\end{aligned}\n",
"$$\n",
"\n",
"Unfortunately it is a fair amount of linear algebra to derive this. The derivation can be quite elegant, and I urge you to look it up if you have the mathematical education to follow it. But $\\mathbf{K}$ is just the *Kalman gain* - the ratio of how much measurement vs prediction we should use to create the new estimate. $\\mathbf{R}$ is the *measurement noise*, and $\\mathbf{P}$ is our *uncertainty covariance matrix* from the prediction step.\n",
"Unfortunately it is a fair amount of linear algebra to derive this. The derivation can be quite elegant, and I urge you to look it up if you have the mathematical education to follow it. $\\mathbf{K}$ is the **Kalman gain** - the ratio of how much measurement vs prediction we should use to create the new estimate. $\\mathbf{R}$ is the *measurement noise*, and $\\mathbf{P}$ is our *uncertainty covariance matrix* from the prediction step.\n",
"\n",
"So let's work through this expression by expression. Start with $\\mathbf{HPH}^\\mathsf{T}$. The linear equation $\\mathbf{ABA}^\\mathsf{T}$ can be thought of as changing $\\mathbf{B}$ to the coordinate system of $\\mathbf{A}$. So $\\mathbf{HPH}^\\mathsf{T}$ is taking the covariance $\\mathbf{P}$ and putting it in measurement ($\\mathbf{H}$) space. \n",
"\n",
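The "ratio" reading of the gain equation can be sketched numerically. A minimal NumPy example; the numbers are invented for illustration and are not the book's dog-tracking values:

```python
import numpy as np

# Equations (3)/(4) with invented numbers, for a state of
# [position, velocity] where only position is measured.
P = np.array([[4., 1.],
              [1., 9.]])          # prior state covariance
H = np.array([[1., 0.]])          # measurement function
R = np.array([[25.]])             # measurement noise covariance

S = H @ P @ H.T + R               # P projected into measurement space, plus R
K = P @ H.T @ np.linalg.inv(S)    # Kalman gain: an "uncertainty ratio"
```

When `R` shrinks relative to `P`, `K[0, 0]` approaches 1 and the update trusts the measurement almost completely; when `R` dominates, the gain approaches 0.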
@@ -843,7 +843,7 @@
"\n",
"In other words, the *Kalman gain* equation is doing nothing more than computing a ratio based on how much we trust the prediction vs the measurement. If we are confident in our measurements and unconfident in our predictions $\\mathbf{K}$ will favor the measurement, and vice versa. The equation is complicated because we are doing this in multiple dimensions via matrices, but the concept is simple - scale by a ratio.\n",
"\n",
"Without going into the derivation of $\\mathbf{K}$, I'll say that this equation is the result of finding a value of $\\mathbf{K}$ that optimizes the *mean-square estimation error*. It does this by finding the minimal values for $\\mathbf{P}$ along its diagonal. Recall that the diagonal of $\\mathbf{P}$ is just the variance for each state variable. So, this equation for $\\mathbf{K}$ ensures that the Kalman filter output is optimal. To put this in concrete terms, for our dog tracking problem this means that the estimates for both position and velocity will be optimal - a value of $\\mathbf{K}$ that made the position extremely accurate but the velocity very inaccurate would be rejected in favor of a $\\mathbf{K}$ that made both position and velocity just somewhat accurate."
"Without going into the derivation of $\\mathbf{K}$, I'll say that this equation is the result of finding a value of $\\mathbf{K}$ that optimizes the *mean-square estimation error*. It does this by finding the minimal values for $\\mathbf{P}$ along its diagonal. Recall that the diagonal of $\\mathbf{P}$ is the variance for each state variable. So, this equation for $\\mathbf{K}$ ensures that the Kalman filter output is optimal. To put this in concrete terms, for our dog tracking problem this means that the estimates for both position and velocity will be optimal - a value of $\\mathbf{K}$ that made the position extremely accurate but the velocity very inaccurate would be rejected in favor of a $\\mathbf{K}$ that made both position and velocity somewhat accurate."
]
},
{
@@ -854,7 +854,7 @@
"\n",
"$$\\mathbf{x}=\\mathbf{x} +\\mathbf{Ky}\\tag{5}$$\n",
"\n",
"This just multiplies the residual by the Kalman gain, and adds it to the state variable. In other words, this is the computation of our new estimate.\n",
"This multiplies the residual by the Kalman gain, and adds it to the state variable. In other words, this is the computation of our new estimate.\n",
"\n",
"Finally, we have:\n",
"\n",
@@ -876,7 +876,7 @@
"\n",
"$$\\mathbf{x} = \\mathbf{Fx} + \\mathbf{Bu}\\tag{1}$$\n",
"\n",
"This is just our state transition equation which we have already discussed. $\\mathbf{Fx}$ multiplies $\\mathbf{x}$ with the state transition matrix to compute the next state. $B$ and $u$ add in the contribution of the control input $\\mathbf{u}$, if any.\n",
"This is our state transition equation which we have already discussed. $\\mathbf{Fx}$ multiplies $\\mathbf{x}$ with the state transition matrix to compute the next state. $\\mathbf{Bu}$ adds in the contribution of the control input $\\mathbf{u}$, if any.\n",
"\n",
"The final equation is:\n",
"$$\\mathbf{P} = \\mathbf{FPF}^\\mathsf{T} + \\mathbf{Q}\\tag{2}$$\n",
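Equations (1) and (2) together form the predict step. A minimal sketch for a constant velocity model; all numbers are invented for illustration:

```python
import numpy as np

# Predict step for a constant velocity model, timestep dt.
dt = 1.0
F = np.array([[1., dt],
              [0., 1.]])        # state transition: pos += vel*dt
x = np.array([[10.], [4.5]])    # state: position, velocity
P = np.diag([500., 49.])        # state covariance
Q = np.array([[0.5875, 1.175],
              [1.175,  2.35]])  # process noise (illustrative values)

x = F @ x                       # equation (1), no control input
P = F @ P @ F.T + Q             # equation (2)
```

Note how the position variance `P[0, 0]` grows after the predict: the velocity uncertainty leaks into the position estimate, and `Q` adds a bit more on top.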
@@ -1024,7 +1024,7 @@
"source": [
"## Converting Kalman Filter to a g-h Filter\n",
"\n",
"I've stated that the Kalman filter is a form of the g-h filter. It just takes some algebra to prove this. It's more straightforward to do with the one dimensional case, so I will do that. Recall \n",
"I've stated that the Kalman filter is a form of the g-h filter. It just takes some algebra to prove it. It's more straightforward to do with the one dimensional case, so I will do that. Recall \n",
"\n",
"$$\n",
"\\mu_{x}=\\frac{\\sigma_1^2 \\mu_2 + \\sigma_2^2 \\mu_1} {\\sigma_1^2 + \\sigma_2^2}\n",
@@ -1083,7 +1083,7 @@
"\n",
"$$h_n = \\frac{COV (x,\\dot{x})}{\\sigma^2_{y}}$$\n",
"\n",
"The takeaway point is that *g* and *h* are specified fully by the variance and covariances of the measurement and predictions at time *n*. In other words, we are just picking a point between the measurement and prediction by a scale factor determined by the quality of each of those two inputs. That is all the Kalman filter is. "
"The takeaway point is that *g* and *h* are specified fully by the variance and covariances of the measurement and predictions at time *n*. In other words, we are picking a point between the measurement and prediction by a scale factor determined by the quality of each of those two inputs. That is all the Kalman filter is. "
]
},
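The claim that the filter picks a point between prediction and measurement, scaled by the quality of each, is easy to verify in the one dimensional case. A sketch with invented numbers:

```python
# One dimensional fusion: the estimate lies between the prediction
# and the measurement, weighted by their variances.
prior_mu, prior_var = 10.0, 4.0   # prediction (invented values)
z, z_var = 12.0, 1.0              # measurement (invented values)

# mu = (sigma1^2 * mu2 + sigma2^2 * mu1) / (sigma1^2 + sigma2^2)
mu = (prior_var * z + z_var * prior_mu) / (prior_var + z_var)

# the same computation written in gain form, with 0 < g < 1
g = prior_var / (prior_var + z_var)
mu_gain = prior_mu + g * (z - prior_mu)
```

Both forms give the same answer; the second makes the g-h interpretation explicit, since `g` is exactly the scale factor between prediction and measurement.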
{
@@ -1288,7 +1288,7 @@
"\n",
"$$ f(x) = \\dot{x}$$\n",
"\n",
"What does *time invariant* mean? Consider a home stereo. If you input a signal $x$ into it at time $t$, it will output some signal $f(x)$. If you instead make the input at a later time $t + \\Delta t$ the output signal will still be exactly the same, just shifted in time. This is different from, say, an aircraft. If you make a control input to the aircraft at a later time it's behavior will be different because it will have burned additional fuel (and thus lost weight), drag may be different due to being at a different altitude, and so on.\n",
"What does *time invariant* mean? Consider a home stereo. If you input a signal $x$ into it at time $t$, it will output some signal $f(x)$. If you instead make the input at a later time $t + \\Delta t$ the output signal will still be exactly the same except shifted in time. This is different from, say, an aircraft. If you make a control input to the aircraft at a later time its behavior will be different because it will have burned additional fuel (and thus lost weight), drag may be different due to being at a different altitude, and so on.\n",
"\n",
"We can solve these equations by integrating each side. The time variant equation is very straightforward. We essentially solved this problem with the Newtonian equations above, but let's be explicit and write it out. Starting with $$\\dot{\\mathbf{x}}=\\mathbf{v}$$ we get the expected\n",
"\n",
@@ -1346,7 +1346,7 @@
"\n",
"$$x(t_k) = \\Phi(\\Delta t)x(t_{k-1})$$\n",
"\n",
"In other words, we just want to compute the value of $x$ at time $t$ by multiplying its previous value by some matrix $\\Phi$. This is not trivial to do because the original equations do not include time."
"In other words, we want to compute the value of $x$ at time $t$ by multiplying its previous value by some matrix $\\Phi$. This is not trivial to do because the original equations do not include time."
]
},
{
@@ -1457,7 +1457,7 @@
"\n",
"We derived this equation in that chapter by using techniques that are much easier to understand. The advantage of the Taylor series expansion is that we can use it for any arbitrary set of differential equations which are time invariant. \n",
"\n",
"Time invariant? This just means equations that do not depend on time. For example, $f(x) = sin(t)t$ is not time invariant; the value will be different at different times due to the multiplication by t. On the other hand $f(x) = sin(t)*3$ is time invariant. $x(t)$ is dependent on time, but that is allowed. This is because to be time invariant it must be true that if $f(t) = x(t)$ then $f(t+\\Delta t) = x(t+\\Delta t)$. This will be true for the second equation, but not the first.\n",
"Time invariant? This means equations that do not depend on time. For example, $f(x) = sin(t)t$ is not time invariant; the value will be different at different times due to the multiplication by t. On the other hand $f(x) = sin(t)*3$ is time invariant. $x(t)$ is dependent on time, but that is allowed. This is because to be time invariant it must be true that if $f(t) = x(t)$ then $f(t+\\Delta t) = x(t+\\Delta t)$. This will be true for the second equation, but not the first.\n",
"\n",
"However, we often use a Taylor expansion even when the equations are not time invariant. The answer will still be reasonably accurate so long as the time step is short and the system is nearly constant over that time step."
]
@@ -1598,7 +1598,7 @@
"$$\\begin{aligned}\\dot{x}_1 &= x_2 \\\\\n",
"\\dot{x}_2 &= 6x_2-9x_1 + t\\end{aligned}$$\n",
"\n",
"If you practice this a bit you will become adept at it. Just isolate the highest term, define a new variable and its derivatives, and then substitute."
"If you practice this a bit you will become adept at it. Isolate the highest term, define a new variable and its derivatives, and then substitute."
]
},
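The substitution above can be checked by integrating the resulting first-order system. A crude Euler sketch; the `derivs` helper, initial conditions, and step size are my own invented illustration, not from the book:

```python
# x1 = x, x2 = x', so x1' = x2 and x2' = 6*x2 - 9*x1 + t.
def derivs(t, x1, x2):
    return x2, 6.0*x2 - 9.0*x1 + t

dt = 0.001
t, x1, x2 = 0.0, 1.0, 0.0       # arbitrary initial conditions
for _ in range(1000):           # integrate from t=0 to t=1
    dx1, dx2 = derivs(t, x1, x2)
    x1, x2 = x1 + dt*dx1, x2 + dt*dx2
    t += dt
```

Any first-order ODE solver can now handle the original second-order equation, which is the whole point of the substitution.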
{
@@ -1628,7 +1628,7 @@
"\n",
"$$\\mathbf{P} = \\mathbf{FP{F}}^\\mathsf{T} + \\mathbf{Q}$$\n",
"\n",
"In this equation $\\mathbf{FP{F}}^\\mathsf{T}$ projects $\\mathbf{P}$ forward to the next time step according to our motion model $\\mathbf{F}$, and $\\mathbf{Q}$ is the *process noise matrix*. We are just adding matrices, so hopefully it is clear that each element in $\\mathbf{Q}$ specifies how much uncertainty is added to the system due to the process noise. \n",
"In this equation $\\mathbf{FP{F}}^\\mathsf{T}$ projects $\\mathbf{P}$ forward to the next time step according to our motion model $\\mathbf{F}$, and $\\mathbf{Q}$ is the *process noise matrix*. We are adding matrices, so hopefully it is clear that each element in $\\mathbf{Q}$ specifies how much uncertainty is added to the system due to the process noise. \n",
"\n",
"Let's take a brief moment to look at the behavior of these equations in more detail. I will project $\\mathbf{x}$ forward according to the motion model $\\mathbf{F}$ with the equation $\\mathbf{x} = \\mathbf{Fx}$. Then I will project P to the next timestep with the equation $\\mathbf{P} = \\mathbf{FP{F}}^\\mathsf{T}$. I will draw the original covariance in red and the new covariance in black."
]
@@ -1716,7 +1716,7 @@
"\n",
"So I have demonstrated the reason for the $\\mathbf{FP{F}}^\\mathsf{T}$ part of the equation, which assumes a perfect model. But of course no model is perfect. Except perhaps in a perfect vacuum velocity will never remain constant. If we are tracking a car, for example, even if we set the cruise control the car will slightly change velocity due to the wind, hills, bumps in the road, imperfections in the cruise control, and so on. If the Kalman filter used the equation above it would quickly become over-certain in its computations. Recall that in the unidimensional Kalman filter chapter every predict cycle increased our uncertainty, and every update decreased it. The same needs to happen here.\n",
"\n",
"This is where $\\mathbf{Q}$ comes in. It is a covariance matrix that describes how much *process noise* is in the system. That is just a fancy way of saying how much uncertainty gets added to the system by things we haven't modeled.\n",
"This is where $\\mathbf{Q}$ comes in. It is a covariance matrix that describes how much *process noise* is in the system. That is a fancy way of saying how much uncertainty gets added to the system by things we haven't modeled.\n",
"\n",
"Let's rerun this simulation, but this time by using the full equation $\\mathbf{P} = \\mathbf{FP{F}}^\\mathsf{T} + \\mathbf{Q}$."
]
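The behavior described here is easy to demonstrate numerically. A sketch that runs ten predict cycles with and without a small, invented process noise matrix:

```python
import numpy as np

# Repeated predicts with no intervening updates, for a constant
# velocity model. Values are illustrative only.
dt = 1.0
F = np.array([[1., dt],
              [0., 1.]])
Q = np.array([[0.25, 0.5],
              [0.5,  1.0]]) * 0.01   # small process noise

P_no_q = np.diag([1., 1.])
P_with_q = np.diag([1., 1.])
for _ in range(10):
    P_no_q = F @ P_no_q @ F.T          # model assumed perfect
    P_with_q = F @ P_with_q @ F.T + Q  # full equation
```

Without `Q` the velocity variance never changes, which is the over-certainty problem described above; with `Q` every element of `P` grows a bit on every cycle.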
@@ -1789,7 +1789,7 @@
"\n",
"The filtered result will not be optimal, but in my opinion the promise of optimal results from Kalman filters is mostly wishful thinking. Consider, for example, tracking a car. In that problem the process noise would include things like potholes, wind gusts, changing drag due to turning, rolling down windows, and many more factors. We cannot realistically model that analytically, and so in practice we work out a simplified model, compute $\\mathbf{Q}$ based on that simplified model, and then add *a bit more* to $\\small\\mathbf{Q}$ in hopes of taking the incalculable factors into account. Then we use a lot of simulations and trial runs to see if the filter behaves well; if it doesn't we adjust $\\small\\mathbf{Q}$ until the filter performs well. In this chapter we will focus on forming an intuitive understanding on how adjusting $\\small\\mathbf{Q}$ affects the output of the filter. In the Kalman Filter Math chapter we will discuss the analytic computation of $\\small\\mathbf{Q}$, and also provide code that will compute it for you.\n",
"\n",
"For now we will just import the code from the `FilterPy` module, where it is already implemented. I will import it and call help on it so you can see the documentation for it."
"For now we will import the code from the `FilterPy` module, where it is already implemented. I will import it and call help on it so you can see the documentation for it."
]
},
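For reference, for a two-state (position, velocity) model the standard discrete white noise result that FilterPy computes can be written out by hand. A sketch, with an invented `dt` and variance:

```python
import numpy as np

# Discrete white noise Q for a [position, velocity] state: the
# standard result, written by hand for comparison with FilterPy.
def q_discrete_white_noise_2d(dt, var):
    return var * np.array([[0.25*dt**4, 0.5*dt**3],
                           [0.5*dt**3,  dt**2]])

Q = q_discrete_white_noise_2d(dt=0.1, var=2.35)
```

This should match what `Q_discrete_white_noise(dim=2, dt=0.1, var=2.35)` from `filterpy.common` returns; the library version also supports other dimensions.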
{
@@ -2329,7 +2329,7 @@
"\n",
"> As an aside, most textbooks are more exact with the notation; in Gelb[1], for example, *Pk(-)* is used to denote the uncertainty covariance for the prediction step, and *Pk(+)* for the uncertainty covariance for the update step. Other texts use subscripts with 'k|k-1', superscript $^-$, and many other variations. As a programmer I find all of that fairly unreadable; I am used to thinking about variables changing state as a program runs, and do not use a different variable name for each new computation. There is no agreed upon format, so each author makes different choices. I find it challenging to switch quickly between books and papers, and so have adopted this admittedly less precise notation. Mathematicians will write scathing emails to me, but I hope the programmers and students will rejoice.\n",
"\n",
"> If you are a programmer trying to understand a paper's math equations, I strongly recommend just removing all of the superscripts, subscripts, and diacriticals, replacing them with a single letter. If you work with equations like this every day this is superfluous advice, but when I read I am usually trying to understand the flow of computation. To me it is far more understandable to remember that $P$ in this step represents the updated value of $P$ computed in the last step, as opposed to trying to remember what $P_{k-1}(+)$ denotes, and what its relation to $P_k(-)$ is, if any.\n",
"> If you are a programmer trying to understand a paper's math equations, I suggest starting by removing all of the superscripts, subscripts, and diacriticals, replacing them with a single letter. If you work with equations like this every day this is superfluous advice, but when I read I am usually trying to understand the flow of computation. To me it is far more understandable to remember that $P$ in this step represents the updated value of $P$ computed in the last step, as opposed to trying to remember what $P_{k-1}(+)$ denotes, and what its relation to $P_k(-)$ is, if any.\n",
"\n",
"> For example, for the equation of $\\mathbf{S}$ above, Wikipedia uses\n",
"\n",
@@ -2585,7 +2585,7 @@
"\n",
"$$y(t_0 + h) = y(t_0) + h y'(t_0) + \\frac{1}{2!}h^2 y''(t_0) + \\frac{1}{3!}h^3 y'''(t_0) + O(h^4).$$\n",
"\n",
"Here we can see that Euler's method is just using the first two terms of the Taylor expansion. Each subsequent term is smaller than the previous terms, so we are assured that the estimate will not be too far off from the correct value. "
"Here we can see that Euler's method is using the first two terms of the Taylor expansion. Each subsequent term is smaller than the previous terms, so we are assured that the estimate will not be too far off from the correct value. "
]
},
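A minimal sketch of Euler's method, i.e. keeping only the first two Taylor terms, $y(t_0+h) \approx y(t_0) + h\,y'(t_0)$. The test equation $y'=y$, $y(0)=1$ is my choice for illustration, since the exact solution is $e^t$:

```python
# Euler's method: step forward using only the slope at the
# current point.
def euler(f, y0, t0, t_end, n_steps):
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + h * f(t, y)
        t = t + h
    return y

y = euler(lambda t, y: y, y0=1.0, t0=0.0, t_end=1.0, n_steps=1000)
# y approximates e = 2.71828...; the error shrinks roughly linearly with h
```

Because the neglected Taylor terms are all positive for $e^t$, Euler slightly undershoots here; halving `h` roughly halves the error, as expected for a first-order method.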
{
@@ -2602,7 +2602,7 @@
"\n",
"Runge Kutta integration is the workhorse of numerical integration. As mentioned earlier there are a vast number of methods in the literature. In practice, using the Runge Kutta algorithm that I present here will solve most any problem you will face. It offers a very good balance of speed, precision, and stability, and it is the 'go to' numerical integration method unless you have a very good reason to choose something different. If you have the knowledge to make that decision you have no need to be reading this section!\n",
"\n",
"Let's just dive in. We start with some differential equation\n",
"Let's dive in. We start with some differential equation\n",
"\n",
"$$\\ddot{y} = \\frac{d}{dt}\\dot{y}$$.\n",
"\n",
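A sketch of a single step of the classic fourth-order Runge Kutta method for $y' = f(t, y)$, using the standard RK4 tableau, applied to the same $y'=y$ test equation used for Euler's method above (my choice of test equation, not the book's):

```python
# One RK4 step: four slope evaluations, combined with weights
# 1/6, 2/6, 2/6, 1/6.
def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2 * k1)
    k3 = f(t + h/2, y + h/2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)

y, t, h = 1.0, 0.0, 0.1
for _ in range(10):              # integrate y' = y from t=0 to t=1
    y = rk4_step(lambda t, y: y, t, y, h)
    t += h
```

Even with this coarse step size the result agrees with $e$ to a few parts in a million, far better than Euler's method with a step one hundred times smaller, which is why RK4 is the default workhorse.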
@@ -2733,7 +2733,7 @@
"\n",
" The Global Positioning System (GPS) is designed so that at least 6 satellites are in view at any time at any point on the globe. The GPS receiver knows the location of the satellites in the sky relative to the Earth. At each epoch (instant in time) the receiver gets a signal from each satellite from which it can derive the *pseudorange* to the satellite. In more detail, the GPS receiver gets a signal identifying the satellite along with the time stamp of when the signal was transmitted. The GPS satellite has an atomic clock on board so this time stamp is extremely accurate. The signal travels at the speed of light, which is constant in a vacuum, so in theory the GPS should be able to produce an extremely accurate distance measurement to the satellite by measuring how long the signal took to reach the receiver. There are several problems with that. First, the signal is not traveling through a vacuum, but through the atmosphere. The atmosphere causes the signal to bend, so it is not traveling in a straight line. This causes the signal to take longer to reach the receiver than theory suggests. Second, the on board clock on the GPS *receiver* is not very accurate, so deriving an exact time duration is nontrivial. Third, in many environments the signal can bounce off of buildings, trees, and other objects, causing either a longer path or *multipaths*, in which case the receiver receives both the original signal from space and the reflected signals. \n",
"\n",
"Let's look at this graphically. I will due this in 2D just to make it easier to graph and see, but of course this will generalize to three dimensions. We know the position of each satellite and the range to each (the range is called the *pseudorange*; we will discuss why later). We cannot measure the range exactly, so there is noise associated with the measurement, which I have depicted with the thickness of the lines. Here is an example of four pseudorange readings from four satellites. I positioned them in a configuration which is unlikely for the actual GPS constellation merely to make the intersections easy to visualize. Also, the amount of error shown is not to scale with the distances, again to make it easier to see."
"Let's look at this graphically. I will do this in 2D to make it easier to graph and see, but of course this will generalize to three dimensions. We know the position of each satellite and the range to each (the range is called the *pseudorange*; we will discuss why later). We cannot measure the range exactly, so there is noise associated with the measurement, which I have depicted with the thickness of the lines. Here is an example of four pseudorange readings from four satellites. I positioned them in a configuration which is unlikely for the actual GPS constellation merely to make the intersections easy to visualize. Also, the amount of error shown is not to scale with the distances, again to make it easier to see."
]
},
{
@@ -2774,13 +2774,13 @@
"\n",
"$$\\delta \\mathbf{z}^-= \\mathbf{z} - h(\\mathbf{x}^-)$$\n",
"\n",
"where $\\mathbf{z}$ is the measurement, $h(\\bullet)$ is the measurement function, and $\\delta \\mathbf{z}^-$ is the innovation, which we abbreviate as $y$ in FilterPy. I don't use the $\\mathbf{x}^-$ symbology often, but it is the prediction for the state variable. In other words, this is just the equation $\\mathbf{y} = \\mathbf{z} - \\mathbf{Hx}$ in the linear Kalman filter's update step.\n",
"where $\\mathbf{z}$ is the measurement, $h(\\bullet)$ is the measurement function, and $\\delta \\mathbf{z}^-$ is the innovation, which we abbreviate as $y$ in FilterPy. I don't use the $\\mathbf{x}^-$ symbology often, but it is the prediction for the state variable. In other words, this is the equation $\\mathbf{y} = \\mathbf{z} - \\mathbf{Hx}$ in the linear Kalman filter's update step.\n",
"\n",
"Next, the *measurement residual* is\n",
"\n",
"$$\\delta \\mathbf{z}^+ = \\mathbf{z} - h(\\mathbf{x}^+)$$\n",
"\n",
"I don't use the plus superscript much because I find it quickly makes the equations unreadable, but $\\mathbf{x}^+$ it is just the *a posteriori* state estimate, which is just the predicted or unknown future state. In other words, the predict step of the linear Kalman filter computes this value. Here it is stands for the value of x which the ILS algorithm will compute on each iteration.\n",
"I don't use the plus superscript much because I find it quickly makes the equations unreadable, but $\\mathbf{x}^+$ is the *a posteriori* state estimate - the estimate after the measurement has been incorporated. In other words, the update step of the linear Kalman filter computes this value. Here it stands for the value of x which the ILS algorithm will compute on each iteration.\n",
"\n",
"These equations give us the following linear algebra equation:\n",
"\n",
@@ -2931,7 +2931,7 @@
"$${\\delta \\mathbf{x}} = {{(\\mathbf{H}^\\mathsf{T}\\mathbf{H})^{-1}}\\mathbf{H}^\\mathsf{T} \\delta \\mathbf{z}^-}\n",
"$$\n",
"\n",
"First, we have to compute $\\mathbf{H}$, where $\\mathbf{H} = d\\mathbf{z}/d\\mathbf{x}$. Just to keep the example small so the results are easier to interpret we will do this in 2D. Therefore for $n$ satellites $\\mathbf{H}$ expands to\n",
"First, we have to compute $\\mathbf{H}$, where $\\mathbf{H} = d\\mathbf{z}/d\\mathbf{x}$. To keep the example small so the results are easier to interpret we will do this in 2D. Therefore for $n$ satellites $\\mathbf{H}$ expands to\n",
"\n",
"$$\\mathbf{H} = \\begin{bmatrix}\n",
"\\frac{\\partial p_1}{\\partial x_1} & \\frac{\\partial p_1}{\\partial y_1} \\\\\n",
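The least squares step above can be sketched directly in NumPy. The $\mathbf{H}$ and $\delta \mathbf{z}$ values here are invented stand-ins for the pseudorange Jacobian and residuals, purely to show the linear algebra:

```python
import numpy as np

# Solve the overdetermined system H @ dx ~= dz in the least
# squares sense. Three "satellites", two unknowns (x, y).
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])
dz = np.array([1.0, 2.0, 2.2])

# normal equations form, exactly as written in the text
dx = np.linalg.inv(H.T @ H) @ H.T @ dz

# in practice a solver such as lstsq is preferred for stability
dx_lstsq, *_ = np.linalg.lstsq(H, dz, rcond=None)
```

Both forms give the same correction `dx`; forming `inv(H.T @ H)` explicitly is fine for a sketch but numerically fragile when `H` is poorly conditioned.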
@@ -868,7 +868,7 @@
"\n",
"This has two benefits. First, there is no process model. You do not need to predict your system state in the next time step. This is of huge value if you do not have a good model for the behavior of the system. For example, it is hard to predict the behavior of a person in a crowd for very long. You don't know where they are going, and they are endlessly being diverted and jostled by the crowd. Second, the algorithm can track arbitrarily many objects at once - some particles will match the behavior of one object, and other particles will match other objects. So this technique is often used to track automobile traffic, people in crowds, and so on. \n",
"\n",
"The costs should be clear. It is computationally expensive to test tens of thousands of points for every step in the filter. But modern CPUs are very fast, and this is a perfect problem for GPUs because the algorithm is highly parallelizable. If we do have the ability to use a process model that is important information that is just being ignored by the filter - you rarely get better results by ignoring information. Of course, there are hybrid filters that do use a process model. Then, the answer is not mathematical. With a Kalman filter my covariance matrix gives me important information about the amount of error in the estimate. The particle filter does not give me a rigorous way to compute this. Finally, the output of the filter is a cloud of points; I then have to figure out how to interpret it. Usually you will be doing something like taking the mean of the points, but this is a difficult problem. There are still many points that do not 'belong' to a tracked object, so you first have to run some sort of clustering algorithm to first find the points that seem to be tracking an object, and then you need another algorithm to produce an state estimate from those points. None of this is intractable, but it is all quite computationally expensive. \n",
"The costs should be clear. It is computationally expensive to test tens of thousands of points for every step in the filter. But modern CPUs are very fast, and this is a perfect problem for GPUs because the algorithm is highly parallelizable. If we do have the ability to use a process model, that is important information that is being ignored by the filter - you rarely get better results by ignoring information. Of course, there are hybrid filters that do use a process model. Next, the answer is not mathematical. With a Kalman filter my covariance matrix gives me important information about the amount of error in the estimate. The particle filter does not give me a rigorous way to compute this. Finally, the output of the filter is a cloud of points; I then have to figure out how to interpret it. Usually you will be doing something like taking the mean of the points, but this is a difficult problem. There are still many points that do not 'belong' to a tracked object, so you first have to run some sort of clustering algorithm to find the points that seem to be tracking an object, and then you need another algorithm to produce a state estimate from those points. None of this is intractable, but it is all quite computationally expensive. \n",
"\n",
"\n",
"Finally, we have a new algorithm called the *unscented Kalman filter* (UKF) that does not require you to find analytic solutions to nonlinear equations, and yet almost always performs better than the EKF. It does especially well with highly nonlinear problems - problems where the EKF has significant difficulties. Designing the filter is extremely easy. Some will say the jury is still out, but to my mind the UKF is superior in almost every way to the EKF, and should be the starting point for any implementation, especially if you are not a Kalman filter professional with a graduate degree in the relevant mathematical techniques. The main downside is that the UKF can be a few times slower than the EKF, but this really depends on whether the EKF solves the Jacobian analytically or numerically. If numerically the UKF is almost certainly faster. Finally, it has not been proven (and probably it cannot be proven) that the UKF always yields more accurate results than the EKF. In practice it almost always does, often significantly so. It is very easy to understand and implement, and I strongly suggest this technique as your starting point. "
@@ -889,16 +889,21 @@
"\n",
"Until recently the linearized Kalman filter and EKF have been the standard way to solve these problems. They are very difficult to understand and use, and they are also potentially very unstable. \n",
"\n",
"Recent developments have offered what are to my mind superior approaches. The UKF dispenses with the need to find solutions to partial differential equations, but it is usually more accurate than the EKF. It is easy to use and understand. I can get a basic UKF going in just a few minutes by using FilterPy. The particle filter dispenses with mathimatical modeling completely in favor of a Monte Carlo technique of generating a random cloud of thousands of points. It runs slowly, but it can solve otherwise intractable problems with relative ease.\n",
"Recent developments have offered what are to my mind superior approaches. The UKF dispenses with the need to find solutions to partial differential equations, yet it is usually more accurate than the EKF. It is easy to use and understand. I can get a basic UKF going in a few minutes by using FilterPy. The particle filter dispenses with mathematical modeling completely in favor of a Monte Carlo technique of generating a random cloud of thousands of points. It runs slowly, but it can solve otherwise intractable problems with relative ease.\n",
"\n",
"I get more email about the EKF than anything else; I suspect that this is because most treatments in books, papers, and on the internet use the EKF. If your interest is in mastering the field of course you will want to learn about the EKF. But if you are just trying to get good results I point you to the UKF and particle filter first. They are so much easier to implement, understand, and use, and they are typically far more stable than the EKF. \n",
"I get more email about the EKF than anything else; I suspect that this is because most treatments in books, papers, and on the internet use the EKF. If your interest is in mastering the field of course you will want to learn about the EKF. But if you are just trying to get good results I point you to the UKF and particle filter first. They are much easier to implement, understand, and use, and they are typically far more stable than the EKF. \n",
"\n",
"Some will quibble with that advice. A lot of recent publications are devoted to a comparison of the EKF, UKF, and perhaps a few other choices for a given problem. Do you not need to perform a similar comparison for your problem? If you are sending a rocket to Mars, then of course you do. You will be balancing issues such as accuracy, round off errors, divergence, mathematical proof of correctness, and the computational effort required. I can't imagine not knowing the EKF intimately. \n",
"\n",
"On the other hand the UKF works spectacularly! I use it at work for real world applications. I mostly haven't even tried to implement an EKF for these applications because I can verify that the UKF is working fine. Is it possible that I might eke out another 0.2% of performance from the EKF in certain situations? Sure! Do I care? No! I completely understand the UKF implementation, it is easy to test and verify, I can pass the code to others and be confident that they can understand and modify it, and I am not a masochist who wants to battle difficult equations when I already have a working solution. If the UKF or particle filters start to perform poorly for some problem then I will turn to other techniques, but not before then. And realistically, the UKF usually provides substantially better performance than the EKF over a wide range of problems and conditions. If \"really good\" is good enough I'm going to spend my time working on other problems. \n",
"\n",
"I'm belaboring this point because in most textbooks the EKF is given center stage, and the UKF is either not mentioned at all or just given a 2 page gloss that leaves you completely unprepared to use the filter. This is not due to ignorance on the writer's part. The UKF is still relatively new, and it takes time to write new editions of books. At the time many books were written the UKF was either not discovered yet, or it was just an unproven but promising curiosity. But I am writing this now, the UKF has had enormous success, and it needs to be in your toolkit. That is what I will spend most of my effort trying to teach you. \n",
"\n",
"I'm belaboring this point because in most textbooks the EKF is given center stage, and the UKF is either not mentioned at all or just given a 2 page gloss that leaves you completely unprepared to use the filter. This is not due to ignorance on the writer's part. The UKF is still relatively new, and it takes time to write new editions of books. At the time many books were written the UKF was either not discovered yet, or it was just an unproven but promising curiosity. But I am writing this now, the UKF has had enormous success, and it needs to be in your toolkit. That is what I will spend most of my effort trying to teach you. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[1] https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/Supporting_Notebooks/Computing_and_plotting_PDFs.ipynb"
@@ -468,7 +468,7 @@
"source": [
"Let's consider the simplest possible case and see if it offers any insight. The simplest possible system is the *identity function*. In mathematical notation this is $f(x) = x$. It should be clear that if our algorithm does not work for the identity function then the filter will never converge. In other words, if the input is 1 (for a one dimensional system), the output must also be 1. If the output was different, such as 1.1, then when we fed 1.1 into the transform at the next time step, we'd get out yet another number, maybe 1.23. The filter would run away (diverge). \n",
"\n",
"The fewest number of points that we can use is one per dimension. This is the number that the linear Kalman filter uses. The input to a Kalman filter for the distribution $\\mathcal{N}(\\mu,\\sigma^2)$ is just $\\mu$ itself. So while this works for the linear case, it is not a good answer for the nonlinear case.\n",
"The fewest number of points that we can use is one per dimension. This is the number that the linear Kalman filter uses. The input to a Kalman filter for the distribution $\\mathcal{N}(\\mu,\\sigma^2)$ is $\\mu$ itself. So while this works for the linear case, it is not a good answer for the nonlinear case.\n",
"\n",
"Perhaps we can use one point per dimension, but altered somehow. However, if we were to pass some value $\\mu+\\Delta$ into the identity function $f(x)=x$ it would not converge, so this is not a possible algorithm. We must conclude that a one point sample will not work.\n",
"\n",
@@ -561,7 +561,7 @@
"## Van der Merwe's Scaled Sigma Point Algorithm\n",
"There are several published ways for selecting the sigma points for the Unscented Kalman filter, and you are free to invent your own. However, since 2005 or so research and industry have mostly settled on the version published by Rudolph Van der Merwe in his 2004 PhD dissertation [1] because it performs well with a variety of problems and it has a good tradeoff between performance and accuracy. It is a slight reformulation of the *Scaled Unscented Transform* published by Simon J. Julier [2].\n",
"\n",
"Before we work through the derivation, let's just look at an example. I will plot the sigma points on top of a covariance ellipse showing the first and second standard deviations, and size them based on the weights assigned to them."
"Before we work through the derivation, let's look at an example. I will plot the sigma points on top of a covariance ellipse showing the first and second standard deviations, and size them based on the weights assigned to them."
]
},
{
@@ -711,7 +711,7 @@
"\\end{aligned}\n",
"$$\n",
"\n",
"These equations should be familar - they are just the constraint equations we developed above. They perhaps don't look like much, but let's see an example of their power.\n",
"These equations should be familiar - they are the constraint equations we developed above. They perhaps don't look like much, but let's see an example of their power.\n",
"\n",
"Earlier we wrote a function that found the mean of a distribution by passing 50,000 points through a nonlinear function. Let's now use 5 sigma points - we will pass them through the nonlinear function, and compute their mean with the unscented transform. This code is below. Under the comment `### create sigma points` I use code from FilterPy to generate the sigma points. It uses functionality which you will learn later in this chapter; pass by it for now. The key points are we generate 5 points deterministically, pass them through the nonlinear function, and then run the unscented transform on them to generate the estimated mean."
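The procedure described above can also be sketched from scratch with NumPy. This is a hedged illustration of Van der Merwe's sigma point generation and the unscented transform, not FilterPy's actual implementation; the nonlinear function and parameter values here are arbitrary choices for the example:

```python
import numpy as np

def merwe_sigma_points(mu, P, alpha=0.3, beta=2.0, kappa=1.0):
    # Van der Merwe's scaled sigma points: 2n+1 points and their weights
    n = len(mu)
    lam = alpha**2 * (n + kappa) - n
    U = np.linalg.cholesky((n + lam) * P)  # one valid matrix square root
    sigmas = np.vstack([mu] +
                       [mu + U[:, i] for i in range(n)] +
                       [mu - U[:, i] for i in range(n)])
    Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = Wm[0] + 1 - alpha**2 + beta
    return sigmas, Wm, Wc

def unscented_transform(sigmas_f, Wm, Wc):
    # recover the mean and covariance of the transformed sigma points
    mean = Wm @ sigmas_f
    resid = sigmas_f - mean
    cov = (Wc * resid.T) @ resid
    return mean, cov

mu = np.array([0.0, 1.0])
P = np.array([[1.0, 0.3], [0.3, 2.0]])
sigmas, Wm, Wc = merwe_sigma_points(mu, P)          # 5 points for n=2
f = lambda s: np.array([np.sin(s[0]), s[1] ** 2])   # arbitrary nonlinear function
ys = np.array([f(s) for s in sigmas])
mean, cov = unscented_transform(ys, Wm, Wc)
```

Note that for the identity function the transform recovers the original mean and covariance exactly, which is the convergence property discussed earlier.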
]
@@ -796,7 +796,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can present the entire Unscented Kalman filter algorithm. As discussed earlier assume that there is a function $f(x, dt)$ that performs the state transition for our filter - it predicts the next state given the current state. Also assume there is a measurement function $h(x)$ - it takes the current state and computes what measurement that state corresponds to. These are just nonlinear forms of the $\\mathbf{F}$ and $\\mathbf{H}$ matrices used by the linear Kalman filter."
"Now we can present the entire Unscented Kalman filter algorithm. As discussed earlier assume that there is a function $f(x, dt)$ that performs the state transition for our filter - it predicts the next state given the current state. Also assume there is a measurement function $h(x)$ - it takes the current state and computes what measurement that state corresponds to. These are nonlinear forms of the $\\mathbf{F}$ and $\\mathbf{H}$ matrices used by the linear Kalman filter."
]
},
{
@@ -903,12 +903,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to consider implementing an unscented Kalman filter. All of the math is above is already implemented in FilterPy, and you are perhaps a bit lost at this point, so lets just launch into solving some problems so you can gain confidence in how easy the UKF actually is. Later we will revisit how the UKF is implemented in Python.\n",
"We are now ready to consider implementing an unscented Kalman filter. All of the math above is already implemented in FilterPy, and you are perhaps a bit lost at this point, so let's launch into solving some problems so you can gain confidence in how easy the UKF actually is. Later we will revisit how the UKF is implemented in Python.\n",
"\n",
"\n",
"Let's start by solving a problem you already know how to do. Although the UKF was designed for nonlinear problems, it works fine on linear problems. In fact, it is guaranteed to get the same result as the linear Kalman filter for linear problems. We will write a solver for the linear problem of tracking using a constant velocity model in 2D. This will allow us to focus on what is the same (and most is the same!) and what is different with the UKF. \n",
"\n",
"To design a linear Kalman filter you need to design the $\\bf{x}$, $\\bf{F}$, $\\bf{H}$, $\\bf{R}$, and $\\bf{Q}$ matrices. We have done this many times already so let me just present a design to you without a lot of comment. We want a constant velocity model, so we can define $\\bf{x}$ to be\n",
"To design a linear Kalman filter you need to design the $\\bf{x}$, $\\bf{F}$, $\\bf{H}$, $\\bf{R}$, and $\\bf{Q}$ matrices. We have done this many times already so let me present a design to you without a lot of comment. We want a constant velocity model, so we can define $\\bf{x}$ to be\n",
"\n",
"$$ \\mathbf{x} = \\begin{bmatrix}x & \\dot{x} & y & \\dot{y} \\end{bmatrix}^\\mathsf{T}$$.\n",
"\n",
@@ -997,7 +997,7 @@
"\n",
"The equations of the UKF are implemented for you with the `FilterPy` class `UnscentedKalmanFilter`; all you have to do is specify the matrices and the nonlinear functions `f(x)` and `h(x)`. `f(x)` implements the state transition function that is implemented by the matrix $\\mathbf{F}$ in the linear filter, and `h(x)` implements the measurement function implemented with the matrix $\\mathbf{H}$. In nonlinear problems these functions are nonlinear, so we cannot use matrices to specify them.\n",
"\n",
"For our linear problem we can just define these functions to implement the linear equations. The function `f(x)` takes the signature `def f(x,dt)` and `h(x)` takes the signature `def h(x)`. Below is a reasonable implementation of these two functions. Each is expected to return a 1-D NumPy array with the result."
"For our linear problem we can define these functions to implement the linear equations. The function `f(x)` takes the signature `def f(x,dt)` and `h(x)` takes the signature `def h(x)`. Below is a reasonable implementation of these two functions. Each is expected to return a 1-D NumPy array with the result."
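One reasonable sketch of these two functions for the constant velocity design above (hedged; details may differ from the implementation the notebook gives):

```python
import numpy as np

def f_cv(x, dt):
    # state transition for x = [x, x_vel, y, y_vel]: constant velocity
    F = np.array([[1.0, dt,  0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, dt],
                  [0.0, 0.0, 0.0, 1.0]])
    return F @ x

def h_cv(x):
    # measurement function: we measure position only
    return x[[0, 2]]
```

Both return 1-D NumPy arrays, as the `UnscentedKalmanFilter` class expects.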
]
},
{
@@ -2572,7 +2572,7 @@
"\n",
"The Python for this is not much more difficult once we wrap our heads around the $[\\sqrt{(n+\\kappa)\\Sigma}]_i$ term.\n",
"\n",
"The term $[\\sqrt{(n+\\kappa)\\Sigma}]_i$ has to be a matrix because $\\Sigma$ is a matrix. The subscript $i$ is choosing the column vector of the matrix. What is the square root of a matrix? There is no unique definition. A typical definition is that the square root of a matrix $\\Sigma$ is just the matrix $S$ that, when multiplied by itself, yields $\\Sigma$.\n",
"The term $[\\sqrt{(n+\\kappa)\\Sigma}]_i$ has to be a matrix because $\\Sigma$ is a matrix. The subscript $i$ is choosing the column vector of the matrix. What is the square root of a matrix? There is no unique definition. A typical definition is that the square root of a matrix $\\Sigma$ is the matrix $S$ that, when multiplied by itself, yields $\\Sigma$.\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
@@ -2791,9 +2791,7 @@
"\n",
" zs = [some_func(i) for i in range(1000)]\n",
" \n",
"Just do whatever it takes to get your measurements in an array.\n",
"\n",
"Now we will just call the `batch_filter()` method.\n",
"Now we call the `batch_filter()` method.\n",
"\n",
" Xs, Ps = ukf.batch_filter(zs)\n",
" \n",
@@ -3011,7 +3009,7 @@
"source": [
"**Author's note: this entire section needs a lot of work. Ignore for now.**\n",
"\n",
"I have found the literature on choosing values for $\\alpha$, $\\beta$, and $\\kappa$ to be rather lacking. Van der Merwe's dissertation contains the most information, but it is not exhaustive. So let's just explore what they do. \n",
"I have found the literature on choosing values for $\\alpha$, $\\beta$, and $\\kappa$ to be rather lacking. Van der Merwe's dissertation contains the most information, but it is not exhaustive. So let's explore what they do. \n",
"\n",
"Van der Merwe suggests using $\\beta=2$ for Gaussian problems, and $\\kappa=3-n$. So let's start there and vary $\\alpha$. I will let $n=1$ to minimize the size of the arrays we need to look at and to avoid having to compute the square root of matrices."
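As a numeric aside, here is a hedged from-scratch sketch (restating the sigma point formulas in NumPy) showing that with $n=1$ and $\kappa=2$ the spread of the sigma points grows linearly with $\alpha$:

```python
import numpy as np

def sigma_points_1d(mu, var, alpha, kappa=2.0):
    # n = 1 and kappa = 3 - n = 2; the point spread scales with alpha
    n = 1
    lam = alpha**2 * (n + kappa) - n
    spread = np.sqrt((n + lam) * var)
    return np.array([mu, mu + spread, mu - spread])

narrow = sigma_points_1d(0.0, 1.0, alpha=0.3)  # points at 0, +-0.3*sqrt(3)
wide = sigma_points_1d(0.0, 1.0, alpha=1.0)    # points at 0, +-sqrt(3)
```

With $n=1$ and $\kappa=2$ the offset works out to exactly $\alpha\sqrt{3}$ standard deviations, so $\alpha$ directly controls how far from the mean we sample.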
]
@@ -3091,7 +3089,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the sigma point spread over 100 standard deviations.If our data was Gaussian we'd be incorporating data many standard deviations away from the mean; for nonlinear problems this is unlikely to produce good results. But suppose our distribution was not Gaussian, but instead had very fat tails? We might need to sample from those tails to get a good estimate, and hence it would make sense to make $kappa$ larger (not 200, which was absurdly large just to make the change in the sigma points stark). \n",
"We can see that the sigma points spread over 100 standard deviations. If our data was Gaussian we'd be incorporating data many standard deviations away from the mean; for nonlinear problems this is unlikely to produce good results. But suppose our distribution was not Gaussian, but instead had very fat tails? We might need to sample from those tails to get a good estimate, and hence it would make sense to make $\\kappa$ larger (not 200, which was absurdly large to make the change in the sigma points stark). \n",
"\n",
"With a similar line of reasoning, suppose that our distribution has nearly no tails - the probability distribution looks more like an inverted parabola. In such a case we'd probably want to pull the sigma points in closer to the mean to avoid sampling in regions where there will never be real data.\n",
"\n",
@@ -3390,7 +3388,7 @@
"\\end{bmatrix} &+ \\mathcal{N}(0, R)\n",
"\\end{aligned}$$\n",
"\n",
"I will not implement this in Python just yet as there is a difficulty that will be discussed in the *Implementation* section below."
"I will not implement this in Python yet as there is a difficulty that will be discussed in the *Implementation* section below."
]
},
{
@@ -3789,7 +3787,7 @@
"\n",
"Your impression of this chapter probably depends on how many nonlinear Kalman filters you have implemented in the past. If this is your first exposure perhaps the computation of $2n+1$ sigma points and the subsequent writing of the $f(x)$ and $h(x)$ functions struck you as a bit finicky. Indeed, I spent more time than I'd care to admit getting everything working. On the other hand, if you have implemented an EKF or linearized Kalman filter you are perhaps bouncing gleefully in your seat. There is a small amount of tedium in writing the functions for the UKF, but the concepts are very basic. As you will see in the next chapter the EKF for the same problems requires some fairly difficult mathematics. In fact, for many problems we cannot find a closed form solution for the equations of the EKF, and we must retreat to some sort of iterated solution.\n",
"\n",
"However, the main advantage of the UKF over the EKF is not just the relative ease of implementation. It is somewhat premature to discuss this because you haven't learned the EKF yet, but the EKF linearizes the problem at one point and passes that point through a linear Kalman filter. In contrast, the UKF takes $2n+1$ samples. Therefore the UKF is almost always more accurate than the EKF, especially when the problem is highly nonlinear. While it is not true that the UKF is guaranteed to always outperform the EKF, in practice it has been shown to perform at least as well, and usually much better than the EKF. \n",
"However, the advantage of the UKF over the EKF is not only the relative ease of implementation. It is somewhat premature to discuss this because you haven't learned the EKF yet, but the EKF linearizes the problem at one point and passes that point through a linear Kalman filter. In contrast, the UKF takes $2n+1$ samples. Therefore the UKF is almost always more accurate than the EKF, especially when the problem is highly nonlinear. While it is not true that the UKF is guaranteed to always outperform the EKF, in practice it has been shown to perform at least as well, and usually much better than the EKF. \n",
"\n",
"Hence my recommendation is to always start by implementing the UKF. If your filter has real world consequences if it diverges (people die, lots of money lost, power plant blows up) of course you will have to engage in a lot of sophisticated analysis and experimentation to choose the best filter. That is beyond the scope of this book, and you should be going to graduate school to learn this theory in much greater detail than this book provides. \n",
"\n",
@@ -404,7 +404,7 @@
"\n",
"For nonlinear problems our function $f()$ is a set of differential equations. Modeling physical systems with differential equations is well outside the scope of this book. You will need to be reasonably well versed in this branch of applied mathematics to successfully implement the EKF for your problem. If you have not read it yet, please read the section **Modeling Dynamic Systems** in the **Kalman Filter Math** chapter as it contains the math that you will need to complete this chapter.\n",
"\n",
"I think the easiest way to understand the EKF is to just start off with an example. Perhaps the reason for some of my mathematical choices will not be clear, but trust that the end result will be an EKF."
"I think the easiest way to understand the EKF is to start off with an example. Perhaps the reason for some of my mathematical choices will not be clear, but trust that the end result will be an EKF."
]
},
{
@@ -633,7 +633,7 @@
"\\frac{x_{alt}}{\\sqrt{x_{pos}^2 + x_{alt}^2}}\n",
"\\end{bmatrix}$$\n",
"\n",
"This may seem daunting, so step back and recognize that all of this math is just doing something very simple. We have an equation for the slant range to the airplane which is nonlinear. The Kalman filter only works with linear equations, so we need to find a linear equation that approximates $\\mathbf{H}$ As we discussed above, finding the slope of a nonlinear equation at a given point is a good approximation. For the Kalman filter, the 'given point' is the state variable $\\mathbf{x}$ so we need to take the derivative of the slant range with respect to $\\mathbf{x}$. \n",
"This may seem daunting, so step back and recognize that all of this math is doing something very simple. We have an equation for the slant range to the airplane which is nonlinear. The Kalman filter only works with linear equations, so we need to find a linear equation that approximates $\\mathbf{H}$. As we discussed above, finding the slope of a nonlinear equation at a given point is a good approximation. For the Kalman filter, the 'given point' is the state variable $\\mathbf{x}$ so we need to take the derivative of the slant range with respect to $\\mathbf{x}$. \n",
"\n",
"To make this more concrete, let's now write a Python function that computes the Jacobian of $\\mathbf{H}$. The `ExtendedKalmanFilter` class will be using this to generate `ExtendedKalmanFilter.H` at each step of the process."
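As a hedged sketch, assuming the state is laid out as [position, velocity, altitude] (that layout is an assumption made here for illustration), such a pair of functions might look like:

```python
import numpy as np

def hx(x):
    # slant range from the radar to the aircraft
    return np.sqrt(x[0]**2 + x[2]**2)

def HJacobian_at(x):
    # Jacobian of the slant range measurement with respect to the state;
    # velocity does not appear in the measurement, so its partial is zero
    pos, alt = x[0], x[2]
    denom = np.sqrt(pos**2 + alt**2)
    return np.array([[pos / denom, 0.0, alt / denom]])
```

The Jacobian is a 1x3 row matrix because there is one measurement and three state variables.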
]
@@ -731,7 +731,7 @@
"source": [
"Now we can implement our filter. I have not yet designed $\\mathbf{R}$ and $\\mathbf{Q}$ which is required to get optimal performance. However, we have already covered a lot of confusing material and I want you to see concrete examples as soon as possible. Therefore I will use 'reasonable' values for $\\mathbf{R}$ and $\\mathbf{Q}$.\n",
"\n",
"The `FilterPy` library provides the class `ExtendedKalmanFilter`. It works very similar to the `KalmanFilter` class we have been using, except that it allows you to provide functions that compute the Jacobian of $\\mathbf{H}$ and the function $h(\\mathbf{x})$. We have already written the code for these two functions, so let's just get going.\n",
"The `FilterPy` library provides the class `ExtendedKalmanFilter`. It works very similar to the `KalmanFilter` class we have been using, except that it allows you to provide functions that compute the Jacobian of $\\mathbf{H}$ and the function $h(\\mathbf{x})$. We have already written the code for these two functions, so let's get going.\n",
"\n",
"We start by importing the filter and creating it. There are 3 variables in `x` and only 1 measurement. At the same time we will create our radar simulator.\n",
"\n",
@@ -1102,7 +1102,7 @@
"\n",
"$$f(x, u) \\approx \\mathbf{x} + \\frac{\\partial f(x, u)}{\\partial x}$$\n",
"\n",
"We replace $f(x, u)$ with our state estimate $\\mathbf{x}$, and the derivative is just the Jacobian of $f$."
"We replace $f(x, u)$ with our state estimate $\\mathbf{x}$, and the derivative is the Jacobian of $f$."
]
},
{
@@ -1199,7 +1199,7 @@
"\\frac{\\partial \\dot{\\theta}}{\\partial v} & \\frac{\\partial \\dot{\\theta}}{\\partial \\alpha}\n",
"\\end{bmatrix}$$\n",
"\n",
"Let's just compute that with SymPy:"
"Let's compute that with SymPy:"
]
},
{
@@ -358,7 +358,7 @@
"This technique of using a finite number of randomly sampled points to compute a result is called a **Monte Carlo** (MC) method. The idea is simple. Generate *enough* points to get a representative sample of the problem, run the points through the system you are modeling, and then compute the results on the transformed points. \n",
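A minimal numeric sketch of the idea, with an arbitrary input distribution and nonlinear function chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
xs = rng.normal(0.0, 1.0, 100_000)   # sample the input distribution
ys = np.sin(xs)                      # run every point through the system
mc_mean = ys.mean()                  # compute statistics on the transformed points
```

Because sine is an odd function and the input is symmetric about zero, the Monte Carlo mean should be near zero, and it gets closer as the number of points grows.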
"\n",
"\n",
"In a nutshell this is what particle filtering is. A bit later I'll demonstrate how MC can integrate over a probability distribution, but you don't need that formality to understand how they work. It's very simple. It is just the Bayesian filter algorithm we have been using throughout the book applied to thousands of particles, where each particle represents a *possible* state for the system. We extract the estimated state from the thousands of particles using weighted statistics of the particles.\n",
"In a nutshell this is what particle filtering is. A bit later I'll demonstrate how MC can integrate over a probability distribution, but you don't need that formality to understand how they work. It's very simple. It is the Bayesian filter algorithm we have been using throughout the book applied to thousands of particles, where each particle represents a *possible* state for the system. We extract the estimated state from the thousands of particles using weighted statistics of the particles.\n",
"\n",
"\n",
"**Generic Particle Filter Algorithm**\n",
@@ -709,7 +709,7 @@
"\n",
"Here I use the notation $x^i$ to indicate the i$^{th}$ particle. A superscript is used because we often need subscripts to denote time steps, such as the k$^{th}$ or k+1$^{th}$ step, which yields the unwieldy $x^i_{k+1}$. I'm not partial to this notation but it is what is used in the literature.\n",
"\n",
"This equation seems to be reasonable, but is it? I will discuss this in the *Importance Sampling* section below. For now, just know that the answer is yes, it is reasonable. "
"This equation seems to be reasonable, but is it? I will discuss this in the *Importance Sampling* section below. For now, know that the answer is yes, it is reasonable. "
]
},
{
@@ -926,9 +926,9 @@
"source": [
"### Effect of Sensor Errors on the Filter\n",
"\n",
"The first few iterations of the filter resulted in many duplicate particles. This happens because the model for the sensors is Gaussian, and we gave it a small standard deviation of $\\sigma=0.1$. This is counterintuitive at first. The Kalman filter performs *better* when the noise is smaller, yet the particle filter can perform worse. We can reason why this is true. The standard deviation is just 0.1. This means that if the robot is at (1, 1) and a particle is at (2, 2) the particle is 14 standard deviations away from the sensor. This gives it a near zero probability, and it is extremely unlikely to survive after the resampling. \n",
"The first few iterations of the filter resulted in many duplicate particles. This happens because the model for the sensors is Gaussian, and we gave it a small standard deviation of $\\sigma=0.1$. This is counterintuitive at first. The Kalman filter performs *better* when the noise is smaller, yet the particle filter can perform worse. We can reason why this is true. The standard deviation is 0.1. This means that if the robot is at (1, 1) and a particle is at (2, 2) the particle is 14 standard deviations away from the sensor. This gives it a near zero probability, and it is extremely unlikely to survive after the resampling. \n",
"\n",
"This is *very important* to understand - a very accurate sensor can lead to poor performance of the filter because few of the particles will be a good sample of the probability distribution. There are a few fixes available to us. First, we can artificially increase the sensor noise standard deviation so the particle filter will accept more points as matching the robots probability distribution. This is non-optimal because some of those points will be a poor match. The real problem is that there just isn't enough points being generated such that enough are near the robot. Increasing `N` usually fixes this problem. This decision is not cost free as increasing the number of particles significantly increase the computation time. Still, let's look at the result of using 100,000 particles."
"This is *very important* to understand - a very accurate sensor can lead to poor performance of the filter because few of the particles will be a good sample of the probability distribution. There are a few fixes available to us. First, we can artificially increase the sensor noise standard deviation so the particle filter will accept more points as matching the robot's probability distribution. This is non-optimal because some of those points will be a poor match. The real problem is that there aren't enough points being generated such that enough are near the robot. Increasing `N` usually fixes this problem. This decision is not cost free as increasing the number of particles significantly increases the computation time. Still, let's look at the result of using 100,000 particles."
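The 14-standard-deviation claim is easy to check numerically; a hedged sketch using an unnormalized Gaussian likelihood as the particle weight:

```python
import numpy as np

sigma = 0.1
robot = np.array([1.0, 1.0])
particle = np.array([2.0, 2.0])
dist = np.linalg.norm(particle - robot)        # ~1.414, about 14 sigma
weight = np.exp(-dist**2 / (2 * sigma**2))     # unnormalized Gaussian likelihood
```

The resulting weight is on the order of $e^{-100}$, so this particle has essentially no chance of being selected during resampling.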
]
},
{
@@ -1091,7 +1091,7 @@
"\n",
"$$I = \\int f(x)q(x)\\, \\, \\cdot \\, \\frac{\\pi(x)}{q(x)}\\, \\mathsf{d}x$$\n",
"\n",
"$q(x)$ is known to us, so we can compute $\\int f(x)q(x)$ using MC integration. That leaves us with $\\frac{\\pi(x)}{q(x)}$. That is just a ratio, and we define it as a *weight*. This gives us\n",
"$q(x)$ is known to us, so we can compute $\\int f(x)q(x)$ using MC integration. That leaves us with $\\frac{\\pi(x)}{q(x)}$. That is a ratio, and we define it as a *weight*. This gives us\n",
"\n",
"$$I = \\sum\\limits_{i=1}^N f(x^i)w(x^i)$$\n",
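A numeric sketch of this estimator; the target and proposal densities below are arbitrary choices for illustration, with $f(x) = x$ so the integral is the target's mean:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
xs = rng.normal(0.0, 2.0, N)                          # draws from proposal q = N(0, 4)
q = np.exp(-xs**2 / 8.0) / np.sqrt(8 * np.pi)         # proposal density q(x)
p = np.exp(-(xs - 2.0)**2 / 2.0) / np.sqrt(2 * np.pi) # target density pi = N(2, 1)
w = p / q
w /= w.sum()                                          # normalized importance weights
estimate = np.sum(w * xs)                             # approximates E_pi[x] = 2
```

Even though no sample was drawn from the target distribution, the weighted sum recovers its mean.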
"\n",
@@ -1131,7 +1131,7 @@
"\n",
"I will give you a couple of the most commonly used algorithms; with that information you should be able to search and find alternatives should these prove deficient for your application.\n",
"\n",
"FilterPy implements several of the popular algorithms. FilterPy doesn't know how your particle filter is implemented, so it doesn't make sense to have the resample algorithm generate the new samples. Instead, the algorithms just create an ndarray containing the indexes of the weights and particles that are chosen; your class or code needs to perform the resampling step. For example, the robot localization class implements this with\n",
"FilterPy implements several of the popular algorithms. FilterPy doesn't know how your particle filter is implemented, so it doesn't make sense to have the resample algorithm generate the new samples. Instead, the algorithms create an ndarray containing the indexes of the weights and particles that are chosen; your class or code needs to perform the resampling step. For example, the robot localization class implements this with\n",
"\n",
" def resample_from_index(self, indexes):\n",
" assert len(indexes) == self.N\n",
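For reference, the systematic resampling algorithm itself can be sketched in a few lines of NumPy (a hedged sketch; FilterPy's implementation may differ in details):

```python
import numpy as np

def systematic_resample(weights):
    # returns the indexes of the particles chosen by systematic resampling:
    # N evenly spaced positions with one shared random offset are swept
    # across the cumulative sum of the weights
    N = len(weights)
    positions = (np.arange(N) + np.random.uniform()) / N
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0          # guard against floating point round-off
    indexes = np.zeros(N, dtype=int)
    i = j = 0
    while i < N:
        if positions[i] < cumulative[j]:
            indexes[i] = j
            i += 1
        else:
            j += 1
    return indexes
```

A particle holding all of the weight is selected every time, while uniform weights select every particle exactly once.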
@@ -395,7 +395,7 @@
"\n",
"Smoothing filters incorporate future measurements into the estimate for step `k`. The measurement from `k+1` will have the most effect, `k+2` will have less effect, `k+3` less yet, and so on. \n",
"\n",
"This topic is called **smoothing**, but I think that is a misleading name. I could smooth the data above just by passing it through a low pass filter. The result would be smooth, but not necessarily accurate because a low pass filter will remove real variations just as much as it removes noise. In contrast, Kalman smoothers are *optimal* - they incorporate all available information to make the best estimate that is mathematically achievable."
"This topic is called **smoothing**, but I think that is a misleading name. I could smooth the data above by passing it through a low pass filter. The result would be smooth, but not necessarily accurate because a low pass filter will remove real variations just as much as it removes noise. In contrast, Kalman smoothers are *optimal* - they incorporate all available information to make the best estimate that is mathematically achievable."
]
},
{
@@ -494,7 +494,7 @@
" \n",
"This implementation mirrors the implementation provided in FilterPy. It assumes that the Kalman filter is being run externally in batch mode, and the results of the state and covariances are passed in via the `Xs` and `Ps` variable.\n",
"\n",
"Let's just look at an example. "
"Here is an example. "
]
},
{
@@ -711,7 +711,7 @@
"\n",
"$$\\mathbf{x} = \\begin{bmatrix}\\mathbf{x}_k \\\\ \\mathbf{x}_{k-1} \\\\ \\vdots\\\\ \\mathbf{x}_{k-N+1}\\end{bmatrix}$$\n",
"\n",
"This yields a very large covariance matrix that contains the covariance between states at different steps. FilterPy's class `FixedLagSmoother` takes care of all of this computation for you, including creation of the augmented matrices. All you need to do is compose it just as if you are using the `KalmanFilter` class and then call `smooth()`, which implements the predict and update steps of the algorithm. \n",
"This yields a very large covariance matrix that contains the covariance between states at different steps. FilterPy's class `FixedLagSmoother` takes care of all of this computation for you, including creation of the augmented matrices. All you need to do is compose it as if you are using the `KalmanFilter` class and then call `smooth()`, which implements the predict and update steps of the algorithm. \n",
"\n",
"Each call of `smooth` computes the estimate for the current measurement, and also goes back and adjusts the previous `N-1` points. The smoothed values are contained in the list `FixedLagSmoother.xSmooth`. If you use `FixedLagSmoother.x` you will get the most recent estimate, but it is not smoothed and is no different from a standard Kalman filter output."
]
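FilterPy forms the augmented matrices internally, but their block structure is easy to see. Here is one way to build the augmented F and H for lag N, a sketch under my own naming that assumes the newest state sits at the top of the stacked vector:

```python
import numpy as np

def augment(F, H, N):
    """Build augmented transition and measurement matrices for a
    fixed-lag smoother of lag N. Each predict step evolves the newest
    state under F while shifting the older states down one slot."""
    n = F.shape[0]
    F_aug = np.zeros((N * n, N * n))
    F_aug[:n, :n] = F                     # newest state evolves normally
    F_aug[n:, :-n] = np.eye((N - 1) * n)  # older states are delayed copies
    H_aug = np.zeros((H.shape[0], N * n))
    H_aug[:, :n] = H                      # only the newest state is measured
    return F_aug, H_aug
```

The resulting covariance of the augmented filter is an `Nn x Nn` matrix whose off-diagonal blocks are exactly the cross-step covariances mentioned above.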
@@ -276,7 +276,7 @@
"source": [
"So far we have considered the problem of tracking objects that are well behaved in relation to our process model. For example, we can use a constant velocity model to track an object moving in a straight line. So long as the object moves in a straight line at a reasonably constant speed, or varies its track and/or velocity very slowly, this filter will perform very well. Suppose instead that we are trying to track a maneuvering target, by which I mean an object with control inputs, such as a car along a road, an aircraft in flight, and so on. In these situations the filters perform quite poorly. Alternatively, consider a situation such as tracking a sailboat in the ocean. Even if we model the control inputs we have no way to model the wind or the ocean currents. \n",
"\n",
"A first order approach to this problem is just to make the process noise $\\mathbf{Q}$ larger to account for the unpredictability of the system dynamics. While this can *work* in the sense of providing a non-diverging filter, the result is typically far from optimal. The larger $\\mathbf{Q}$ results in the filter giving more emphasis to the noise in the measurements. We will see an example of this shortly.\n",
"A first order approach to this problem is to make the process noise $\\mathbf{Q}$ larger to account for the unpredictability of the system dynamics. While this can *work* in the sense of providing a non-diverging filter, the result is typically far from optimal. The larger $\\mathbf{Q}$ results in the filter giving more emphasis to the noise in the measurements. We will see an example of this shortly.\n",
"\n",
"So in this chapter we will discuss the concept of an *adaptive filter*. This means about what it sounds like. The filter will *adapt* itself when it detects dynamics that the process model cannot account for. I will start with an example of the problem, and then discuss and implement various adaptive filters."
]
@@ -292,7 +292,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So let's begin by writing a simulation of a maneuvering target. We are not interested in modeling anything with high fidelity, nor do we really care about 3 dimensions, so I will just implement a simple 2D model that you can provide steering inputs into. You can provide a new speed and/or direction, and it will modify its state to match."
"So let's begin by writing a simulation of a maneuvering target. We are not interested in modeling anything with high fidelity, nor do we really care about 3 dimensions, so I will implement a simple 2D model that you can provide steering inputs into. You can provide a new speed and/or direction, and it will modify its state to match."
]
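A bare-bones version of such a simulator might look like the following. The class and method names are my own, and for simplicity commands here take effect instantly rather than ramping in gradually:

```python
import math

class ManeuveringTarget2D:
    """Toy 2D target: moves at constant speed and heading until
    commanded otherwise. Illustrative only, not the book's code."""
    def __init__(self, x0=0., y0=0., heading=0., speed=1.):
        self.x, self.y = x0, y0
        self.heading = heading   # radians
        self.speed = speed

    def command(self, speed=None, heading=None):
        """Steering input: change speed and/or heading."""
        if speed is not None:
            self.speed = speed
        if heading is not None:
            self.heading = heading

    def update(self, dt=1.):
        """Advance the target one time step and return its position."""
        self.x += self.speed * math.cos(self.heading) * dt
        self.y += self.speed * math.sin(self.heading) * dt
        return self.x, self.y
```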
},
{
@@ -523,7 +523,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Just to be clear about the dimension reduction, if we wanted to track both *x* and *y* we would have written\n",
"To be clear about the dimension reduction, if we wanted to track both *x* and *y* we would have written\n",
"\n",
" cvfilter = KalmanFilter(dim_x=4, dim_z=2)\n",
" cvfilter.x = array([0., 0., 0., 0.])\n",
@@ -775,7 +775,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The constant acceleration model is able to track the maneuver with no lag, but at the cost of very noisy output during the steady state behavior. The noisy output is due to the filter being unable to distinguish between the beginning of an maneuver and just noise in the signal. Noise in the signal implies an acceleration, and so the acceleration term of the filter tracks it. \n",
"The constant acceleration model is able to track the maneuver with no lag, but at the cost of very noisy output during the steady state behavior. The noisy output is due to the filter being unable to distinguish between the beginning of a maneuver and noise in the signal. Noise in the signal implies an acceleration, and so the acceleration term of the filter tracks it. \n",
"\n",
"It seems we cannot win. A constant velocity filter cannot react quickly when the target is accelerating, but a constant acceleration filter misinterprets noise during zero acceleration regimes as legitimate acceleration.\n",
"\n",
@@ -984,7 +984,7 @@
"source": [
"This plot should make clear the effect of normalizing the residual. Squaring the residual ensures that the signal is always greater than zero, and normalizing by the measurement covariance scales the signal so that we can distinguish when the residual is markedly changed relative to the measurement noise. The maneuver starts at t=3 seconds, and we can see that $\\epsilon$ starts to increase rapidly not long after that. \n",
"\n",
"We will want to start scaling $\\mathbf{Q}$ up once $\\epsilon$ exceeds some limit, and back down once it again falls below that limit. We multiply $\\mathbf{Q}$ by a scaling factor. Perhaps there is literature on choosing this factor analytically; I just derive it experimentally. We can be somewhat more analytical about choosing the limit for $\\epsilon$ (named $\\epsilon_{max}$) - generally speaking once the residual is greater than 3 standard deviations or so we can assume the difference is due to a real change and not to noise. However, sensors are rarely truly Gaussian and so a larger number, such as 5-6 standard deviations is used in practice.\n",
"We will want to start scaling $\mathbf{Q}$ up once $\epsilon$ exceeds some limit, and back down once it again falls below that limit. We multiply $\mathbf{Q}$ by a scaling factor. Perhaps there is literature on choosing this factor analytically; I derive it experimentally. We can be somewhat more analytical about choosing the limit for $\epsilon$ (named $\epsilon_{max}$) - generally speaking, once the residual is greater than 3 standard deviations or so we can assume the difference is due to a real change and not to noise. However, sensors are rarely truly Gaussian, so in practice a larger number, such as 5-6 standard deviations, is used.\n",
"\n",
"I have implemented this algorithm using reasonable values for $\\epsilon_{max}$ and the $\\mathbf{Q}$ scaling factor. To make inspection of the result easier I have limited the plot to the first 10 seconds of simulation. "
]
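The core of that adaptation fits in a few lines. A sketch of a single step, with illustrative (not tuned) values for $\epsilon_{max}$ and the scaling factor, where `y` is the residual and `S` the residual covariance:

```python
import numpy as np

def adapt_Q(y, S, Q, count, eps_max=4., Q_scale_factor=1000.):
    """One adaptation step: scale Q up while the normalized residual
    is large, and back down one notch per step once it settles.
    `count` tracks how many scalings are currently applied.
    eps_max and Q_scale_factor are illustrative values."""
    eps = (y.T @ np.linalg.inv(S) @ y).item()  # squared residual, normalized by S
    if eps > eps_max:
        Q = Q * Q_scale_factor
        count += 1
    elif count > 0:
        Q = Q / Q_scale_factor
        count -= 1
    return Q, count, eps
```

Scaling back down one notch at a time, rather than all at once, keeps the filter responsive in case the maneuver is still in progress.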
@@ -1539,7 +1539,7 @@
"\n",
"As you might imagine this is a broad topic, and there are many ways of designing and implementing MM estimators. But consider a simple approach for the target we have been tracking in this chapter. One idea would be to simultaneously run a constant velocity and a constant acceleration filter, and to switch between their outputs when we detect a maneuver by inspecting the residuals. Even this choice gives us many options. Consider the dynamics of a turning object. For example, an automobile turns on a wheelbase - the front wheels turn, and the car pivots around the rear wheels. This is a nonlinear process, so for best results we would want to use some type of nonlinear filter (EKF, UKF, etc) to model the turns. On the other hand, a linear constant velocity filter would perform fine for the steady state portions of the travel. So our bank of filters might consist of a linear KF and an EKF for the turns. However, neither is particularly well suited for modeling behaviors such as accelerating and braking. So a high-performing MM estimator might contain a bank of many filters, each designed to perform best for a certain performance envelope of the tracked object.\n",
"\n",
"Of course, you do not need to just base your filters on the order of the model. You can use different noise models, different adapters in each. For example, in the section above I showed many plots depicting the effects of changing parameters on the estimate of the velocity and position. Perhaps one setting works better for position, and a different setting for velocity. Put both in your bank of filters. You could then take the best estimates for the position from one filter, and the best estimate for the velocity from a different filter.\n",
"Of course, you do not need to base your filters on the order of the model. You can use different noise models, different adapters in each. For example, in the section above I showed many plots depicting the effects of changing parameters on the estimate of the velocity and position. Perhaps one setting works better for position, and a different setting for velocity. Put both in your bank of filters. You could then take the best estimates for the position from one filter, and the best estimate for the velocity from a different filter.\n",
"\n",
"### A Two Filter Adaptive Filter\n",
"\n",