This commit is contained in:
Sylvain Gugger 2020-03-29 11:06:58 -07:00
parent 423ed1bac3
commit 1790ccf188
3 changed files with 36 additions and 54 deletions

View File

@ -33,7 +33,7 @@
"source": [
"Having seen what it looks like to actually train a variety of models in Chapter 2, lets now look under the hood and see exactly what is going on. Well start by using computer vision to introduce fundamental tools and concepts for deep learning.\n",
"\n",
"To be exact, we'll discuss the role of arrays and tensors, and of broadcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of a loss function for our basic classification task, and the role of mini-batches. We'll also describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them at work.\n",
"To be exact, we'll discuss the role of arrays and tensors, and of broadcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of a loss function for our basic classification task, and the role of mini-batches. We'll also describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together.\n",
"\n",
"In future chapters well do deep dives into other applications as well, and see how these concepts and tools generalize. But this chapter is about laying foundation stones. To be frank, that also makes this one of the hardest chapters, because of how these concepts all depend on each other. Like an arch, all the stones need to be in place for the structure to stay up. Also like an arch, once that happens, it's a powerful structure that can support other things. But it requires some patience to assemble.\n",
"\n",
@ -71,7 +71,7 @@
"\n",
"Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected from top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read hand-written text--something that had never been achieved before. However his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n",
"\n",
"In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working on the *LSTM* architecture with his student Sepp Hochreiter (widely used for speech recognition and other text modeling tasks, and used in the IMDB example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is the most important foundation of modern AI.\n",
"In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working with his student Sepp Hochreiter on the *LSTM* architecture (widely used for speech recognition and other text modeling tasks, and used in the IMDB example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is the most important foundation of modern AI.\n",
"\n",
"There is a lesson here for all of us! On your deep learning journey you will face many obstacles, both technical, and (even more difficult) people around you who don't believe you'll be successful. There's one *guaranteed* way to fail, and that's to stop trying. We've seen that the only consistent trait amongst every fast.ai student that's gone on to be a world-class practitioner is that they are all very tenacious."
]
@ -167,7 +167,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a folder of \"3\"s, and a folder of \"7\"s. In machine learning parlance, we say that \"3\" and \"7\" are the *labels* in this dataset. Let's take a look in one of these folders (using `sorted` to ensure we all get the same order of files):"
"There's a folder of \"3\"s, and a folder of \"7\"s. In machine learning parlance, we say that \"3\" and \"7\" are the *labels* (or targets) in this dataset. Let's take a look in one of these folders (using `sorted` to ensure we all get the same order of files):"
]
},
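As a concrete sketch of that listing step (assuming `path` was set earlier via `untar_data(URLs.MNIST_SAMPLE)`, as in the notebook's setup cells):

```python
from fastai.vision.all import *

path = untar_data(URLs.MNIST_SAMPLE)
# sorted() ensures everyone gets the same file order
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes
```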
{
@ -2035,7 +2035,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, is our baseline model any good? To quantify this, we will use a metric."
"So, is our baseline model any good? To quantify this, we must define a metric."
]
},
{
@ -2051,7 +2051,9 @@
"source": [
"Recall that a metric is a number which is calculated from the predictions of our model, and the correct labels in our dataset, in order to tell us how good our model is. For instance, we could use either of the functions we saw in the previous section, mean squared error, or mean absolute error, and take the average of them over the whole dataset. However, neither of these are numbers that are very understandable to most people; in practice, we normally use *accuracy* as the metric for classification models.\n",
"\n",
"As we've discussed, we will want to calculate our metric over a a *validation set*. To get a validation set we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"As we've discussed, we will want to calculate our metric over a *validation set*. This is so that we don't inadvertently overfit -- that is, train a model to work well only on our training data. To be very precise, this is not really a risk on the pixel similarity model we're using here as a first try, since it has no trained components. But we'll use a validation set anyway to follow normal practices and to be ready for our second try later.\n",
"\n",
"To get a validation set we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"\n",
"So to start with, let's create tensors for our threes and sevens from that directory. These are the tensors we will use calculate a metric measuring the quality of our first try model, which measures distance from an ideal image."
]
@ -2151,11 +2153,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of complaining about shapes not matching, it returned the distance for every single image, as a vector (i.e. a rank 1 tensor) of length 1010 (the number of threes in our validation set). How did that happen?\n",
"Instead of complaining about shapes not matching, it returned the distance for every single image, as a vector (i.e. a rank 1 tensor) of length 1,010 (the number of threes in our validation set). How did that happen?\n",
"\n",
"Have a look again at our function `mnist_distance`, and you'll see we have there the subtraction `(a-b)`.\n",
"\n",
"The magic trick is that PyTorch, when it tries to perform a simple operation subtraction between two tensors of different ranks, will use *broadcasting*. Broadcasting means PyTorch will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
"The magic trick is that PyTorch, when it tries to perform a simple operation subtraction between two tensors of different ranks, will use *broadcasting*. Broadcasting is a feature where PyTorch will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
"\n",
"After broadcasting expands the two argument tensors to have the same rank, PyTorch applies its usual logic for two tensors of the same rank, which is to perform the operation on each corresponding element of the two tensors, and returns the tensor result. For instance:"
]
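A self-contained sketch of the broadcasting behavior described above (the tensors here are made up for illustration):

```python
import torch

a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])      # rank 2, shape (2, 3)
b = torch.tensor([10., 20., 30.])     # rank 1, shape (3,)

# b is treated as if it were expanded to shape (2, 3), without copying memory
print(a - b)          # tensor([[ -9., -18., -27.], [ -6., -15., -24.]])
print((a - b).shape)  # torch.Size([2, 3])
```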
@ -2211,16 +2213,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We are calculating the difference between the \"ideal 3\" and each of 1010 threes in the validation set, for each of `28x28` images, resulting in the shape `1010,28,28`.\n",
"We are calculating the difference between the \"ideal 3\" and each of 1,010 threes in the validation set, for each of `28x28` images, resulting in the shape `1010,28,28`.\n",
"\n",
"There's a couple of important points about how broadcasting is implemented, which mean it is valuable not just for expressivity but also for performance:\n",
"There's a couple of important points about how broadcasting is implemented, which make it valuable not just for expressivity but also for performance:\n",
"\n",
"- PyTorch doesn't *actually* copy `mean3` 1010 times. Instead, it just *pretends* as if it was a tensor of that shape, but doesn't actually allocate any additional memory\n",
"- It does the whole calculation in C (or, if you're using a GPU, in CUDA, the equivalent of C on the GPU), tens of thousands of times faster than pure Python (up to millions of times faster on a GPU!)\n",
"\n",
"This is true of all broadcasting and elementwise operations and functions done in PyTorch. **It's the most important technique for you to know to create efficient PyTorch code.** \n",
"\n",
"Next in `mnist_distance` we see `abs()`. You might be able to guess now what this does when applied to a tensor. It applies the method to each individual element in the tensor, and returns a tensor of the results (that is, it applies the method \"elementwise\"). So in this case, we'll get back 1010 absolute values.\n",
"Next in `mnist_distance` we see `abs()`. You might be able to guess now what this does when applied to a tensor. It applies the method to each individual element in the tensor, and returns a tensor of the results (that is, it applies the method \"elementwise\"). So in this case, we'll get back 1,010 absolute values.\n",
"\n",
"Finally, our function calls `mean((-1,-2))`. The tuple `(-1,-2)` represents a range of axes. In Python, `-1` refers to the last element, and `-2` refers to the second last. So in this case, this tells PyTorch that we want to take the mean ranging over the values indexed by the last two axes of the tensor. The last two axes are the horizontal and vertical dimensions of an image. So after taking the mean over the last two axes, we are left with just the first tensor axis, which indexes over our images, which is why our final size was `(1010)`. In other words, for every image, we averaged the intensity of all the pixels in that image.\n",
"\n",
@ -2350,9 +2352,9 @@
"\n",
"> : _Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would \"learn\" from its experience._\n",
"\n",
"As we discussed, this is the key to allowing us to have something which can get better and better — to learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters (which will be the SGD part, as we will see). In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n",
"As we discussed, this is the key to allowing us to have something which can get better and better — to learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters. In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n",
"\n",
"Instead of trying to find the similarity between an image and an \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n",
"Instead of trying to find the similarity between an image and an \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function and set of weight values for each possible category, for instance the probability of being the number eight:\n",
"\n",
"```\n",
"def pr_eight(x,w) = (x*w).sum()\n",
@ -2504,7 +2506,7 @@
"\n",
"- **Initialize**:: we initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
"- **Loss**:: This is the thing Arthur Samuel referred to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes, by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes. The way to do this is by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
"- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
]
},
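Here is a minimal sketch of the "Step" point above, contrasting the slow nudge-and-remeasure approach with gradients on a toy one-parameter loss (the function and numbers are made up for illustration):

```python
import torch

def loss_fn(w): return (w - 3.0)**2   # toy loss, minimized at w = 3

# Slow manual approach: nudge the weight slightly and re-measure the loss
w, eps = torch.tensor(1.0), 1e-4
slope_estimate = (loss_fn(w + eps) - loss_fn(w)) / eps

# Calculus approach: have autograd compute the slope directly
w = torch.tensor(1.0, requires_grad=True)
loss_fn(w).backward()

print(slope_estimate.item(), w.grad.item())  # both close to -4.0
```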
@ -2629,7 +2631,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The one magic step is the bit where we calculate the *gradients*. As we mentioned, we can use calculus as a performance optimization; it allows us to more quickly calculate whether our loss will go up or down when we adjust our parameters up or down. In other words, the gradients will tell us how much we have to change each weight to make our model better.\n",
"The one magic step is the bit where we calculate the *gradients*. As we mentioned, we use calculus as a performance optimization; it allows us to more quickly calculate whether our loss will go up or down when we adjust our parameters up or down. In other words, the gradients will tell us how much we have to change each weight to make our model better.\n",
"\n",
"Perhaps you remember back to your high school calculus class: the *derivative* of a function tells you how much a change in the parameters of a function will change its result. Don't worry, lots of us forget our calculus once high school is behind us! But you will have to have some intuitive understanding of what a derivative is before you continue, so if this is all very fuzzy in your head, head over to Khan Academy and complete the lessons on basic derivatives. You won't have to know how to calculate them yourselves, you just have to know what a derivative is.\n",
"\n",
@ -2708,7 +2710,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"backward\" here refers to \"back propagation\", which is the name given to the process of calculating the derivative of each layer (we'll see how this is done exactly in chapter <chapter_foundations>, when we calculate the gradients of a deep neural net from scratch). This is called the \"backward pass\" of the network, as opposed to the \"forward pass\", which is where the activations are calculated. Life would probably be easier if `backward` was just called `calculate_grad`, but deep learning folks really do like to add jargon everywhere they can!"
"The \"backward\" here refers to \"back propagation\", which is the name given to the process of calculating the derivative of each layer. We'll see how this is done exactly in chapter <chapter_foundations>, when we calculate the gradients of a deep neural net from scratch. This is called the \"backward pass\" of the network, as opposed to the \"forward pass\", which is where the activations are calculated. Life would probably be easier if `backward` was just called `calculate_grad`, but deep learning folks really do like to add jargon everywhere they can!"
]
},
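A minimal sketch of the two passes on a toy composed function (values chosen only for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = ((x**2 + 1).relu()) * 3   # forward pass: the activations are calculated

y.backward()                  # backward pass: back propagation fills in x.grad
print(x.grad)                 # tensor(12.), since dy/dx = 3 * 2x here
```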
{
@ -2830,7 +2832,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The gradient only tells us the slope of our function, it doesn't actually tell us how far to adjust the parameters. It gives us some idea of how far to adjust them; if the slope is very large, then that may suggest that we have more adjustments to do, whereas if the slope is very small, that may suggest that we are close to the optimal value."
"The gradient only tells us the slope of our function, it doesn't actually tell us exactly how far to adjust the parameters. But it gives us some idea of how far; if the slope is very large, then that may suggest that we have more adjustments to do, whereas if the slope is very small, that may suggest that we are close to the optimal value."
]
},
{
@ -2852,7 +2854,7 @@
"\n",
"This is known as *stepping* your parameters, using a *optimiser step*.\n",
"\n",
"If you pick a learning rate that's too low, it can mean having to do for a lot of steps. <<descent_small>> illustrates that."
"If you pick a learning rate that's too low, it can mean having to do a lot of steps. <<descent_small>> illustrates that."
]
},
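The step itself is one line of code; a sketch (with an arbitrary toy loss and a made-up learning rate):

```python
import torch

params = torch.randn(3, requires_grad=True)
loss = (params**2).sum()      # arbitrary toy loss
loss.backward()

lr = 0.1                      # the learning rate scales the size of each step
params.data -= lr * params.grad.data   # the optimiser step
params.grad = None            # clear the gradients before the next iteration
```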
{
@ -2908,7 +2910,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand SGD, it might be easier to start with a simple, synthetic, example. Let's imagine you were measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, and then get slower as it went up the hill, and then would be slowest at the top, and it would then speed up again as it goes downhill. If you're measuring the speed manually every second for 20 seconds, it might look something like this:"
"We've seen how to use gradients to find a minimum. Now it's time to look at an SGD example, and see how finding a minimum can be used to train a model to fit data better.\n",
"\n",
"Let us start with a simple, synthetic, example model. Imagine you were measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, and then get slower as it went up the hill, and then would be slowest at the top, and it would then speed up again as it goes downhill. You want to build a model of how the space changes over time. If you're measuring the speed manually every second for 20 seconds, it might look something like this:"
]
},
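The notebook constructs these measurements along the following lines (a sketch; the noise level and coefficients are illustrative):

```python
import torch

time = torch.arange(0, 20).float()
# a roughly quadratic speed curve, plus noise from our imprecise manual timing
speed = torch.randn(20)*3 + 0.75*(time - 9.5)**2 + 1
```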
{
@ -2958,9 +2962,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've added a bit of random noise, since measuring things manually isn't precise. This means it's not that easy to answer the question: what was the roller coaster's lowest speed? Using SGD we can try to find a function that matches our observations. We can't consider every possible function, so let's use a guess that it will be quadratic, i.e. a function of the form `a*(time**2)+(b*time)+c`.\n",
"We've added a bit of random noise, since measuring things manually isn't precise. This means it's not that easy to answer the question: what was the roller coaster's speed? Using SGD we can try to find a function that matches our observations. We can't consider every possible function, so let's use a guess that it will be quadratic, i.e. a function of the form `a*(time**2)+(b*time)+c`.\n",
"\n",
"We want to distinguish clearly between the function's input (the time when we are measuring the coaster's speed) and its parameters (the values that define *which* quadratic we're trying). So let us collect the parameters in one argument and separate the input, `t`, and the parameters, `params`, in the function's signature: "
"We want to distinguish clearly between the function's input (the time when we are measuring the coaster's speed) and its parameters (the values that define *which* quadratic we're trying). So let us collect the parameters in one argument and thus separate the input, `t`, and the parameters, `params`, in the function's signature: "
]
},
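In code, that separation of input and parameters looks like this (a sketch matching the signature just described):

```python
def f(t, params):
    a, b, c = params
    return a*(t**2) + (b*t) + c
```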
{
@ -3191,7 +3195,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 5--*Step* the weights. In other words, update the parameters based on the gradients we just calculated."
"Step 5--*Step* the weights. In other words, update the parameters based on the gradients we just calculated.\n",
"\n",
 a: Understanding this bit">
"> a: Understanding this bit depends on remembering recent history. To calculate the gradients we call `backward()` on the `loss`. But this `loss` was itself calculated by `mse()`, which in turn took `preds` as an input, which was calculated using `f` taking as an input `params`, which was the object on which we originally called `requires_grad_()` -- which is the original call that now allows us to call `backward()` on `loss`. This chain of function calls represents the mathematical composition of functions, which enables PyTorch to use calculus's chain rule under the hood to calculate these gradients."
]
},
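A sketch of the whole chain the note above describes, end to end (with stand-in data and a made-up learning rate):

```python
import torch

def f(t, params):
    a, b, c = params
    return a*(t**2) + (b*t) + c

def mse(preds, targets): return ((preds - targets)**2).mean()

time = torch.arange(0, 20).float()
targets = torch.randn(20)                 # stand-in measurements
params = torch.randn(3).requires_grad_()  # the call that enables backward() later

preds = f(time, params)        # f composes with mse...
loss = mse(preds, targets)     # ...so the chain rule can flow gradients back
loss.backward()                # gradients land in params.grad

lr = 1e-5
params.data -= lr * params.grad.data      # step 5: update the parameters
params.grad = None
```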
{
@ -3334,7 +3340,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking only at these loss numbers disguises the fact that each iteration represents an entirely different quadratic function being tried, on the way to find the best possible quadratic function. We can see this process visually if, instead of printing out the loss function, we plot the function at every step. Then we can see how the shape is approaching the best possible quadratic function for our data:"
"Loss is going down, just as we hoped! But looking only at these loss numbers disguises the fact that each iteration represents an entirely different quadratic function being tried, on the way to find the best possible quadratic function. We can see this process visually if, instead of printing out the loss function, we plot the function at every step. Then we can see how the shape is approaching the best possible quadratic function for our data:"
]
},
{
@ -3415,7 +3421,7 @@
"\n",
"The purpose of the loss function is to measure the difference between predicted values and the true values -- that is, the targets (aka, the labels). So let's make another argument `targets`, a vector (i.e., another rank-1 tensor), indexed over the images, with a value of 0 or 1 which tells whether that image actually is a 3.\n",
"\n",
"So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence that the first was a 3, with slight confidence that the second was a 7, and with fair confidence (and incorrectly!) that the last was a 7. This would mean our loss function would take values as its inputs:"
"So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence that the first was a 3, with slight confidence that the second was a 7, and with fair confidence (and incorrectly!) that the last was a 7. This would mean our loss function would receive these values as its inputs:"
]
},
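Concretely, those inputs could look like this (a sketch; `1` means "is a 3" and the prediction values are made up):

```python
import torch

trgts = torch.tensor([1, 0, 1])        # a 3, a 7, and a 3
prds = torch.tensor([0.9, 0.4, 0.2])   # confident 3, slight 7, (wrongly) fair 7
```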
{
@ -3617,7 +3623,7 @@
"\n",
"Having defined a loss function, now is a good moment to recapitulate why we did this. After all, we already had a *metric*, which was overall accuracy. So why did we define a *loss*?\n",
"\n",
"The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automated learning, the loss must be a function which has a meaningful derivative. It can't have big flat sections, and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. The requirements on loss sometimes do not really reflect exactly what we are trying to achieve, but are rather a compromise between our real goal, and a function that can be optimised using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch these are all averaged, and the overall mean is reported for the epoch.\n",
"The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automated learning, the loss must be a function which has a meaningful derivative. It can't have big flat sections, and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. This requirement on loss means that sometimes it does not really reflect exactly what we are trying to achieve, but is rather a compromise between our real goal, and a function that can be optimised using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch these are all averaged, and the overall mean is reported for the epoch.\n",
"\n",
"Metrics, on the other hand, are the numbers that we really care about. These are the things which are printed at the end of each epoch, and tell us how our model is really doing. It is important that we learn to focus on these metrics, rather than the loss, when judging the performance of a model."
]
@ -3633,15 +3639,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a loss function which is suitable to drive SGD, we can consider some of the details involved in the next phase of the learning process, which is *step* (i.e., change or update) the weights based on the gradients. This is called an optimisation step.\n",
"Now that we have a loss function which is suitable to drive SGD, we can consider some of the details involved in the next phase of the learning process, which is to *step* (i.e., change or update) the weights based on the gradients. This is called an optimisation step.\n",
"\n",
"In order to take an optimiser step we need to calculate the loss over one or more data items. How many should we use? We could calculate it for the whole dataset, and take the average, or we could calculate it for a single data item. But neither of these is ideal. Calculating it for the whole dataset would take a very long time. Calculating it for a single item would not use much information, and so it would result in a very imprecise and unstable gradient. That is, you'd be going to the trouble of updating the weights but taking into account only how that would improve the model's performance on that single item.\n",
"\n",
"So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your datasets gradient on the loss function, but it will take longer, and you will get less mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n",
"So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your dataset's gradient on the loss function, but it will take longer, and you will get less mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n",
"\n",
"Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!\n",
"\n",
"As we've seen, in the discussion of data augmentation, we get better generalisation if we can vary things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our data set in order for every epoch, instead what we normally do is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n",
"As we've seen, in the discussion of data augmentation, we get better generalisation if we can vary things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our dataset in order for every epoch, instead what we normally do is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n",
"\n",
"A `DataLoader` can take any Python collection, and turn it into an iterator over many batches, like so:"
]
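For instance, a sketch of that usage (the exact batch contents will vary because of shuffling):

```python
from fastai.vision.all import *

coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
# e.g. [tensor([ 3, 12,  8, 10,  2]), tensor([ 9,  4,  7, 14,  5]), tensor([ 1, 13,  0,  6, 11])]
```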
@ -3702,7 +3708,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we pass a Dataset to a DataLoader we will get back many batches which are themselves tuples of independent and dependent variables:"
"When we pass a Dataset to a DataLoader we will get back many batches which are themselves tuples of tensors representing batches of independent and dependent variables:"
]
},
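A sketch using a toy Dataset of (index, letter) pairs (again, shuffling makes the exact output vary):

```python
import string
from fastai.vision.all import *

ds = L(enumerate(string.ascii_lowercase))  # a Dataset: tuples of (index, letter)
dl = DataLoader(ds, batch_size=6, shuffle=True)
first(dl)
# e.g. (tensor([17, 18, 10, 22,  8, 14]), ('r', 's', 'k', 'w', 'i', 'o'))
```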
{
@ -3748,7 +3754,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"it's time to implement the graph we saw in <<gradient_descent>>. In code, our process will be implemented something like this for each epoch:\n",
"It's time to implement the graph we saw in <<gradient_descent>>. In code, our process will be implemented something like this for each epoch:\n",
"\n",
"```python\n",
"for x,y in dl:\n",
@ -5478,18 +5484,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -213,7 +213,7 @@
"source": [
"As we write this book, the default *English word tokenizer* for fastai uses a library called *spaCy*. This uses a sophisticated rules engine that has special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not always be Spacy, depending when you're reading this).\n",
"\n",
"Let's try it out. We'll use fastai's `coll_repr(collection,n)` function to display the results; this displays the first `n` items of `collection`, along with the full size--it's what `L` uses by default. Not that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
"Let's try it out. We'll use fastai's `coll_repr(collection,n)` function to display the results; this displays the first `n` items of `collection`, along with the full size--it's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
]
},
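A sketch of the call pattern just described (`txt` here is a stand-in document; in the notebook it comes from the IMDb sample):

```python
from fastai.text.all import *

txt = "This movie was great!"   # stand-in; the notebook uses an IMDb review
spacy = WordTokenizer()
toks = first(spacy([txt]))      # note the list: tokenizers take a collection
print(coll_repr(toks, 30))
```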
{

View File

@ -4010,18 +4010,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,