Updates
This commit is contained in:
parent
a3599602ce
commit
d8d39c560a
@ -225,7 +225,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Harvard professor David Perkins, who wrote _Making Learning Whole_ (Jossey-Bass), has much to say about teaching. The basic idea is to teach the *whole game*. That means that if you're teaching baseball, you first take people to a baseball game or get them to play it. You don't teach them how to line thread into a ball, the physics of a parabola, or the coefficient of friction of a ball on a bat.\n",
|
||||
"Harvard professor David Perkins, who wrote _Making Learning Whole_ (Jossey-Bass), has much to say about teaching. The basic idea is to teach the *whole game*. That means that if you're teaching baseball, you first take people to a baseball game or get them to play it. You don't teach them how to wind twine to make a baseball from scratch, the physics of a parabola, or the coefficient of friction of a ball on a bat.\n",
|
||||
"\n",
|
||||
"Paul Lockhart, a Columbia math PhD, former Brown professor, and K-12 math teacher, imagines in the influential [essay](https://www.maa.org/external_archive/devlin/LockhartsLament.pdf) \"A Mathematician's Lament\" a nightmare world where music and art are taught the way math is taught. Children are allowed to listen to or play music until they have spent over a decade mastering music notation and theory, spending classes transposing sheet music into a different key. In art class, students study colors and applicators, but aren't allowed to actually paint until college. Sound absurd? This is how math is taught–-we require students to spend years doing rote memorization and learning dry, disconnected *fundamentals* that we claim will pay off later, long after most of them quit the subject.\n",
|
||||
"\n",
|
||||
|
@ -2160,9 +2160,7 @@
|
||||
"source": [
|
||||
"Instead of complaining about shapes not matching, it returned the distance for every single image as a vector (i.e., a rank-1 tensor) of length 1,010 (the number of 3s in our validation set). How did that happen?\n",
|
||||
"\n",
|
||||
"Take another look at our function `mnist_distance`, and you'll see we have there the subtraction `(a-b)`.\n",
|
||||
"\n",
|
||||
"The magic trick is that PyTorch, when it tries to perform a simple subtraction operation between two tensors of different ranks, will use *broadcasting*. That is, it will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
|
||||
"Take another look at our function `mnist_distance`, and you'll see we have there the subtraction `(a-b)`. The magic trick is that PyTorch, when it tries to perform a simple subtraction operation between two tensors of different ranks, will use *broadcasting*. That is, it will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
|
||||
"\n",
|
||||
"After broadcasting so the two argument tensors have the same rank, PyTorch applies its usual logic for two tensors of the same rank: it performs the operation on each corresponding element of the two tensors, and returns the tensor result. For instance:"
|
||||
]
|
||||
@ -3579,7 +3577,7 @@
|
||||
"source": [
|
||||
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
|
||||
"\n",
|
||||
"We begin by comparing the outputs the model gives us with our targets (we have labeled data, so we know what result the model should give) using a *loss function*, which returns a number that we want to make as low as possible by improving our weights. To do this, we take a few data items (such as images) and feed them to our model. We compare the corresponding targets using our loss function, and the score we get tells us how wrong our predictions were. We then change the weights a little bit to make it slightly better.\n",
|
||||
"We begin by comparing the outputs the model gives us with our targets (we have labeled data, so we know what result the model should give) using a *loss function*, which returns a number that we want to make as low as possible by improving our weights. To do this, we take a few data items (such as images) from the training set and feed them to our model. We compare the corresponding targets using our loss function, and the score we get tells us how wrong our predictions were. We then change the weights a little bit to make it slightly better.\n",
|
||||
"\n",
|
||||
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradients*. (Actually, we let PyTorch do it for us!) Let's consider an analogy. Imagine you are lost in the mountains with your car parked at the lowest point. To find your way back to it, you might wander in a random direction, but that probably wouldn't help much. Since you know your vehicle is at the lowest point, you would be better off going downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size. We then *iterate* until we have reached the lowest point, which will be our parking lot, then we can *stop*.\n",
|
||||
"\n",
|
||||
|
@ -1931,7 +1931,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Boot Strapping a Collaborative Filtering Model"
|
||||
"## Bootstrapping a Collaborative Filtering Model"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -29,7 +29,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify reviews. One thing is a bit different from the transfer learning we have in computer vision: the pretrained model was not trained on the same task as the model we used to classify reviews.\n",
|
||||
"In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify reviews. That example highlighted a difference between transfer learning in NLP and computer vision: in general in NLP the pretrained model is trained on a different task.\n",
|
||||
"\n",
|
||||
"What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to develop an understanding of the English (or other) language. Self-supervised learning can also be used in other domains; for instance, see [\"Self-Supervised Learning and Computer Vision\"](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pretraining a model used for transfer learning."
|
||||
]
|
||||
@ -1398,7 +1398,7 @@
|
||||
"source": [
|
||||
"The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The 8perplexity* metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.\n",
|
||||
"\n",
|
||||
"Going back to our ULMFit process diagramThe grey first arrow in our overall picture has been done for us and made available as a pretrained model in fastai; we've now built the `DataLoaders` and `Learner` for the second stage, and we're ready to fine-tune it!"
|
||||
"Let's go back to the process diagram from the beginning of this chapter. The first arrow has been completed for us and made available as a pretrained model in fastai, and we've just built the `DataLoaders` and `Learner` for the second stage. Now we're ready to fine-tune our language model!"
|
||||
]
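As a small illustration of how the two quantities relate (the logits and targets here are just random placeholders, not a real language model):

```python
import torch
import torch.nn.functional as F

logits  = torch.randn(16, 100)               # batch of 16, made-up vocab of 100
targets = torch.randint(0, 100, (16,))

loss = F.cross_entropy(logits, targets)
print(loss.item(), torch.exp(loss).item())   # perplexity = exp(cross-entropy)
```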
|
||||
},
|
||||
{
|
||||
@ -1508,7 +1508,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can then train the model for more epochs after unfreezing:"
|
||||
"Once the initial training has completed, we can continue fine-tuning the model after unfreezing:"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -1253,7 +1253,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Use the mid-level API to prepare the data in `DataLoaders` on the pets dataset. On the adult dataset (used in chapter 1).\n",
|
||||
"1. Use the mid-level API to prepare the data in `DataLoaders` on your own datasets. Try this with the Pet dataset and the Adult dataset from Chapter 1.\n",
|
||||
"1. Look at the Siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new type of items. Implement it in your own project."
|
||||
]
|
||||
},
|
||||
|
@ -419,7 +419,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"An arrow represents the actual layer computation—i.e., the linear layer followed by the activation layers. Using this notation, <<lm_rep>> shows what our simple language model looks like."
|
||||
"An arrow represents the actual layer computation—i.e., the linear layer followed by the activation function. Using this notation, <<lm_rep>> shows what our simple language model looks like."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1578,11 +1578,11 @@
|
||||
"\n",
|
||||
"First, the arrows for input and old hidden state are joined together. In the RNN we wrote earlier in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`; thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.\n",
|
||||
"\n",
|
||||
"The first gate (looking from left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
|
||||
"Since it’s a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell state to determine which information to keep and which to throw away: values closer to 0 are discarded and values closer to 1 are kept. This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
|
||||
"\n",
|
||||
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance, we may see a new gender pronoun, in which case we'll need to replace the information about gender that the forget gate removed. Like the forget gate, the input gate ends up on a product, so it just decides which element of the cell state to update (values close to 1) or not (values close to 0). The third gate determines the updated values, with elements between -1 and 1 (thanks to the tanh function). The result is then added to the cell state.\n",
|
||||
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance, we may see a new gender pronoun, in which case we'll need to replace the information about gender that the forget gate removed. Similar to the forget gate, the input gate decides which elements of the cell state to update (values close to 1) or not (values close to 0). The third gate determines what those updated values are, in the range of –1 to 1 (thanks to the tanh function). The result is then added to the cell state.\n",
|
||||
"\n",
|
||||
"The last gate is the *output gate*. Similar to the forget gate, it decides which element of the cell state to keep for our prediction at this timestep (values close to 1) and which to discard (values close to 0). The cell state is applied the tanh function before being used.\n",
|
||||
"The last gate is the *output gate*. It determines which information from the cell state to use to generate the output. The cell state goes through a tanh before being combined with the sigmoid output from the output gate, and the result is the new hidden state.\n",
|
||||
"\n",
|
||||
"In terms of code, we can write the same steps like this:"
|
||||
]
|
||||
|
@ -31,9 +31,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In <<chapter_mnist_basics>> we learned how to create a neural network recognising images. We were able to achieve a bit over 98% accuracy at recognising threes from sevens. But we also saw that fastai's built in classes were able to get close to 100%. Let's start trying to close the gap.\n",
|
||||
"In <<chapter_mnist_basics>> we learned how to create a neural network recognizing images. We were able to achieve a bit over 98% accuracy at distinguishing 3s from 7s--but we also saw that fastai's built-in classes were able to get close to 100%. Let's start trying to close the gap.\n",
|
||||
"\n",
|
||||
"In this chapter, we will start by digging into what convolutions are and build a CNN from scratch. We will then study a range of techniques to improve training stability and learn all the tweaks the library usually applies for us to get great results."
|
||||
"In this chapter, we will begin by digging into what convolutions are and building a CNN from scratch. We will then study a range of techniques to improve training stability and learn all the tweaks the library usually applies for us to get great results."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -47,25 +47,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular dataset preprocessing added date features to the Bulldozers dataset. What kind of features might we be able to create from images?"
|
||||
"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular dataset preprocessing in <<chapter_tabular>> added date features to the Bulldozers dataset. What kinds of features might we be able to create from images?"
|
||||
]
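As a reminder of what that looks like, here is a quick sketch on a tiny made-up DataFrame (the column name mirrors the Bulldozers dataset, but the values are invented):

```python
import pandas as pd
from fastai.tabular.core import add_datepart

df = pd.DataFrame({'saledate': pd.to_datetime(['2010-03-01', '2011-07-15'])})
df = add_datepart(df, 'saledate')
print(list(df.columns))  # date parts: year, month, week, day of week, elapsed time, etc.
```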
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: Feature engineering: creating new transformations of the input data in order to make it easier to model."
|
||||
"> jargon: Feature engineering: Creating new transformations of the input data in order to make it easier to model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the context of an image, a *feature* will be a visually distinctive attribute of an image. Here's an idea: the number seven is characterised by a horizontal edge near the top of the digit, and a bottom left to top right diagonal edge underneath that. On the other hand, the number three is characterised by a diagonal edge in one direction in the top left and bottom right of the digit, the opposite diagonal on the bottom left and top right, a horizontal edge in the middle of the top and the bottom, and so forth. So what if we could extract information about where the edges occur in each image, and then use that as our features, instead of raw pixels?\n",
|
||||
"In the context of an image, a feature is a visually distinctive attribute. For example, the number 7 is characterized by a horizontal edge near the top of the digit, and a top-right to bottom-left diagonal edge underneath that. On the other hand, the number 3 is characterized by a diagonal edge in one direction at the top left and bottom right of the digit, the opposite diagonal at the bottom left and top right, horizontal edges at the middle, top, and bottom, and so forth. So what if we could extract information about where the edges occur in each image, and then use that information as our features, instead of raw pixels?\n",
|
||||
"\n",
|
||||
"It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition — two operations which are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n",
|
||||
"It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n",
|
||||
"\n",
|
||||
"A convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3x3 matrix in the top right of <<basic_conv>>."
|
||||
"A convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3\\*3 matrix in the top right of <<basic_conv>>."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -79,9 +79,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The 7x7 grid to the left is our *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel, to each element of a 3x3 block of the image. The results of these multiplications are then added together. The diagram above shows an example of applying a kernel to a single location in the image, the 3x3 block around cell 18.\n",
|
||||
"The 7\\*7 grid to the left is the *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel by each element of a 3\\*3 block of the image. The results of these multiplications are then added together. The diagram in <<basic_conv>> shows an example of applying a kernel to a single location in the image, the 3\\*3 block around cell 18.\n",
|
||||
"\n",
|
||||
"Let's do this with code. First, we create a little 3x3 matrix like so:"
|
||||
"Let's do this with code. First, we create a little 3\\*3 matrix like so:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -99,8 +99,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We're going to call this our *kernel*\n",
|
||||
"(because that's what fancy computer vision researchers call these). And we'll need an image, of course:"
|
||||
"We're going to call this our kernel (because that's what fancy computer vision researchers call these). And we'll need an image, of course:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -149,7 +148,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we're going to take the top 3x3 pixel square of our image, and we'll multiply each of those by each item in our kernel. Then we'll add them up. Like so:"
|
||||
"Now we're going to take the top 3\\*3-pixel square of our image, and multiply each of those values by each item in our kernel. Then we'll add them up, like so:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -199,7 +198,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Not very interesting so far - they are all white pixels in the top left corner. But let's pick a couple of more interesting spots:"
|
||||
"Not very interesting so far--all the pixels in the top-left corner are white. But let's pick a couple of more interesting spots:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1324,15 +1323,15 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you can see, this little calculation is returning a high number where the 3x3 pixel square represents a top edge (i.e. where there are low values at the top of the square, and high values immediately underneath). That's because the `-1` values in our kernel have little impact in that case, but the `1` values have a lot.\n",
|
||||
"As you can see, this little calculation is returning a high number where the 3\\*3-pixel square represents a top edge (i.e., where there are low values at the top of the square, and high values immediately underneath). That's because the `-1` values in our kernel have little impact in that case, but the `1` values have a lot.\n",
|
||||
"\n",
|
||||
"Let's look a tiny bit at the math. The filter will take any window of size 3 by 3 in our images, and if we name the pixel values like this:\n",
|
||||
"Let's look a tiny bit at the math. The filter will take any window of size 3\\*3 in our images, and if we name the pixel values like this:\n",
|
||||
"\n",
|
||||
"$$\\begin{matrix} a1 & a2 & a3 \\\\ a4 & a5 & a6 \\\\ a7 & a8 & a9 \\end{matrix}$$\n",
|
||||
"\n",
|
||||
"it will return $a1+a2+a3-a7-a8-a9$. Now if we are in a part of the image where there $a1$, $a2$ and $a3$ are kind of the same as $a7$, $a8$ and $a9$, then the terms will cancel each other and we will get 0. However if $a1$ is greater than $a7$, $a2$ is greater than $a8$ and $a3$ is greater than $a9$, we will get a bigger number as a result. So this filter detects horizontal edges, more precisely edges where we go from bright parts of the image at the top to darker parts at the bottom.\n",
|
||||
"it will return $a1+a2+a3-a7-a8-a9$. If we are in a part of the image where $a1$, $a2$, and $a3$ add up to the same as $a7$, $a8$, and $a9$, then the terms will cancel each other out and we will get 0. However, if $a1$ is greater than $a7$, $a2$ is greater than $a8$, and $a3$ is greater than $a9$, we will get a bigger number as a result. So this filter detects horizontal edges--more precisely, edges where we go from bright parts of the image at the top to darker parts at the bottom.\n",
|
||||
"\n",
|
||||
"Changing our filter to have the row of ones at the top and the -1 at the bottom would detect horizonal edges that go from dark to light. Putting the ones and -1 in columns versus rows would give us a filter that detect vertical edges. Each set of weights will produce a different kind of outcome.\n",
|
||||
"Changing our filter to have the row of `1`s at the top and the `-1`s at the bottom would detect horizonal edges that go from dark to light. Putting the `1`s and `-1`s in columns versus rows would give us filters that detect vertical edges. Each set of weights will produce a different kind of outcome.\n",
|
||||
"\n",
|
||||
"Let's create a function to do this for one location, and check it matches our result from before:"
|
||||
]
|
||||
@ -1371,7 +1370,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"But note that we can't apply it to the corner (such as location 0,0), since there isn't a complete 3x3 square there."
|
||||
"But note that we can't apply it to the corner (e.g., location 0,0), since there isn't a complete 3\\*3 square there."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1385,7 +1384,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can map `apply_kernel()` across the coordinate grid. That is, we'll be taking our 3x3 kernel, and applying it to each 3x3 section of our image. For instance, here are the positions a 3x3 kernel can be applied to in the first row of a 5x5 image:"
|
||||
"We can map `apply_kernel()` across the coordinate grid. That is, we'll be taking our 3\\*3 kernel, and applying it to each 3\\*3 section of our image. For instance, <<nopad_conv>> shows the positions a 3\\*3 kernel can be applied to in the first row of a 5\\*5 image."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1399,7 +1398,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To get a *grid* of coordinates we can use a *nested list comprehension*, like so:"
|
||||
"To get a grid of coordinates we can use a *nested list comprehension*, like so:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1429,14 +1428,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> note: Nested list comprehensions are used a lot in Python, so if you haven't seen them before, take a few minutes to make sure you understand what's happening here, and experiment with writing your own nested list comprehensions."
|
||||
"> note: Nested List Comprehensions: Nested list comprehensions are used a lot in Python, so if you haven't seen them before, take a few minutes to make sure you understand what's happening here, and experiment with writing your own nested list comprehensions."
|
||||
]
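For example, here is a tiny nested list comprehension building a coordinate grid (the 3x3 size is arbitrary):

```python
grid = [[(i, j) for j in range(3)] for i in range(3)]  # outer loop: rows, inner: columns
print(grid[1][2])  # (1, 2)
```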
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here's the result of applying our kernel over a coordinate grid."
|
||||
"Here's the result of applying our kernel over a coordinate grid:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1468,7 +1467,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Looking good! Our top edges are black, and bottom edges are white (since they are the *opposite* of top edges). Now that our *image* contains negative numbers too, matplotlib has automatically changed our colors, so that white is the smallest number in the image, black the highest, and zeros appear as grey.\n",
|
||||
"Looking good! Our top edges are black, and bottom edges are white (since they are the *opposite* of top edges). Now that our image contains negative numbers too, `matplotlib` has automatically changed our colors so that white is the smallest number in the image, black the highest, and zeros appear as gray.\n",
|
||||
"\n",
|
||||
"We can try the same thing for left edges:"
|
||||
]
|
||||
@ -1505,21 +1504,21 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This operation of applying a kernel over a grid in this way is called *convolution*. In the paper [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285) there are many great diagrams showing how image kernels can be applied. Here's an example from the paper showing (at bottom) a light blue 4x4 image, with a dark blue 3x3 kernel being applied, creating a 2x2 green output activation map at the top. "
|
||||
"As we mentioned before, a convolution is the operation of applying such a kernel over a grid in this way. In the paper [\"A Guide to Convolution Arithmetic for Deep Learning\"](https://arxiv.org/abs/1603.07285) there are many great diagrams showing how image kernels can be applied. Here's an example from the paper showing (at the bottom) a light blue 4\\*4 image, with a dark blue 3\\*3 kernel being applied, creating a 2\\*2 green output activation map at the top. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Result of applying a 3x3 kernel to a 4x4 image\" width=\"782\" caption=\"Result of applying a 3x3 kernel to a 4x4 image (curtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_ex_four_conv\" src=\"images/att_00028.png\">"
|
||||
"<img alt=\"Result of applying a 3\\*3 kernel to a 4\\*4 image\" width=\"782\" caption=\"Result of applying a 3\\*3 kernel to a 4\\*4 image (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_ex_four_conv\" src=\"images/att_00028.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Look at the shape of the result. If the original image has a height of `h` and a width of `w`, how many 3 by 3 windows can we find? As you see from the example, there are `h-2` by `w-2` windows, so the image we get as a result as a height of `h-2` and a witdh of `w-2`."
|
||||
"Look at the shape of the result. If the original image has a height of `h` and a width of `w`, how many 3\\*3 windows can we find? As you can see from the example, there are `h-2` by `w-2` windows, so the image we get has a result as a height of `h-2` and a width of `w-2`."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1540,16 +1539,16 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Convolution is such an important and widely-used operation that PyTorch has it builtin. It's called `F.conv2d` (Recall F is a fastai import from torch.nn.functional as recommended by PyTorch). The PyTorch docs tell us that it includes these parameters:\n",
|
||||
"Convolution is such an important and widely used operation that PyTorch has it built in. It's called `F.conv2d` (recall that `F` is a fastai import from `torch.nn.functional`, as recommended by PyTorch). The PyTorch docs tell us that it includes these parameters:\n",
|
||||
"\n",
|
||||
"- **input**: input tensor of shape `(minibatch, in_channels, iH, iW)`\n",
|
||||
"- **weight**: filters of shape `(out_channels, in_channels, kH, kW)`\n",
|
||||
"- input:: input tensor of shape `(minibatch, in_channels, iH, iW)`\n",
|
||||
"- weight:: filters of shape `(out_channels, in_channels, kH, kW)`\n",
|
||||
"\n",
|
||||
"Here `iH,iW` is the height and width of the image (i.e. `28,28`), and `kH,kW` is the height and width of our kernel (`3,3`). But apparently PyTorch is expecting rank 4 tensors for both these arguments, but currently we only have rank 2 tensors (i.e. matrices, arrays with two axes).\n",
|
||||
"Here `iH,iW` is the height and width of the image (i.e., `28,28`), and `kH,kW` is the height and width of our kernel (`3,3`). But apparently PyTorch is expecting rank-4 tensors for both these arguments, whereas currently we only have rank-2 tensors (i.e., matrices, or arrays with two axes).\n",
|
||||
"\n",
|
||||
"The reason for these extra axes is that PyTorch has a few tricks up its sleeve. The first trick is that PyTorch can apply a convolution to multiple images at the same time. That means we can call it on every item in a batch at once!\n",
|
||||
"\n",
|
||||
"The second trick is that PyTorch can apply multiple kernels at the same time. So let's create the diagonal edge kernels too, and then stack all 4 of our edge kernels into a single tensor:"
|
||||
"The second trick is that PyTorch can apply multiple kernels at the same time. So let's create the diagonal-edge kernels too, and then stack all four of our edge kernels into a single tensor:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1584,7 +1583,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In order to test on a mini-batch, we'll need a `DataLoader`, and a sample mini-batch. Let's use the data block API:"
|
||||
"To test this, we'll need a `DataLoader` and a sample mini-batch. Let's use the data block API:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1634,9 +1633,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One batch contains 64 images, each of 1 channel, with 28x28 pixels. `F.conv2d` can handle multi-channel (e.g. colour) images. A *channel* is a single basic color in an image--for regular full color images there are 3 channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions channels x rows x columns.\n",
|
||||
"One batch contains 64 images, each of 1 channel, with 28\\*28 pixels. `F.conv2d` can handle multichannel (i.e., color) images too. A *channel* is a single basic color in an image--for regular full-color images there are three channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions `[channels, rows, columns]`.\n",
|
||||
"\n",
|
||||
"We'll see how to handle more than one channel later in this chapter. Kernels passed to `F.conv2d` need to be rank-4 tensors: channels_in x features_out x rows x columns. `edge_kernels` is currently missing one of these: the `1` for features_out. We need to tell PyTorch that the number of input channels in the kernel is one, by inserting an axis of size one (this is known as a *unit axis*) in the first location, since the PyTorch docs show that's where `in_channels` is expected. To insert a unit axis into a tensor, use the `unsqueeze` method:"
|
||||
"We'll see how to handle more than one channel later in this chapter. Kernels passed to `F.conv2d` need to be rank-4 tensors: `[channels_in, features_out, rows, columns]`. `edge_kernels` is currently missing one of these. We need to tell PyTorch that the number of input channels in the kernel is one, which we can do by inserting an axis of size one (this is known as a *unit axis*) in the first location, where the PyTorch docs show `in_channels` is expected. To insert a unit axis into a tensor, we use the `unsqueeze` method:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1700,7 +1699,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The output shape shows our 64 images in the mini-batch, 4 kernels, and 26x26 edge maps (we started with 28x28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:"
|
||||
"The output shape shows we gave 64 images in the mini-batch, 4 kernels, and 26\\*26 edge maps (we started with 28\\*28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1729,7 +1728,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallel. That is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these one at a time, we'll often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we'd be millions of times slower!) Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time."
|
||||
"The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallel--that is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these operations one at a time, we'd often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we'd be millions of times slower!). Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1750,7 +1749,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With appropriate padding, we can ensure that the output activation map is the same size as the original image, which can make things a lot simpler when we construct our architectures."
|
||||
"With appropriate padding, we can ensure that the output activation map is the same size as the original image, which can make things a lot simpler when we construct our architectures. <<pad_conv>> shows how adding padding allows us to apply the kernels in the image corners."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1764,44 +1763,44 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With a 5x5 input, and 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map, as we can see in <<four_by_five_conv>>"
|
||||
"With a 5\\*5 input, 4\\*4 kernel, and 2 pixels of padding, we end up with a 6\\*6 activation map, as we can see in <<four_by_five_conv>>."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"4x4 kernel with 5x5 input and 2 pixels of padding\" width=\"783\" caption=\"4x4 kernel with 5x5 input and 2 pixels of padding (curtesy of Vincent Dumoulin and Francesco Visin)\" id=\"four_by_five_conv\" src=\"images/att_00029.png\">"
|
||||
"<img alt=\"A 4\\*4 kernel with 5\\*5 input and 2 pixels of padding\" width=\"783\" caption=\"A 4\\*4 kernel with 5\\*5 input and 2 pixels of padding (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"four_by_five_conv\" src=\"images/att_00029.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If we add a kernel of size `ks` by `ks` (with `ks` an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even number for `ks` would require a different amount of padding on the top/bottom, left/right, but in practice we almost never use an even filter size.\n",
|
||||
"If we add a kernel of size `ks` by `ks` (with `ks` an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even number for `ks` would require a different amount of padding on the top/bottom and left/right, but in practice we almost never use an even filter size.\n",
|
||||
"\n",
|
||||
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application as in <<three_by_five_conv>>. This is known as a *stride 2* convolution. The most common kernel size in practice is 3x3, and the most common padding is 1. As you'll see, stride 2 convolutions are useful for decreasing the size of our outputs, and stride 1 convolutions are useful for adding layers without changing the output size."
|
||||
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application, as in <<three_by_five_conv>>. This is known as a *stride-2* convolution. The most common kernel size in practice is 3\\*3, and the most common padding is 1. As you'll see, stride-2 convolutions are useful for decreasing the size of our outputs, and stride-1 convolutions are useful for adding layers without changing the output size."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"3x3 kernel with 5x5 input, stride 2 convolution, and 1 pixel of padding\" width=\"774\" caption=\"3x3 kernel with 5x5 input, stride 2 convolution, and 1 pixel of padding (curtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_by_five_conv\" src=\"images/att_00030.png\">"
|
||||
"<img alt=\"A 3\\*3 kernel with 5\\*5 input, stride-2 convolution, and 1 pixel of padding\" width=\"774\" caption=\"A 3\\*3 kernel with 5\\*5 input, stride-2 convolution, and 1 pixel of padding (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_by_five_conv\" src=\"images/att_00030.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In an image of size `h` by `w` like before, using a padding of 1 and a stride of 2 will give us a result of size `(h+1)//2` by `(w+1)//2`. The general formula for each dimension is `(n + 2*pad - ks)//stride + 1` where `pad` is the padding, `ks` the size of our kernel and `stride` the stride."
|
||||
"In an image of size `h` by `w`, using a padding of 1 and a stride of 2 will give us a result of size `(h+1)//2` by `(w+1)//2`. The general formula for each dimension is `(n + 2*pad - ks)//stride + 1`, where `pad` is the padding, `ks`, the size of our kernel, and `stride` is the stride."
|
||||
]
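If it helps, the formula can be wrapped in a small (hypothetical) helper to check a few cases:

```python
def conv_out_size(n, ks, stride=1, pad=0):
    # output size along one dimension, from the formula above
    return (n + 2*pad - ks)//stride + 1

print(conv_out_size(28, ks=3, stride=2, pad=1))  # 14
print(conv_out_size(28, ks=3, stride=1, pad=1))  # 28
```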
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's now have a look at how the pixel values of the result of our convolutions are computed."
|
||||
"Let's now take a look at how the pixel values of the result of our convolutions are computed."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1817,7 +1816,7 @@
|
||||
"source": [
|
||||
"To explain the math behing convolutions, fast.ai student Matt Kleinsmith came up with the very clever idea of showing [CNNs from different viewpoints](https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c). In fact, it's so clever, and so helpful, we're going to show it here too!\n",
|
||||
"\n",
|
||||
"Here's our 3x3 pixel *image*, with each *pixel* labeled with a letter:"
|
||||
"Here's our 3\\*3 pixel image, with each pixel labeled with a letter:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1831,7 +1830,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"...and our kernel, with each weight labeled with a greek letter:"
|
||||
"And here's our kernel, with each weight labeled with a Greek letter:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1859,7 +1858,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here’s how we applied the kernel to each section of the image to yield each result:"
|
||||
"<<apply_kernel>> shows how we applied the kernel to each section of the image to yield each result."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1873,7 +1872,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The equation view:"
|
||||
"The equation view is in <<eq_view>>."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1887,26 +1886,17 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Notice that the bias term, b, is the same for each section of the image. You can consider the bias as part of the filter, just like the weights (α, β, γ, δ) are part of the filter.\n",
|
||||
"\n",
|
||||
"The compact equation view:"
|
||||
"Notice that the bias term, *b*, is the same for each section of the image. You can consider the bias as part of the filter, just like the weights (α, β, γ, δ) are part of the filter."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"The compact equation\" width=\"218\" src=\"images/att_00037.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here's an interesting insight -- a convolution can be represented as a special kind of matrix multiplication. The weight matrix is just like the ones from traditional neural networks. However, this weight matrix has two special properties:\n",
|
||||
"Here's an interesting insight--a convolution can be represented as a special kind of matrix multiplication, as illustrated in <<conv_matmul>>. The weight matrix is just like the ones from traditional neural networks. However, this weight matrix has two special properties:\n",
|
||||
"\n",
|
||||
"1. The zeros shown in gray are untrainable. This means that they’ll stay zero throughout the optimization process.\n",
|
||||
"1. Some of the weights are equal, and while they are trainable (i.e. changeable), they must remain equal. These are called *shared weights*.\n",
|
||||
"1. Some of the weights are equal, and while they are trainable (i.e., changeable), they must remain equal. These are called *shared weights*.\n",
|
||||
"\n",
|
||||
"The zeros correspond to the pixels that the filter can't touch. Each row of the weight matrix corresponds to one application of the filter."
|
||||
]
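Here is a minimal sketch of that equivalence for a 3x3 image and a 2x2 kernel (random values); the matrix `W` below is the special weight matrix just described, with untrainable zeros and the shared weights `a, b, c, d` repeated in every row:

```python
import torch
import torch.nn.functional as F

img    = torch.randn(3, 3)
kernel = torch.randn(2, 2)
a, b, c, d = [float(v) for v in kernel.flatten()]

# Each row applies the kernel at one location of the image (flattened row-wise)
W = torch.tensor([
    [a, b, 0, c, d, 0, 0, 0, 0],
    [0, a, b, 0, c, d, 0, 0, 0],
    [0, 0, 0, a, b, 0, c, d, 0],
    [0, 0, 0, 0, a, b, 0, c, d],
])

conv_out   = F.conv2d(img[None, None], kernel[None, None]).flatten()
matmul_out = W @ img.flatten()
print(torch.allclose(conv_out, matmul_out))  # True
```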
|
||||
@ -1936,11 +1926,11 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There is no reason to believe that some particular edge filters are the most useful kernels for image recognition. Furthermore, we've seen that in later layers convolutional kernels become complex transformations of features from lower levels — we do not have a good idea of how to manually construct these.\n",
|
||||
"There is no reason to believe that some particular edge filters are the most useful kernels for image recognition. Furthermore, we've seen that in later layers convolutional kernels become complex transformations of features from lower levels, but we don't have a good idea of how to manually construct these.\n",
|
||||
"\n",
|
||||
"Instead, it would be best to learn the values of the kernels. We already know how to do this—SGD! In effect, the model will learn the features that are useful for classification.\n",
|
||||
"\n",
|
||||
"When we use convolutions instead of (or in addition to) regular linear layers we create a *convolutional neural network*, or *CNN*."
|
||||
"When we use convolutions instead of (or in addition to) regular linear layers we create a *convolutional neural network* (CNN)."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2027,11 +2017,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One thing to note here is that we didn't need to specify \"28x28\" as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel. So it needs to know how many pixels there are. But a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels, and the kernel size, as we saw in the previous section.\n",
|
||||
"One thing to note here is that we didn't need to specify 28\\*28 as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel, so it needs to know how many pixels there are, but a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels and the kernel size, as we saw in the previous section.\n",
|
||||
"\n",
|
||||
"Have a think about what the output shape is going to be.\n",
|
||||
"\n",
|
||||
"Let's try it and see:"
|
||||
"Think about what the output shape is going to be, then let's try it and see:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2058,7 +2046,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is not something we can use to do classification, since we need a single output activation per image, not a 28x28 map of activations. One way to deal with this is to use enough stride-2 convolutions such that the final layer is size 1. That is, after one stride-2 convolution, the size will be 14x14, after 2 it will be 7x7, then 4x4, 2x2, and finally size 1.\n",
|
||||
"This is not something we can use to do classification, since we need a single output activation per image, not a 28\\*28 map of activations. One way to deal with this is to use enough stride-2 convolutions such that the final layer is size 1. That is, after one stride-2 convolution the size will be 14\\*14, after two it will be 7\\*7, then 4\\*4, 2\\*2, and finally size 1.\n",
|
||||
"\n",
|
||||
"Let's try that now. First, we'll define a function with the basic parameters we'll use in each convolution:"
|
||||
]
|
||||
@ -2079,7 +2067,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> important: Refactoring parts of your neural networks like this makes it much less likely you'll get errors due to inconsistencies in your architectures, and makes it more obvious to the reader which parts of your layers are actually changing."
|
||||
"> important: Refactoring: Refactoring parts of your neural networks like this makes it much less likely you'll get errors due to inconsistencies in your architectures, and makes it more obvious to the reader which parts of your layers are actually changing."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2093,7 +2081,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: channels and features: These two terms are largely used interchangably, and refer to the size of the second axis of a weight matrix, which is, therefore, the number of activations per grid cell after a convolution. *Features* is never used to refer to the input data, but *channels* can refer to either the input data (generally channels are colors) or activations inside the network."
|
||||
"> jargon: channels and features: These two terms are largely used interchangably, and refer to the size of the second axis of a weight matrix, which is, the number of activations per grid cell after a convolution. _Features_ is never used to refer to the input data, but _channels_ can refer to either the input data (generally channels are colors) or activations inside the network."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here is how we can build a simple CNN:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2116,14 +2111,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> j: I like to add comments like the above after each convolution to show how large the activation map will be after each layer. The above comments assume that the input size is 28x28"
|
||||
"> j: I like to add comments like the ones here after each convolution to show how large the activation map will be after each layer. These comments assume that the input size is 28*28"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now the network outputs two activations, which maps to the two possible levels in our labels:"
|
||||
"Now the network outputs two activations, which map to the two possible levels in our labels:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2166,7 +2161,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To see exactly what's going on in your model, use `summary`:"
|
||||
"To see exactly what's going on in the model, we can use `summary`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2228,7 +2223,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that the output of the final Conv2d layer is `64x2x1x1`. We need to remove those extra `1x1` axes; that's what `Flatten` does. It's basically the same as PyTorch's `squeeze` method, but as a module.\n",
|
||||
"Note that the output of the final `Conv2d` layer is `64x2x1x1`. We need to remove those extra `1x1` axes; that's what `Flatten` does. It's basically the same as PyTorch's `squeeze` method, but as a module.\n",
|
||||
"\n",
|
||||
"Let's see if this trains! Since this is a deeper network than we've built from scratch before, we'll use a lower learning rate and more epochs:"
|
||||
]
|
||||
@ -2285,7 +2280,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Success! It's getting closer to the resnet-18 result we had, although it's not quite there yet, and it's taking more epochs, and we're needing to use a lower learning rate. So we've got a few more tricks still to learn--but we're getting closer and closer to being able to create a modern CNN from scratch."
|
||||
"Success! It's getting closer to the `resnet18` result we had, although it's not quite there yet, and it's taking more epochs, and we're needing to use a lower learning rate. We still have a few more tricks to learn, but we're getting closer and closer to being able to create a modern CNN from scratch."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2299,7 +2294,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see from the summary that we have an input of size `64x1x28x28`. The axes are: `batch,channel,height,width`. This is often represented as `NCHW` (where `N` refers to batch size). Tensorflow, on the other hand, uses `NHWC` axis order. The first layer is:"
|
||||
"We can see from the summary that we have an input of size `64x1x28x28`. The axes are `batch,channel,height,width`. This is often represented as `NCHW` (where `N` refers to batch size). Tensorflow, on the other hand, uses `NHWC` axis order. The first layer is:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2330,7 +2325,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"So we have 1 channel input, 4 channel output, and a 3x3 kernel. Let's check the weights of the first convolution:"
|
||||
"So we have 1 input channel, 4 output channels, and a 3\\*3 kernel. Let's check the weights of the first convolution:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2357,7 +2352,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The summary shows we have 40 parameters, and `4*1*3*3` is 36. What are the other 4 parameters? Let's see what the bias contains:"
|
||||
"The summary shows we have 40 parameters, and `4*1*3*3` is 36. What are the other four parameters? Let's see what the bias contains:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2384,18 +2379,18 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now use this information to better understand our earlier statement in this section: \"because we're decreasing the number of activations in the activation map by a factor of 4; we don't want to decrease the capacity of a layer by too much at a time\".\n",
|
||||
"We can now use this information to clarify our statement in the previous section: \"When we use a stride-2 convolution, we often increase the number of features because we're decreasing the number of activations in the activation map by a factor of 4; we don't want to decrease the capacity of a layer by too much at a time.\"\n",
|
||||
"\n",
|
||||
"There is one bias for each channel. (Sometimes channels are called *features* or *filters* when they are not input channels.) The output shape is `64x4x14x14`, and this will therefore become the input shape to the next layer. The next layer, according to `summary`, has 296 parameters. Let's ignore the batch axis to keep things simple. So for each of `14*14=196` locations we are multiplying `296-8=288` weights (ignoring the bias for simplicity), so that's `196*288=56_448` multiplications at this layer. The next layer will have `7*7*(1168-16)=56_448` multiplications.\n",
|
||||
"\n",
|
||||
"So what happened here is that our stride 2 conv halved the *grid size* from `14x14` to `7x7`, and we doubled the *number of filters* from 8 to 16, resulting in no overall change in the amount of computation. If we left the number of channels the same in each stride 2 layer, the amount of computation being done in the net would get less and less as it gets deeper. But we know that the deeper layers have to compute semantically rich features (such as eyes, or fur), so we wouldn't expect that doing *less* compute would make sense."
|
||||
"What happened here is that our stride-2 convolution halved the *grid size* from `14x14` to `7x7`, and we doubled the *number of filters* from 8 to 16, resulting in no overall change in the amount of computation. If we left the number of channels the same in each stride-2 layer, the amount of computation being done in the net would get less and less as it gets deeper. But we know that the deeper layers have to compute semantically rich features (such as eyes or fur), so we wouldn't expect that doing *less* computation would make sense."
|
||||
]
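Here is the arithmetic from these two paragraphs written out, so it is easy to check:

```python
# The multiplication counts quoted above, written out:
print(14*14 * (296 - 8))   # 56448 -- the layer with 296 parameters
print(7*7  * (1168 - 16))  # 56448 -- the next layer, with 1,168 parameters
```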
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Another way to think of this is based on *receptive fields*."
|
||||
"Another way to think of this is based on receptive fields."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2409,60 +2404,60 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The \"receptive field\" is the area of an image that is involved in the calculation of a layer. On the book website, you'll find an Excel spreadsheet called `conv-example.xlsx` that shows the calculation of two stride 2 convolutional layers using an MNIST digit. Each layer has a single kernel. If we click on one of the cells in the *conv2* section, which shows the output of the second convolutional layer, and click *trace precendents*, we see this:"
|
||||
"The *receptive field* is the area of an image that is involved in the calculation of a layer. On the [book's website](https://book.fast.ai/), you'll find an Excel spreadsheet called *conv-example.xlsx* that shows the calculation of two stride-2 convolutional layers using an MNIST digit. Each layer has a single kernel. <<preced1>> shows what we see if we click on one of the cells in the *conv2* section, which shows the output of the second convolutional layer, and click *trace precendents*."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Immediate precedents of conv2 layer\" width=\"308\" caption=\"Immediate precedents of conv2 layer\" id=\"preced1\" src=\"images/att_00068.png\">"
|
||||
"<img alt=\"Immediate precedents of conv2 layer\" width=\"308\" caption=\"Immediate precedents of Conv2 layer\" id=\"preced1\" src=\"images/att_00068.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here, the green cell is the cell we clicked on, and the blue highlighted cells are its *precedents*--that is, the cells used to calculate its value. These cells are the corresponding 3x3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *show precedents* again, to show what cells are used to calculate these inputs, and see what happens:"
|
||||
"Here, the cell with the green border is the cell we clicked on, and the blue highlighted cells are its *precedents*--that is, the cells used to calculate its value. These cells are the corresponding 3\\*3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *trace precedents* again, to see what cells are used to calculate these inputs. <<preced2>> shows what happens."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Secondary precedents of conv2 layer\" width=\"601\" caption=\"Secondary precedents of conv2 layer\" id=\"preced2\" src=\"images/att_00069.png\">"
|
||||
"<img alt=\"Secondary precedents of conv2 layer\" width=\"601\" caption=\"Secondary precedents of Conv2 layer\" id=\"preced2\" src=\"images/att_00069.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this example, we just have two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7x7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7x7 area is the *receptive field* in the Input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.\n",
|
||||
"In this example, we have just two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7\\*7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7\\*7 area is the *receptive field* in the input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.\n",
|
||||
"\n",
|
||||
"As you see from this example, the deeper we are in the network (specifically, the more stride 2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer. So we know now that in the deeper layers of the network, we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of seeing the same thing we saw in the previous section: when we introduce a stride 2 conv in our network, we should also increase the number of channels."
|
||||
"As you see from this example, the deeper we are in the network (specifically, the more stride-2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer is. We now know that in the deeper layers of the network we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of saying the same thing we mentionedin the previous section: when we introduce a stride-2 conv in our network, we should also increase the number of channels."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When writing this particular chapter, we had a lot of questions we needed answers for, to be able to explain to you those CNNs as best we could. Believe it or not, we found most of the answers on Twitter. "
|
||||
"When writing this particular chapter, we had a lot of questions we needed answers for, to be able to explain CNNs to you as best we could. Believe it or not, we found most of the answers on Twitter. We're going to take a quick break to talk to you about that now, before we move on to color images."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### A Note about Twitter"
|
||||
"### A Note About Twitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We are not, to say the least, big users of social networks in general. But our goal of this book is to help you become the best deep learning practitioner you can, and we would be remiss not to mention how important Twitter has been in our own deep learning journeys.\n",
|
||||
"We are not, to say the least, big users of social networks in general. But our goal in writing this book is to help you become the best deep learning practitioner you can, and we would be remiss not to mention how important Twitter has been in our own deep learning journeys.\n",
|
||||
"\n",
|
||||
"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, which is the part of Twitter where deep learning researchers and practitioners talk shop every day. As we were writing the section above, Jeremy wanted to double-check to ensure that what we were saying about stride 2 convolutions was accurate, so he asked on Twitter:"
|
||||
"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, which is the part of Twitter where deep learning researchers and practitioners talk shop every day. As we were writing this section, Jeremy wanted to double-checkthat what we were saying about stride-2 convolutions was accurate, so he asked on Twitter:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2490,7 +2485,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Christian Szegedy is the first author of [Inception](https://arxiv.org/pdf/1409.4842.pdf), the 2014 Imagenet winner and source of many key insights used in modern neural networks. Two hours later, this appeared:"
|
||||
"Christian Szegedy is the first author of [Inception](https://arxiv.org/pdf/1409.4842.pdf), the 2014 ImageNet winner and source of many key insights used in modern neural networks. Two hours later, this appeared:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2504,9 +2499,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Do you recognize that name? We saw it in <<chapter_production>>, when we were talking about the Turing Award winners who set the foundation of deep learning today!\n",
|
||||
"Do you recognize that name? You saw it in <<chapter_production>>, when we were talking about the Turing Award winners who established the foundations of deep learning today!\n",
|
||||
"\n",
|
||||
"Jeremy also asked on Twitter for help checking our description of label smoothing in <<chapter_sizing_and_tta>> was accurate, and got a response from again from directly from Christian Szegedy (label smoothing was originally introduced in the Inception paper):"
|
||||
"Jeremy also asked on Twitter for help checking our description of label smoothing in <<chapter_sizing_and_tta>> was accurate, and got a response again from directly from Christian Szegedy (label smoothing was originally introduced in the Inception paper):"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2520,30 +2515,30 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Many of the top people in deep learning today are Twitter regulars, and are very open about interacting with the wider community. One good way to get started is to look at a list of Jeremy's [recent Twitter likes](https://twitter.com/jeremyphoward/likes), or [Sylvain's](https://twitter.com/GuggerSylvain/likes). That way, you can see a list of Twitter users that we thought had interesting and useful things to say.\n",
|
||||
"Many of the top people in deep learning today are Twitter regulars, and are very open about interacting with the wider community. One good way to get started is to look at a list of Jeremy's [recent Twitter likes](https://twitter.com/jeremyphoward/likes), or [Sylvain's](https://twitter.com/GuggerSylvain/likes). That way, you can see a list of Twitter users that we think have interesting and useful things to say.\n",
|
||||
"\n",
|
||||
"Twitter is the main way we both stay up to date with interesting papers, software releases, and other deep learning news. For making connections with the deep learning community, we recommend getting involved both in the [fast.ai forums](https://forums.fast.ai) and Twitter."
|
||||
"Twitter is the main way we both stay up to date with interesting papers, software releases, and other deep learning news. For making connections with the deep learning community, we recommend getting involved both in the [fast.ai forums](https://forums.fast.ai) and on Twitter."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Up until now, we have only shown you examples of pictures in black and white, with only one value per pixel. In practice, most colored images have three values per pixel to define is color."
|
||||
"That said, let's get back to the meat of this chapter. Up until now, we have only shown you examples of pictures in black and white, with one value per pixel. In practice, most colored images have three values per pixel to define their color. We'll look at working with color images next."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Colour Images"
|
||||
"## Color Images"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A colour picture is a rank-3 tensor."
|
||||
"A colour picture is a rank-3 tensor:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2593,7 +2588,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The first axis contains the channels: red, green, and blue:"
|
||||
"The first axis contains the channels, red, green, and blue:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2624,9 +2619,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolution layer will take an image with a certain number of channels (3 for the first layer for regular RGB color images) and output an image with a different number of channels. Like our hidden size that represented the numbers of neurons in a linear layer, we can decide to have has many filters as we want, and each of them will be able to specialize, some to detect horizontal edges, other to detect vertical edges and so forth, to give something like we studied in <<chapter_production>>.\n",
|
||||
"We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolutional layer will take an image with a certain number of channels (three for the first layer for regular RGB color images) and output an image with a different number of channels. Like our hidden size that represented the numbers of neurons in a linear layer, we can decide to have as many filters as we want, and each of them will be able to specialize, some to detect horizontal edges, others to detect vertical edges and so forth, to give something like we studied in <<chapter_production>>.\n",
|
||||
"\n",
|
||||
"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the example given by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
|
||||
"In one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channels in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter, then sum the results (as we saw before) and sum over all the filters. In the example given in <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2640,7 +2635,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"So, in order to apply a convolution to a colour picture we require a kernel tensor with a matching size as the first axis. At each location, the corresponding parts of the kernel and the image patch are multiplied together.\n",
|
||||
"So, in order to apply a convolution to a color picture we require a kernel tensor with a size that matches the first axis. At each location, the corresponding parts of the kernel and the image patch are multiplied together.\n",
|
||||
"\n",
|
||||
"These are then all added together, to produce a single number, for each grid location, for each output feature, as shown in <<rgbconv2>>."
|
||||
]
|
||||
@ -2656,15 +2651,15 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Then we have `ch_out` filters like this, so in the end, the result of our convolutional layer will be a batch of images with `ch_out` channels and a height and width given by the formula above. This give us `ch_out` tensors of size `ch_in x ks x ks` that we represent in one big tensor of 4 dimensions. In PyTorch, the order of the dimensions for those weights is `ch_out x ch_in x ks x ks`.\n",
|
||||
"Then we have `ch_out` filters like this, so in the end, the result of our convolutional layer will be a batch of images with `ch_out` channels and a height and width given by the formula outlined earlier. This give us `ch_out` tensors of size `ch_in x ks x ks` that we represent in one big tensor of four dimensions. In PyTorch, the order of the dimensions for those weights is `ch_out x ch_in x ks x ks`.\n",
|
||||
"\n",
|
||||
"Additionally, we may want to have a bias for each filter. In the example above, the final result for our convolutional layer would be $y_{R} + y_{G} + y_{B} + b$ in that case. Like in a linear layer, there are as many bias as we have kernels, so the bias is a vector of size `ch_out`.\n",
|
||||
"Additionally, we may want to have a bias for each filter. In the preceding example, the final result for our convolutional layer would be $y_{R} + y_{G} + y_{B} + b$ in that case. Like in a linear layer, there are as many bias as we have kernels, so the biases is a vector of size `ch_out`.\n",
|
||||
"\n",
|
||||
"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer has 3 inputs.\n",
|
||||
"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer has three inputs.\n",
|
||||
"\n",
|
||||
"There are lots of ways of processing color images. For instance, you can change them to black and white, or change from RGB to HSV (Hue, Saturation, and Value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
|
||||
"There are lots of ways of processing color images. For instance, you can change them to black and white, change from RGB to HSV (hue, saturation, and value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So, transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance, a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
|
||||
"\n",
|
||||
"Now you know what those pictures in <<chapter_intro>> of \"what a neural net learns\" from the Zeiler and Fergus paper mean! This is their picture of some of the layer 1 weights which we showed:"
|
||||
"Now you know what those pictures in <<chapter_intro>> of \"what a neural net learns\" from the [Zeiler and Fergus paper](https://arxiv.org/abs/1311.2901) mean! This is their picture of some of the layer 1 weights which we showed:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2678,9 +2673,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is taking the 3 slices of the convolutional kernel, for each output feature, and displaying them as images. We can see that even although the creators of the neural net never explicitly created kernels to find edges, for instance, the neural net automatically discovered these features using SGD.\n",
|
||||
"This is taking the three slices of the convolutional kernel, for each output feature, and displaying them as images. We can see that even though the creators of the neural net never explicitly created kernels to find edges, for instance, the neural net automatically discovered these features using SGD.\n",
|
||||
"\n",
|
||||
"Now let's see how we can train those CNNs, and show you all the techniques fastai uses behind the hood for efficient training."
|
||||
"Now let's see how we can train these CNNs, and show you all the techniques fastai uses under the hood for efficient training."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2694,7 +2689,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Since we are so good at recognizing threes from sevens, let's move onto something harder—recognizing all 10 digits. That means we'll need to use `MNIST` instead of `MNIST_SAMPLE`:"
|
||||
"Since we are so good at recognizing 3s from 7s, let's move on to something harder—recognizing all 10 digits. That means we'll need to use `MNIST` instead of `MNIST_SAMPLE`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2740,7 +2735,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The data is in two folders named `training` and `testing`, so we have to tell `GrandparentSplitter` about that (it defaults to `train` and `valid`). We define a function `get_dls` to make it easy to change our batch size later:"
|
||||
"The data is in two folders named *training* and *testing*, so we have to tell `GrandparentSplitter` about that (it defaults to `train` and `valid`). We de do that in the `get_dls` function, which we create to make it easy to change our batch size later:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2765,7 +2760,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It is always a good idea to look at your data before you use it:"
|
||||
"Remember, it's always a good idea to look at your data before you use it:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2808,7 +2803,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous section, we built a model based on a `conv` function like this:"
|
||||
"Earlier in this chapter, we built a model based on a `conv` function like this:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2827,13 +2822,13 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last Section, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
|
||||
"Let's start with a basic CNN as a baseline. We'll use the same one as earlier, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
|
||||
"\n",
|
||||
"As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer – then every layer after that will end up twice as big as the previous version as well.\n",
|
||||
"As we discussed, we generally want to double the number of filters each time we have a stride-2 layer. One way to increase the number of filters throughout our network is to double the number of activations in the first layer–then every layer after that will end up twice as big as in the previous version as well.\n",
|
||||
"\n",
|
||||
"But there is a subtle problem with this. Consider the kernel which is being applied to each pixel. By default, we use a 3x3 pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four filters output. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to 8 filters. Then when we apply our kernel we would be using nine pixels to calculate eight numbers. That means that it isn't really learning much at all — the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so—that is, that the number of outputs from an operation is smaller than the number of inputs.\n",
|
||||
"But there is a subtle problem with this. Consider the kernel that is being applied to each pixel. By default, we use a 3\\*3-pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we will be using nine pixels to calculate eight numbers. That means it isn't really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.\n",
|
||||
"\n",
|
||||
"To fix this, we can use a larger kernel in the first layer. If we use a kernel of 5x5 pixels then there are 25 pixels being used at each kernel application — creating eight filters from this will mean the neural net will have to find some useful features."
|
||||
"To fix this, we can use a larger kernel in the first layer. If we use a kernel of 5\\*5 pixels then there are 25 pixels being used at each kernel application. Creating eight filters from this will mean the neural net will have to find some useful features:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2857,7 +2852,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you'll see in a moment, we're going to look inside our models while they're training in order to try to find ways to make them train better. To do this, we use the `ActivationStats` callback, which records the mean, standard deviation, and histogram of activations of every trainable layer (as we've seen, callbacks are used to add behavior to the training loop; we'll see how they work in <<chapter_accel_sgd>>)."
|
||||
"As you'll see in a moment, we can look inside our models while they're training in order to try to find ways to make them train better. To do this we use the `ActivationStats` callback, which records the mean, standard deviation, and histogram of activations of every trainable layer (as we've seen, callbacks are used to add behavior to the training loop; we'll explore how they work in <<chapter_accel_sgd>>):"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2936,9 +2931,9 @@
|
||||
"source": [
|
||||
"This didn't train at all well! Let's find out why.\n",
|
||||
"\n",
|
||||
"One handy feature of the callbacks passed to `Learner` is that they are made available automatically, with the same name as the callback class, except in `camel_case`. So our `ActivationStats` callback can be accessed through `activation_stats`. In fact--I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
|
||||
"One handy feature of the callbacks passed to `Learner` is that they are made available automatically, with the same name as the callback class, except in `camel_case`. So, our `ActivationStats` callback can be accessed through `activation_stats`. I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
|
||||
"\n",
|
||||
"`ActivationStats` includes some handy utilities for plotting the activations during training. `plot_layer_stats(idx)` plots the mean and standard deviation of the activations of layer number `idx`, along with the percent of activations near zero. Here's the first layer's plot:"
|
||||
"`ActivationStats` includes some handy utilities for plotting the activations during training. `plot_layer_stats(idx)` plots the mean and standard deviation of the activations of layer number *`idx`*, along with the percentage of activations near zero. Here's the first layer's plot:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2996,7 +2991,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As expected, the problems get worse towards the end of the network, as the instability and zero activations compound over layers. The first thing we can do to make training more stable is to increase the batch size."
|
||||
"As expected, the problems get worse towards the end of the network, as the instability and zero activations compound over layers. Let's look at what we can do to make training more stable."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3010,7 +3005,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One way to make training more stable is to *increase the batch size*. Larger batches have gradients that are more accurate, since they're calculated from more data. On the downside though, a larger batch size means fewer batches per epoch, which means less opportunities for your model to update weights. Let's see if a batch size of 512 helps:"
|
||||
"One way to make training more stable is to increase the batch size. Larger batches have gradients that are more accurate, since they're calculated from more data. On the downside, though, a larger batch size means fewer batches per epoch, which means less opportunities for your model to update weights. Let's see if a batch size of 512 helps:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3110,20 +3105,20 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Our initial weights are not well suited to the task we're trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly, as we've seen above. We probably don't want to end training with a high learning rate either, so that we don't skip over a minimum. But we want to train at a high learning rate for the rest of training, because we'll be able to train more quickly. Therefore, we should change the learning rate during training, from low, to high, and then back to low again.\n",
|
||||
"Our initial weights are not well suited to the task we're trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly, as we've seen. We probably don't want to end training with a high learning rate either, so that we don't skip over a minimum. But we want to train at a high learning rate for the rest of the training period, because we'll be able to train more quickly that way. Therefore, we should change the learning rate during training, from low, to high, and then back to low again.\n",
|
||||
"\n",
|
||||
"Leslie Smith (yes, the same guy that invented the learning rate finder!) developed this idea in his article [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120) by designing a schedule for learning rate separated in two phases: one were the learning rate grows from the minimum value to the maximum value (*warm-up*), and then one where it decreases back to the minimum value (*annealing*). Smith called this combination of approaches *1cycle training*.\n",
|
||||
"Leslie Smith (yes, the same guy that invented the learning rate finder!) developed this idea in his article [\"Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates\"](https://arxiv.org/abs/1708.07120). He designed a schedule for learning rate separated into two phases: one where the learning rate grows from the minimum value to the maximum value (*warmup*), and one where it decreases back to the minimum value (*annealing*). Smith called this combination of approaches *1cycle training*.\n",
|
||||
"\n",
|
||||
"1cycle training allows us to use a much higher maximum learning rate than other types of training, which gives two benefits:\n",
|
||||
"\n",
|
||||
"- By training with higher learning rates, we train faster, a phenomenon Leslie N. Smith named *super-convergence*\n",
|
||||
"- By training with higher learning rates, we overfit less because we skip over the sharp local minimas to end-up in a smoother (and therefore more generalizable) part of the loss.\n",
|
||||
"- By training with higher learning rates, we train faster--a phenomenon Smith named *super-convergence*.\n",
|
||||
"- By training with higher learning rates, we overfit less because we skip over the sharp local minima to end up in a smoother (and therefore more generalizable) part of the loss.\n",
|
||||
"\n",
|
||||
"The second point is an interesting and subtle idea; it is based on the observation that a model that generalises well is one whose loss would not change very much if you change the input by a small amount. If a model trains at a large learning rate for quite a while, and can find a good loss when doing so, it must have found an area that also generalises well, because it is jumping around a lot from batch to batch (that is basically the definition of a high learning rate). The problem is that, as we have discussed, just jumping to a high learning rate is more likely to result in diverging losses, rather than seeing your losses improve. So we don't just jump to a high learning rate. Instead, we start at a low learning rate, where our losses do not diverge, and we allow the optimiser to gradually find smoother and smoother areas of our parameters, by gradually going to higher and higher learning rates.\n",
|
||||
"The second point is an interesting and subtle one; it is based on the observation that a model that generalizes well is one whose loss would not change very much if you changed the input by a small amount. If a model trains at a large learning rate for quite a while, and can find a good loss when doing so, it must have found an area that also generalizes well, because it is jumping around a lot from batch to batch (that is basically the definition of a high learning rate). The problem is that, as we have discussed, just jumping to a high learning rate is more likely to result in diverging losses, rather than seeing your losses improve. So we don't jump straight to a high learning rate. Instead, we start at a low learning rate, where our losses do not diverge, and we allow the optimizer to gradually find smoother and smoother areas of our parameters by gradually going to higher and higher learning rates.\n",
|
||||
"\n",
|
||||
"Then, once we have found a nice smooth area for our parameters, we then want to find the very best part of that area, which means we have to bring out learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models, and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
|
||||
"Then, once we have found a nice smooth area for our parameters, we want to find the very best part of that area, which means we have to bring our learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
|
||||
"\n",
|
||||
"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase.\n",
|
||||
"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also that continues in the direction of previous steps. Leslie Smith introduced the idea of *cyclical momentums* in [\"A Disciplined Approach to Neural Network Hyper-Parameters: Part 1\"](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase.\n",
|
||||
"\n",
|
||||
"We can use 1cycle training in fastai by calling `fit_one_cycle`:"
|
||||
]
|
||||
@ -3217,13 +3212,13 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Smith's original 1cycle paper used a linear warm-up and linear annealing. As you see above, we adapted the approach in fastai by combining it with another popular approach: cosine annealing. `fit_one_cycle` provides the following parameters you can adjust:\n",
|
||||
"Smith's original 1cycle paper used a linear warmup and linear annealing. As you can see, we adapted the approach in fastai by combining it with another popular approach: cosine annealing. `fit_one_cycle` provides the following parameters you can adjust:\n",
|
||||
"\n",
|
||||
"- `lr_max`:: The highest learning rate that will be used (this can also be a list of learning rates for each layer group, or a python `slice` object containing the first and last layer group learning rates)\n",
|
||||
"- `lr_max`:: The highest learning rate that will be used (this can also be a list of learning rates for each layer group, or a Python `slice` object containing the first and last layer group learning rates)\n",
|
||||
"- `div`:: How much to divide `lr_max` by to get the starting learning rate\n",
|
||||
"- `div_final`:: How much to divide `lr_max` by to get the ending learning rate\n",
|
||||
"- `pct_start`:: What % of the batches to use for the warmup\n",
|
||||
"- `moms`:: A tuple `(mom1,mom2,mom3)` where mom1 is the initial momentum, mom2 is the minimum momentum, and mom3 is the final momentum.\n",
|
||||
"- `pct_start`:: What percentage of the batches to use for the warmup\n",
|
||||
"- `moms`:: A tuple `(mom1,mom2,mom3)` where *`mom1`* is the initial momentum, *`mom2`* is the minimum momentum, and *`mom3`* is the final momentum\n",
|
||||
"\n",
|
||||
"Let's take a look at our layer stats again:"
|
||||
]
|
||||
@ -3254,7 +3249,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The % of non-zero weights is getting much better, although it's still quite high.\n",
|
||||
"The percentage of nonzero weights is getting much better, although it's still quite high.\n",
|
||||
"\n",
|
||||
"We can see even more about what's going on in our training using `color_dim`, passing it a layer index:"
|
||||
]
|
||||
@ -3285,7 +3280,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`color_dim` was developed by fast.ai in conjunction with a student, Stefano Giomo. Stefano, who refers to the idea as the *colorful dimension*, has a [detailed explanation](https://forums.fast.ai/t/the-colorful-dimension/42908) of the history and details behind the method. The basic idea is to create a histogram of the activations of a layer, which we would hope would follow a smooth pattern such as the normal distribution shown by Stefano here:"
|
||||
"`color_dim` was developed by fast.ai in conjunction with a student, Stefano Giomo. Stefano, who refers to the idea as the *colorful dimension*, provides an [in-depth explanation](https://forums.fast.ai/t/the-colorful-dimension/42908) of the history and details behind the method. The basic idea is to create a histogram of the activations of a layer, which we would hope would follow a smooth pattern such as the normal distribution (colorful_dist)."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3299,25 +3294,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To create `color_dim`, we take the histogram shown on the left here, and convert it into just the colored representation shown at the bottom. Then we flip it on its side, as shown on the right. We found that the distribution is clearer if we take the `log` of the histogram values. Then, Stefano describes:\n",
|
||||
"To create `color_dim`, we take the histogram shown on the left here, and convert it into just the colored representation shown at the bottom. Then we flip it on its side, as shown on the right. We found that the distribution is clearer if we take the log of the histogram values. Then, Stefano describes:\n",
|
||||
"\n",
|
||||
"> : The final plot for each layer is made by stacking the histogram of the activations from each batch along the horizontal axis. So each vertical slice in the visualisation represents the histogram of activations for a single batch. The color intensity corresponds to the height of the histogram, in other words the number of activations in each histogram bin.\n",
|
||||
"\n",
|
||||
"This is Stefano's picture of how this all fits together:"
|
||||
"<<colorful_summ>> shows how this all fits together."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img src=\"images/colorful_summ.png\" id=\"colorful_summ\" caption=\"Summary of 'colorful dimension'\" alt=\"Summary of 'colorful dimension'\" width=\"800\">"
|
||||
"<img src=\"images/colorful_summ.png\" id=\"colorful_summ\" caption=\"Summary of the colorful dimension (courtesy of Stefano Giomo)\" alt=\"Summary of the colorful dimension\" width=\"800\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It shows why log(f) is mre colorful than f when f follows a normal distribution because taking a log changes the gaussian in a quadratic."
|
||||
"This illustrates why log(f) is more colorful than *f* when *f* follows a normal distribution because taking a log changes the Gaussian in a quadratic, which isn't as narrow."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3353,9 +3348,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This shows a classic picture of \"bad training\". We start with nearly all activations at zero--that's what we see at the far left, with nearly all the left hand side dark blue; the bright yellow at the bottom are the near-zero activations. Then over the first few batches we see the number of non-zero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and then it collapses again. After repeating a few times, eventually we see a spread of activations throughout the range.\n",
|
||||
"This shows a classic picture of \"bad training.\" We start with nearly all activations at zero--that's what we see at the far left, with all the dark blue. The bright yellow at the bottom represents the near-zero activations. Then, over the first few batches we see the number of nonzero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and collapse again. After repeating this a few times, eventually we see a spread of activations throughout the range.\n",
|
||||
"\n",
|
||||
"It's much better if training can be smooth from the start. The cycles of exponential increase and then collapse that we see above tend to result in a lot of near-zero activations, resulting in slow training, and poor final results. One way to solve this problem is to use Batch normalization."
|
||||
"It's much better if training can be smooth from the start. The cycles of exponential increase and then collapse tend to result in a lot of near-zero activations, resulting in slow training and poor final results. One way to solve this problem is to use batch normalization."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3369,33 +3364,33 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To fix the slow training and poor final results we ended up with in the previous section, we need to both fix the initial large percentage of near-zero activations, and then try to maintain a good distribution of activations throughout training.\n",
|
||||
"To fix the slow training and poor final results we ended up with in the previous section, we need to fix the initial large percentage of near-zero activations, and then try to maintain a good distribution of activations throughout training.\n",
|
||||
"\n",
|
||||
"Sergey Ioffe and Christian Szegedy showed a solution to this problem in the 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). In the abstract, they describe just the problem that we've seen:\n",
|
||||
"Sergey Ioffe and Christian Szegedy presented a solution to this problem in the 2015 paper [\"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\"](https://arxiv.org/abs/1502.03167). In the abstract, they describe just the problem that we've seen:\n",
|
||||
"\n",
|
||||
"> : \"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization... We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.\"\n",
|
||||
"> : Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization... We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.\n",
|
||||
"\n",
|
||||
"Their solution, they say is:\n",
|
||||
"\n",
|
||||
"> : \"...making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.\"\n",
|
||||
"> : Making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.\n",
|
||||
"\n",
|
||||
"The paper caused great excitement as soon as it was released, because they showed the chart in <<batchnorm>>, which clearly demonstrated that batch normalization could train a model that was even more accurate than the current state of the art (the *inception* architecture), around 5x faster:"
|
||||
"The paper caused great excitement as soon as it was released, because it included the chart in <<batchnorm>>, which clearly demonstrated that batch normalization could train a model that was even more accurate than the current state of the art (the *Inception* architecture) and around 5x faster."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Impact of batch normalization\" width=\"553\" caption=\"Impact of batch normalization\" id=\"batchnorm\" src=\"images/att_00046.png\">"
|
||||
"<img alt=\"Impact of batch normalization\" width=\"553\" caption=\"Impact of batch normalization (courtesy of Sergey Ioffe and Christian Szegedy)\" id=\"batchnorm\" src=\"images/att_00046.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The way batch normalization (often just called *batchnorm*) works is that it takes an average of the mean and standard deviations of the activations of a layer, and uses those to normalize the activations. However, this can cause problems because the network might really want some activations to be really high in order to make accurate predictions, they also add two learnable parameters (meaning they will be updated in our SGD step), usually called `gamma` and `beta`; after normalizing the activations to get some new activation vector `y`, a batchnorm layer returns `gamma*y + beta`.\n",
|
||||
"Batch normalization (often just called *batchnorm*) works by taking an average of the mean and standard deviations of the activations of a layer and using those to normalize the activations. However, this can cause problems because the network might want some activations to be really high in order to make accurate predictions. So they also added two learnable parameters (meaning they will be updated in the SGD step), usually called `gamma` and `beta`. After normalizing the activations to get some new activation vector `y`, a batchnorm layer returns `gamma*y + beta`.\n",
|
||||
"\n",
|
||||
"That why our activations can have any mean or variance, which is independent from the mean and std of the results of the previous layer. Those statistics are learned separately, making training easier on our model. The behavior is different during training and validation: during training, we use the mean and standard deviation of the batch to normalize the data. During validation, we instead use a running mean of the statistics calculated during training.\n",
|
||||
"That's why our activations can have any mean or variance, independent from the mean and standard deviation of the results of the previous layer. Those statistics are learned separately, making training easier on our model. The behavior is different during training and validation: during training, we use the mean and standard deviation of the batch to normalize the data, while during validation we instead use a running mean of the statistics calculated during training.\n",
|
||||
"\n",
|
||||
"Let's add a batchnorm layer to `conv`:"
|
||||
]
|
||||
@ -3417,7 +3412,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"...and fit our model:"
|
||||
"and fit our model:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3494,11 +3489,11 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is just what we hope to see: a smooth development of activations, with no \"crashes\". Batchnorm has really delivered on its promise here! In fact, batchnorm has been so successful that we see it (or something very similar) today in nearly all modern neural networks.\n",
|
||||
"This is just what we hope to see: a smooth development of activations, with no \"crashes.\" Batchnorm has really delivered on its promise here! In fact, batchnorm has been so successful that we see it (or something very similar) in nearly all modern neural networks.\n",
|
||||
"\n",
|
||||
"An interesting observation about models containing batch normalisation layers is that they tend to generalise better than models that don't contain them. Although we haven't as yet seen a rigourous analysis of what's going on here, most researchers believe that the reason for this is that batch normalisation add some extra randomness to the training process. Each mini batch will have a somewhat different mean and standard deviation to other mini batches. Therefore, the activations will be normalised by different values each time. In order for the model to make accurate predictions, it will have to learn to become robust with these variations. In general, adding additional randomisation to the training process often helps.\n",
|
||||
"An interesting observation about models containing batch normalization layers is that they tend to generalize better than models that don't contain them. Although we haven't as yet seen a rigorous analysis of what's going on here, most researchers believe that the reason for this is that batch normalization adds some extra randomness to the training process. Each mini-batch will have a somewhat different mean and standard deviation than other mini-batches. Therefore, the activations will be normalized by different values each time. In order for the model to make accurate predictions, it will have to learn to become robust to these variations. In general, adding additional randomization to the training process often helps.\n",
|
||||
"\n",
|
||||
"Since things are going so well, let's train for a few more epochs and see how it goes. In fact, let's even *increase* the learning rate, since the abstract of the batchnorm paper claimed we should be able to \"train at much higher learning rates\":"
|
||||
"Since things are going so well, let's train for a few more epochs and see how it goes. In fact, let's *increase* the learning rate, since the abstract of the batchnorm paper claimed we should be able to \"train at much higher learning rates\":"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3657,13 +3652,13 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We've seen that convolutions are just a type of matrix multiplication, with two constraints on the weight matrix: some elements are always zero, and some elements are tied (forced to always have the same value). In <<chapter_intro>> we saw the eight requirements from the 1986 book *Parallel Distributed Processing*; one of them was \"A pattern of connectivity among units\". That's exactly what these constraints do: they enforce a certain pattern of connectivity.\n",
|
||||
"We've seen that convolutions are just a type of matrix multiplication, with two constraints on the weight matrix: some elements are always zero, and some elements are tied (forced to always have the same value). In <<chapter_intro>> we saw the eight requirements from the 1986 book *Parallel Distributed Processing*; one of them was \"A pattern of connectivity among units.\" That's exactly what these constraints do: they enforce a certain pattern of connectivity.\n",
|
||||
"\n",
|
||||
"These constraints allow us to use far less parameters in our model, without sacrificing the ability to represent complex visual features. That means we can train deeper models faster, with less over-fitting. Although the universal approximation theorem shows that it should be *possible* to represent anything in a fully connected network in one hidden layer, we've seen now that in *practice* we can train much better models by being thoughtful about network architecture.\n",
|
||||
"These constraints allow us to use far fewer parameters in our model, without sacrificing the ability to represent complex visual features. That means we can train deeper models faster, with less overfitting. Although the universal approximation theorem shows that it should be *possible* to represent anything in a fully connected network in one hidden layer, we've seen now that in *practice* we can train much better models by being thoughtful about network architecture.\n",
|
||||
"\n",
|
||||
"Convolutions are by far the most common pattern of connectivity we see in neural nets (along with regular linear layers, which we refer to as *fully connected*), but it's likely that many more will be discovered.\n",
|
||||
"\n",
|
||||
"Then we have seen how to interpret the activations of layers in the network to see if training is going well or not, and how Batchnorm helps regularizing the training and makes it smoother. In the next chapter, we will use both of those layers to build the most popular architecture in computer vision: residual networks."
|
||||
"We've also seen how to interpret the activations of layers in the network to see whether training is going well or not, and how batchnorm helps regularize the training and makes it smoother. In the next chapter, we will use both of those layers to build the most popular architecture in computer vision: a residual network."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -3679,27 +3674,27 @@
|
||||
"source": [
|
||||
"1. What is a \"feature\"?\n",
|
||||
"1. Write out the convolutional kernel matrix for a top edge detector.\n",
|
||||
"1. Write out the mathematical operation applied by a 3 x 3 kernel to a single pixel in an image.\n",
|
||||
"1. What is the value of a convolutional kernel apply to a 3 x 3 matrix of zeros?\n",
|
||||
"1. What is padding?\n",
|
||||
"1. What is stride?\n",
|
||||
"1. Write out the mathematical operation applied by a 3\\*3 kernel to a single pixel in an image.\n",
|
||||
"1. What is the value of a convolutional kernel apply to a 3\\*3 matrix of zeros?\n",
|
||||
"1. What is \"padding\"?\n",
|
||||
"1. What is \"stride\"?\n",
|
||||
"1. Create a nested list comprehension to complete any task that you choose.\n",
|
||||
"1. What are the shapes of the input and weight parameters to PyTorch's 2D convolution?\n",
|
||||
"1. What is a channel?\n",
|
||||
"1. What are the shapes of the `input` and `weight` parameters to PyTorch's 2D convolution?\n",
|
||||
"1. What is a \"channel\"?\n",
|
||||
"1. What is the relationship between a convolution and a matrix multiplication?\n",
|
||||
"1. What is a convolutional neural network?\n",
|
||||
"1. What is a \"convolutional neural network\"?\n",
|
||||
"1. What is the benefit of refactoring parts of your neural network definition?\n",
|
||||
"1. What is `Flatten`? Where does it need to be included in the MNIST CNN? Why?\n",
|
||||
"1. What does \"NCHW\" mean?\n",
|
||||
"1. Why does the third layer of the MNIST CNN have `7*7*(1168-16)` multiplications?\n",
|
||||
"1. What is a receptive field?\n",
|
||||
"1. What is a \"receptive field\"?\n",
|
||||
"1. What is the size of the receptive field of an activation after two stride 2 convolutions? Why?\n",
|
||||
"1. Run conv-example.xlsx yourself and experiment with \"trace precedents\".\n",
|
||||
"1. Run *conv-example.xlsx* yourself and experiment with *trace precedents*.\n",
|
||||
"1. Have a look at Jeremy or Sylvain's list of recent Twitter \"like\"s, and see if you find any interesting resources or ideas there.\n",
|
||||
"1. How is a color image represented as a tensor?\n",
|
||||
"1. How does a convolution work with a color input?\n",
|
||||
"1. What method can we use to see that data in DataLoaders?\n",
|
||||
"1. Why do we double the number of filters after each stride 2 conv?\n",
|
||||
"1. What method can we use to see that data in `DataLoaders`?\n",
|
||||
"1. Why do we double the number of filters after each stride-2 conv?\n",
|
||||
"1. Why do we use a larger kernel in the first conv with MNIST (with `simple_cnn`)?\n",
|
||||
"1. What information does `ActivationStats` save for each layer?\n",
|
||||
"1. How can we access a learner's callback after training?\n",
|
||||
@ -3710,7 +3705,7 @@
|
||||
"1. What is 1cycle training?\n",
|
||||
"1. What are the benefits of training with a high learning rate?\n",
|
||||
"1. Why do we want to use a low learning rate at the end of training?\n",
|
||||
"1. What is cyclical momentum?\n",
|
||||
"1. What is \"cyclical momentum\"?\n",
|
||||
"1. What callback tracks hyperparameter values during training (along with other information)?\n",
|
||||
"1. What does one column of pixels in the `color_dim` plot represent?\n",
|
||||
"1. What does \"bad training\" look like in `color_dim`? Why?\n",
|
||||
@ -3732,8 +3727,7 @@
|
||||
"source": [
|
||||
"1. What features other than edge detectors have been used in computer vision (especially before deep learning became popular)?\n",
|
||||
"1. There are other normalization layers available in PyTorch. Try them out and see what works best. Learn about why other normalization layers have been developed, and how they differ from batch normalization.\n",
|
||||
"1. Try moving the activation function after the batch normalization layer in `conv`. Does it make a difference? See what you can find out about what order is recommended, and why.\n",
|
||||
"1. Batch normalization isn't defined for a batch size of one, since the standard deviation isn't defined for a single item. "
|
||||
"1. Try moving the activation function after the batch normalization layer in `conv`. Does it make a difference? See what you can find out about what order is recommended, and why."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
172
14_resnet.ipynb
172
14_resnet.ipynb
@ -23,14 +23,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Resnets"
|
||||
"# ResNets"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this chapter, we will build on top of the CNNs (Convolutional Neural Networks) introduced in the previous chapter and explain to you the ResNet (for residual network) architecture. It was introduced in 2015 in [this article](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in image models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
|
||||
"In this chapter, we will build on top of the CNNs introduced in the previous chapter and explain to you the ResNet (residual network) architecture. It was introduced in 2015 by Kaiming He et al. in the article [\"Deep Residual Learning for Image Recognition\"](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in image models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
|
||||
"\n",
|
||||
"We will first show you the basic ResNet as it was first designed, then explain to you what modern tweaks make it more performant. But first, we will need a problem a little bit more difficult than the MNIST dataset, since we are already close to 100% accuracy with a regular CNN on it."
|
||||
]
|
||||
@ -46,9 +46,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It's going to be tough to judge any improvement we do to our models when we are already at an accuracy that is as high as we saw on MNIST in the previous chapter, so we will tackle a tougher image classification problem by going back to Imagenette. We'll stick with small images to keep things reasonably fast.\n",
|
||||
"It's going to be tough to judge any improvements we make to our models when we are already at an accuracy that is as high as we saw on MNIST in the previous chapter, so we will tackle a tougher image classification problem by going back to Imagenette. We'll stick with small images to keep things reasonably fast.\n",
|
||||
"\n",
|
||||
"Let's grab the data--we'll use the already-resized 160px version to make things faster still, and will random crop to 128px:"
|
||||
"Let's grab the data—we'll use the already-resized 160 px version to make things faster still, and will random crop to 128 px:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -103,14 +103,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images. Later on we would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
|
||||
"When we looked at MNIST we were dealing with 28\\*28-pixel images. For Imagenette we are going to be training with 128\\*128-pixel images. Later, we would like to be able to use larger images as well—at least as big as 224\\*224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
|
||||
"\n",
|
||||
"The approach we used was to ensure that there were enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
|
||||
"The approach we used was to ensure that there were enough stride-2convolutions such that the final layer would have a grid size of 1. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so, a matrix of activations for a mini-batch). We could do the same thing for Imagenette, but that's would cause two problems:\n",
|
||||
"\n",
|
||||
"- We are going to need lots of stride two layers to make our grid one by one at the end — perhaps more than we would otherwise choose\n",
|
||||
"- The model will not work on images of any size other than the size we originally trained on.\n",
|
||||
"- We'd need lots of stride-2 layers to make our grid 1\\*1 at the end—perhaps more than we would otherwise choose.\n",
|
||||
"- The model would not work on images of any size other than the size we originally trained on.\n",
|
||||
"\n",
|
||||
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
|
||||
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than 1\\*1. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always took. The most famous example is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only did it not work with images other than those of the same size used in the training set, but it required a lot of memory, because flattening out the convolutional layer resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
|
||||
"\n",
|
||||
"This problem was solved through the creation of *fully convolutional networks*. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:"
|
||||
]
|
||||
@ -128,9 +128,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you see, it is taking the mean over the X and Y axes. This function will always convert a grid of activations into a single activation per image. PyTorch provides a slightly more versatile module called `nn.AdaptiveAvgPool2d`, which averages a grid of activations into whatever sized destination you require (although we nearly always use the size of one).\n",
|
||||
"As you see, it is taking the mean over the x- and y-axes. This function will always convert a grid of activations into a single activation per image. PyTorch provides a slightly more versatile module called `nn.AdaptiveAvgPool2d`, which averages a grid of activations into whatever sized destination you require (although we nearly always use a size of 1).\n",
|
||||
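"\n",
"To make that concrete, here is a minimal sketch (using a random activations tensor, so the values are only for illustration) showing that taking the mean over the last two axes and applying `nn.AdaptiveAvgPool2d(1)` followed by a flatten give the same result:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"x = torch.randn(16, 32, 4, 4)        # a mini-batch of activations: bs x ch x h x w\n",
"manual = x.mean((-1, -2))            # average over the x- and y-axes -> 16 x 32\n",
"pooled = nn.AdaptiveAvgPool2d(1)(x)  # -> 16 x 32 x 1 x 1\n",
"flat = nn.Flatten()(pooled)          # remove the trailing unit axes -> 16 x 32\n",
"print(torch.allclose(manual, flat))  # True\n",
"```\n",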
"\n",
|
||||
"A fully convolutional network, therefore, has a number of convolutional layers, some of which will be stride two, at the end of which is an adaptive average pooling layer, a flatten layer to remove the unit axes, and finally a linear layer. Here is our first fully convolutional network:"
|
||||
"A fully convolutional network, therefore, has a number of convolutional layers, some of which will be stride 2, at the end of which is an adaptive average pooling layer, a flatten layer to remove the unit axes, and finally a linear layer. Here is our first fully convolutional network:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -156,25 +156,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We're going to be replacing the implementation of `block` in the network with other variants in a moment, which is why we're not calling it `conv` any more. We're saving some time by taking advantage of fastai's `ConvLayer` that already provides the functionality of `conv` from the last chapter (plus a lot more!)"
|
||||
"We're going to be replacing the implementation of `block` in the network with other variants in a moment, which is why we're not calling it `conv` any more. We're also saving some time by taking advantage of fastai's `ConvLayer`, which that already provides the functionality of `conv` from the last chapter (plus a lot more!)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> stop: Consider this question: Would this approach makes sense for an optical character recognition (OCR) problem such as MNIST? We see the vast majority of practitioners tackling OCR and similar problems tend to use fully convolutional networks, because that's what nearly everybody learns nowadays. But it really doesn't make any sense! You can't decide whether, for instance, whether a number is a \"3\" or an \"8\" by slicing it into small pieces, jumbling them up, and deciding whether on average each piece looks like a \"3\" or an \"8\". But that's what adaptive average pooling effectively does! Fully convolutional networks are only really a good choice for objects that don't have a single correct orientation or size (i.e. like most natural photos)."
|
||||
"> stop: Consider this question: would this approach makes sense for an optical character recognition (OCR) problem such as MNIST? The vast majority of practitioners tackling OCR and similar problems tend to use fully convolutional networks, because that's what nearly everybody learns nowadays. But it really doesn't make any sense! You can't decide, for instance, whether a number is a 3 or an 8 by slicing it into small pieces, jumbling them up, and deciding whether on average each piece looks like a 3 or an 8. But that's what adaptive average pooling effectively does! Fully convolutional networks are only really a good choice for objects that don't have a single correct orientation or size (e.g., like most natural photos)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing `1 x 1` dimension like we did in our previous model. \n",
|
||||
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height, and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing 1\\*1 dimension like we did in our previous model. \n",
|
||||
"\n",
|
||||
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by half on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
|
||||
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size. For instance, max pooling layers of size 2, which were very popular in older CNNs, reduce the size of our image by half on each dimension by taking the maximum of each 2\\*2 window (with a stride of 2).\n",
|
||||
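"\n",
"For comparison, here is a quick sketch (on a random tensor) of how such fixed-window pooling layers halve the grid size, whereas adaptive average pooling always produces whatever destination size you ask for:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"x = torch.randn(1, 8, 28, 28)\n",
"print(nn.MaxPool2d(2)(x).shape)          # torch.Size([1, 8, 14, 14]): each 2x2 window -> its max\n",
"print(nn.AvgPool2d(2)(x).shape)          # torch.Size([1, 8, 14, 14]): each 2x2 window -> its mean\n",
"print(nn.AdaptiveAvgPool2d(1)(x).shape)  # torch.Size([1, 8, 1, 1]): one value per channel\n",
"```\n",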
"\n",
|
||||
"As before, we can define a `Learner` with our custom model and then train it on the data we grabbed before:"
|
||||
"As before, we can define a `Learner` with our custom model and then train it on the data we grabbed earlier:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -236,7 +236,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`3e-3` is very often a good learning rate for CNNs, and that appears to be the case here too, so let's try that:"
|
||||
"3e-3 is often a good learning rate for CNNs, and that appears to be the case here too, so let's try that:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -312,7 +312,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"That's a pretty good start, considering we have to pick the correct one of ten categories, and we're training from scratch for just 5 epochs! But we can do way better than this using a deeper model. However, just stacking new layers won't really improve our results (you can try and see for yourself!). To work around this problem, ResNets introduce the idea of skip connections. Let's have a look at what it is exactly."
|
||||
"That's a pretty good start, considering we have to pick the correct one of 10 categories, and we're training from scratch for just 5 epochs! We can do way better than this using a deeper mode, but just stacking new layers won't really improve our results (you can try and see for yourself!). To work around this problem, ResNets introduce the idea of *skip connections*. We'll explore those and other aspects of ResNets in the next section."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -326,58 +326,58 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy on Imagenette compared to our previous model, before building a version with all the recent tweaks."
|
||||
"We now have all the pieces we need to build the models we have been using in our computer vision tasks since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy on Imagenette compared to our previous model, before building a version with all the recent tweaks."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Skip-Connections"
|
||||
"### Skip Connections"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In 2015 the authors of the ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
|
||||
"In 2015, the authors of the ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using fewer layers—and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalization issue, but a training issue. As the paper explains:\n",
|
||||
"\n",
|
||||
"> : Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as [previously reported] and thoroughly verified by our experiments.\n",
|
||||
"\n",
|
||||
"They showed the graph in <<resnet_depth>>, with training error on the left, and test on the right."
|
||||
"This phenomenon was illustrated by the graph in <<resnet_depth>>, with training error on the left and test error on the right."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Training of networks of different depth\" width=\"700\" caption=\"Training of networks of different depth\" id=\"resnet_depth\" src=\"images/att_00042.png\">"
|
||||
"<img alt=\"Training of networks of different depth\" width=\"700\" caption=\"Training of networks of different depth (courtesy of Kaiming He et al.)\" id=\"resnet_depth\" src=\"images/att_00042.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As the authors mention here, they are not the first people to have noticed this curious fact. But they were the 1st to make a very important leap:\n",
|
||||
"As the authors mention here, they are not the first people to have noticed this curious fact. But they were the first to make a very important leap:\n",
|
||||
"\n",
|
||||
"> : Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model.\n",
|
||||
"\n",
|
||||
"Being an academic paper, this process written in a rather inaccessible way — but it's actually saying something very simple: start with the 20 layer neural network that is trained well, and add another 36 layers that do nothing at all (for instance, they linear layer with a single weight equal to one, and bias equal to 0). This would be a 56 layer network which does exactly the same thing as the 20 layer network. This shows that there are always deep networks which should be *at least as good* as any shallow network. But for some reason, SGD does not seem able to find them.\n",
|
||||
"AS this is an academic paper this process is described in a rather inaccessible way, but the concept is actually very simple: start with a 20-layer neural network that is trained well, and add another 36 layers that do nothing at all (for instance, they could be linear layers with a single weight equal to 1, and bias equal to 0). The result will be a 56-layer network that does exactly the same thing as the 20-layer network, proving that there are always deep networks that should be *at least as good* as any shallow network. But for some reason, SGD does not seem able to find them.\n",
|
||||
"\n",
|
||||
"> jargon: Identity mapping: a function that just returns its input without changing it at all. Also known as *identity function*.\n",
|
||||
"> jargon: Identity mapping: Returning the input without changing it at all. This process is performed by an _identity function_.\n",
|
||||
"\n",
|
||||
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter which does a 2nd convolution, then relu, then batchnorm. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` for every one of these batchnorm layers to zero? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
|
||||
"Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of `conv(x)` with `x + conv(x)`, where `conv` is the function from the previous chapter that adds a second convolution, then a ReLU, then a batchnorm layer. Furthermore, recall that batchnorm does `gamma*y + beta`. What if we initialized `gamma` to zero for every one of those final batchnorm layers? Then our `conv(x)` for those extra 36 layers will always be equal to zero, which means `x+conv(x)` will always be equal to `x`.\n",
|
||||
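"\n",
"As a quick numerical check (a minimal sketch, not fastai's actual implementation), a convolution followed by a ReLU and a batchnorm layer whose `gamma` weights are zeroed really does leave the input unchanged once it's wrapped in a skip connection:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"import torch.nn.functional as F\n",
"\n",
"conv = nn.Conv2d(8, 8, 3, padding=1)\n",
"bn = nn.BatchNorm2d(8)\n",
"nn.init.zeros_(bn.weight)     # gamma = 0, so the batchnorm output is all zeros at init\n",
"\n",
"x = torch.randn(4, 8, 16, 16)\n",
"y = x + bn(F.relu(conv(x)))   # the skip connection\n",
"print(torch.allclose(x, y))   # True: at initialization this block is an identity mapping\n",
"```\n",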
"\n",
|
||||
"What has that gained us, then? The key thing is that those 36 extra layers, as they stand, are an *identity mapping*, but they have *parameters*, which means they are *trainable*. So, we can start with our best 20 layer model, add these 36 extra layers which initially do nothing at all, and then *fine tune the whole 56 layer model*. If those extra 36 layers can be useful, then they can learn parameters to do so!\n",
|
||||
"What has that gained us? The key thing is that those 36 extra layers, as they stand, are an *identity mapping*, but they have *parameters*, which means they are *trainable*. So, we can start with our best 20-layer model, add these 36 extra layers which initially do nothing at all, and then *fine-tune the whole 56-layer model*. Those extra 36 layers can then learn the parameters that make them most useful.\n",
|
||||
"\n",
|
||||
"The ResNet paper actually proposed a variant of this, which is to instead \"skip over\" every 2nd convolution, so effectively we get `x+conv2(conv1(x))`. This is shown by the diagram in <<resnet_block>> (from the paper)."
|
||||
"The ResNet paper actually proposed a variant of this, which is to instead \"skip over\" every second convolution, so effectively we get `x+conv2(conv1(x))`. This is shown by the diagram in <<resnet_block>> (from the paper)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"A simple ResNet block\" width=\"331\" caption=\"A simple ResNet block\" id=\"resnet_block\" src=\"images/att_00043.png\">"
|
||||
"<img alt=\"A simple ResNet block\" width=\"331\" caption=\"A simple ResNet block (courtesy of Kaiming He et al.)\" id=\"resnet_block\" src=\"images/att_00043.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -386,24 +386,24 @@
|
||||
"source": [
|
||||
"That arrow on the right is just the `x` part of `x+conv2(conv1(x))`, and is known as the *identity branch* or *skip connection*. The path on the left is the `conv2(conv1(x))` part. You can think of the identity path as providing a direct route from the input to the output.\n",
|
||||
"\n",
|
||||
"In a ResNet, we don't actually train it by first training a smaller number of layers, and then add new layers on the end and fine-tune. Instead, we use ResNet blocks (like the above) throughout the CNN, initialized from scratch in the usual way, and trained with SGD in the usual way. We rely on the skip connections to make the network easier to train for SGD."
|
||||
"In a ResNet, we don't actually proceed by first training a smaller number of layers, and then adding new layers on the end and fine-tuning. Instead, we use ResNet blocks like the one in <<resnet_block>> throughout the CNN, initialized from scratch in the usual way, and trained with SGD in the usual way. We rely on the skip connections to make the network easier to train with SGD."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There's another (largely equivalent) way to think of these \"ResNet blocks\". This is how the paper describes it:\n",
|
||||
"There's another (largely equivalent) way to think of these ResNet blocks. This is how the paper describes it:\n",
|
||||
"\n",
|
||||
"> : Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.\n",
|
||||
"\n",
|
||||
"Again, this is rather inaccessible prose—so let's try to restate it in plain English! If the outcome of a given layer is `x`, when using a ResNet block that return `y = x+block(x)`, we're not asking the block to predict `y`, we are asking it to predict the difference between `y-x`. So the job of those blocks isn't to predict certain features anymore, but a little extra step that will minimize the error between `x` and the desired `y`. ResNet is, therefore, good at learning about slight differences between doing nothing and some other feature that the layer learns. Since we predict residuals (reminder: \"residual\" is predictions minus targets), this is why those kinds of models were named ResNets.\n",
|
||||
"Again, this is rather inaccessible prose—so let's try to restate it in plain English! If the outcome of a given layer is `x`, when using a ResNet block that returns `y = x+block(x)` we're not asking the block to predict `y`, we are asking it to predict the difference between `y` and `x`. So the job of those blocks isn't to predict certain features, but to minimize the error between `x` and the desired `y`. A ResNet is, therefore, good at learning about slight differences between doing nothing and passing though a block of two convolutional layers (with trainable weights). This is how these models got their name: they're predicting residuals (reminder: \"residual\" is prediction minus target).\n",
|
||||
"\n",
|
||||
"One key concept that both of these two ways of thinking about ResNets share is the idea of \"easy to learn\". This is an important theme. Recall the universal approximation theorem, which states that a sufficiently large network *can* learn anything. This is still true. But there turns out to be a very important difference between what a network *can learn* in principle, and what it is *easy for it to learn* under realistic data and training regimes. Many of the advances in neural networks over the last decade have been like the ResNet block: the result of realizing how to make something which was always possible actually feasible.\n",
|
||||
"One key concept that both of these two ways of thinking about ResNets share is the idea of ease of learning. This is an important theme. Recall the universal approximation theorem, which states that a sufficiently large network can learn anything. This is still true, but there turns out to be a very important difference between what a network *can learn* in principle, and what it is *easy for it to learn* with realistic data and training regimes. Many of the advances in neural networks over the last decade have been like the ResNet block: the result of realizing how to make something yjay was always possible actually feasible.\n",
|
||||
"\n",
|
||||
"> note: The original paper didn't actually do the trick of using zero for the initial value of gamma in the batchnorm layer; that came a couple of years later. So the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to \"navigate through\" the skip connections did indeed make it train better. Adding the batchnorm gamma init trick made the models train at even higher learning rates.\n",
|
||||
"> note: True Identity Path: The original paper didn't actually do the trick of using zero for the initial value of `gamma` in the last batchnorm layer of each block; that came a couple of years later. So, the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to \"navigate through\" the skip connections did indeed make it train better. Adding the batchnorm `gamma` init trick made the models train at even higher learning rates.\n",
|
||||
"\n",
|
||||
"Here's the definition of a simple ResNet block (where `norm_type=NormType.BatchZero` causes fastai to init the `gamma` weights of that batchnorm layer to zero):"
|
||||
"Here's the definition of a simple ResNet block (where `norm_type=NormType.BatchZero` causes fastai to init the `gamma` weights of the last batchnorm layer to zero):"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -425,27 +425,20 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One problem with this, however, is that it can't handle a stride other than `1`, and it requires that `ni==nf`. Stop for a moment, to think carefully about why this is...\n",
|
||||
"There are two problems with this, however: it can't handle a stride other than 1, and it requires that `ni==nf`. Stop for a moment to think carefully about why this is.\n",
|
||||
"\n",
|
||||
"The issue is that with a stride of, say, `2`, on one of the convolutions, the grid size of the output activations will be half the size on each axis of the input. So then we can't add that back to `x` in `forward` because `x` and the output activations have different dimensions. The same basic issue occurs if `ni!=nf`: the shapes of the input and output connections won't allow us to add them together.\n",
|
||||
"The issue is that with a stride of, say, 2 on one of the convolutions, the grid size of the output activations will be half the size on each axis of the input. So then we can't add that back to `x` in `forward` because `x` and the output activations have different dimensions. The same basic issue occurs if `ni!=nf`: the shapes of the input and output connections won't allow us to add them together.\n",
|
||||
"\n",
|
||||
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer which takes 2x2 patches from the input, and replaces them with their average.\n",
|
||||
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer that takes 2\\*2 patches from the input and replaces them with their average.\n",
|
||||
"\n",
|
||||
"Changing the number of channels can be done by using a convolution. We want this skip connection to be as close to an identity map as possible, however, which means making this convolution as simple as possible. The simplest possible convolution is one where the kernel size is `1`. That means that the kernel is size `ni*nf*1*1`, so it's only doing a dot product over the channels of each input pixel--it's not combining across pixels at all. This kind of *1x1 convolution* is very widely used in modern CNNs, so take a moment to think about how it works."
|
||||
"Changing the number of channels can be done by using a convolution. We want this skip connection to be as close to an identity map as possible, however, which means making this convolution as simple as possible. The simplest possible convolution is one where the kernel size is 1. That means that the kernel is size `ni*nf*1*1`, so it's only doing a dot product over the channels of each input pixel—it's not combining across pixels at all. This kind of *1x1 convolution* is very widely used in modern CNNs, so take a moment to think about how it works."
|
||||
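"\n",
"Here's a rough sketch of how those two tricks fit together (hypothetical standalone code, not fastai's actual `ResBlock`): an average pooling layer halves the grid size of the input, and a 1\\*1 convolution changes its channel count, so it can be added to the output of the main branch:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"import torch.nn.functional as F\n",
"\n",
"ni, nf, stride = 32, 64, 2\n",
"convs = nn.Sequential(\n",
"    nn.Conv2d(ni, nf, 3, stride=stride, padding=1), nn.BatchNorm2d(nf), nn.ReLU(),\n",
"    nn.Conv2d(nf, nf, 3, padding=1), nn.BatchNorm2d(nf))\n",
"pool = nn.AvgPool2d(2, ceil_mode=True)  # halves the grid size of the skip connection\n",
"idconv = nn.Conv2d(ni, nf, 1)           # 1x1 conv: changes ni -> nf channels\n",
"\n",
"x = torch.randn(8, ni, 28, 28)\n",
"out = F.relu(convs(x) + idconv(pool(x)))  # both branches are now 8 x 64 x 14 x 14\n",
"print(out.shape)\n",
"```\n",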
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> question: Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: 1x1 convolution: A convolution with a kernel size of one."
|
||||
"> jargon: 1x1 convolution: A convolution with a kernel size of 1."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -489,7 +482,7 @@
|
||||
"source": [
|
||||
"Note that we're using the `noop` function here, which simply returns its input unchanged (*noop* is a computer science term that stands for \"no operation\"). In this case, `idconv` does nothing at all if `nf==nf`, and `pool` does nothing if `stride==1`, which is what we wanted in our skip connection.\n",
|
||||
"\n",
|
||||
"Also, you'll see that we've removed relu (`act_cls=None`) from the final convolution in `convs` and from `idconv`, and moved it to *after* we add the skip connection. The thinking behind this is that the whole ResNet block is like a layer, and you want your activation to be *after* your layer.\n",
|
||||
"Also, you'll see that we've removed the ReLU (`act_cls=None`) from the final convolution in `convs` and from `idconv`, and moved it to *after* we add the skip connection. The thinking behind this is that the whole ResNet block is like a layer, and you want your activation to be after your layer.\n",
|
||||
"\n",
|
||||
"Let's replace our `block` with `ResBlock`, and try it out:"
|
||||
]
|
||||
@ -577,7 +570,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It's not much better. But the whole point of this was to allow us to train *deeper* models, and we're not really taking advantage of that yet. To create a deeper model that's, say, twice as deep, all we need to do is replace our `block` with two `ResBlock`s in a row:"
|
||||
"It's not much better. But the whole point of this was to allow us to train *deeper* models, and we're not really taking advantage of that yet. To create a model that's, say, twice as deep, all we need to do is replace our `block` with two `ResBlock`s in a row:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -666,23 +659,23 @@
|
||||
"source": [
|
||||
"Now we're making good progress!\n",
|
||||
"\n",
|
||||
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be a strong practitioner, not just a theoretician.\n",
|
||||
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting points for the breakthroughs were experimental observations: observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kinds of networks can be trained, in the case of the ResNet authors. This ability to design and analyze thoughtful experiments, or even just to see an unexpected result, say \"Hmmm, that's interesting,\" and then, most importantly, set about figuring out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be a strong practitioner, not just a theoretician.\n",
|
||||
"\n",
|
||||
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
|
||||
"Since the ResNet was introduced, it's been widely studied and applied to many domains. One of the most interesting papers, published in 2018, is Hao Li et al.'s [\"Visualizing the Loss Landscape of Neural Nets\"](https://arxiv.org/abs/1712.09913). It shows that using skip connections helps smooth the loss function, which makes training easier as it avoids falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, illustrating the difference between the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Impact of ResNet on loss landscape\" width=\"600\" caption=\"Impact of ResNet on loss landscape (curtesy of Hao Li et al.)\" id=\"resnet_surface\" src=\"images/att_00044.png\">"
|
||||
"<img alt=\"Impact of ResNet on loss landscape\" width=\"600\" caption=\"Impact of ResNet on loss landscape (courtesy of Hao Li et al.)\" id=\"resnet_surface\" src=\"images/att_00044.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This first model is already good, but further research has discovered more tricks we can apply to make it better."
|
||||
"Our first model is already good, but further research has discovered more tricks we can apply to make it better. We'll look at those next."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -696,21 +689,21 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In [Bag of Tricks for Image Classification with Convolutional Neural Networks](https://arxiv.org/abs/1812.01187), the authors study different variations of the ResNet architecture that come at almost no additional cost in terms of number of parameters or computation. By using this tweaked ResNet50 architecture and Mixup they achieve 94.6% top-5 accuracy on ImageNet, instead of 92.2% with a regular ResNet50 without Mixup. This result is better than regular ResNet models that are twice as deep (and twice as slow, and much more likely to overfit)."
|
||||
"In [\"Bag of Tricks for Image Classification with Convolutional Neural Networks\"](https://arxiv.org/abs/1812.01187), Tong He et al. study different variations of the ResNet architecture that come at almost no additional cost in terms of number of parameters or computation. By using a tweaked ResNet-50 architecture and Mixup they achieved 94.6% top-5 accuracy on ImageNet, in comparison to 92.2% with a regular ResNet-50 without Mixup. This result is better than that achieved by regular ResNet models that are twice as deep (and twice as slow, and much more likely to overfit)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: top-5 accuracy: A metric testing how often the label we want is in the top 5 predictions of our model. It was used in the Imagenet competition, since many images contained multiple objects, or contained objects that could be easily confused or may even have been mislabeled with a similar label. In these situations, looking at top-1 accuracy may be inappropriate. However, recently CNNs have been getting so good that top-5 accuracy is nearly 100%, so some researchers are using top-1 accuracy for Imagenet too now."
|
||||
"> jargon: top-5 accuracy: A metric testing how often the label we want is in the top 5 predictions of our model. It was used in the ImageNet competition because many of the images contained multiple objects, or contained objects that could be easily confused or may even have been mislabeled with a similar label. In these situations, looking at top-1 accuracy may be inappropriate. However, recently CNNs have been getting so good that top-5 accuracy is nearly 100%, so some researchers are using top-1 accuracy for ImageNet too now."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our previous implementation, in that instead of just starting with ResNet blocks, it begins with a few convolutional layers followed by a max pooling layer. This is what the first layers look like:"
|
||||
"We'll use this tweaked version as we scale up to the full ResNet, because it's substantially better. It differs a little bit from our previous implementation, in that instead of just starting with ResNet blocks, it begins with a few convolutional layers followed by a max pooling layer. This is what the first layers, called the *stem* of the network, look like:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -784,7 +777,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: Stem: The stem of a CNN are its first few layers. Generally, the stem has a different structure to the main body of the CNN."
|
||||
"> jargon: Stem: The first few layers of a CNN. Generally, the stem has a different structure than the main body of the CNN."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -793,13 +786,13 @@
|
||||
"source": [
|
||||
"The reason that we have a stem of plain convolutional layers, instead of ResNet blocks, is based on a very important insight about all deep convolutional neural networks: the vast majority of the computation occurs in the early layers. Therefore, we should keep the early layers as fast and simple as possible.\n",
|
||||
"\n",
|
||||
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128 pixel input image. If it is a stride one convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4x4 or even 2x2. So there are far fewer kernel applications to do.\n",
|
||||
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128-pixel input image. If it is a stride-1 convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4\\*4 or even 2\\*2, so there are far fewer kernel applications to do.\n",
|
||||
"\n",
|
||||
"On the other hand, the first layer convolution only has three input features, and 32 output features. Since it is a 3x3 kernel, this is 3×32×3×3 = 864 parameters in the weights. On the other hand, the last convolution will be 256 input features and 512 output features, which will be 1,179,648 weights! So the first layers contain vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
|
||||
"On the other hand, the first-layer convolution only has 3 input features and 32 output features. Since it is a 3\\*3 kernel, this is 3×32×3×3 = 864 parameters in the weights. But the last convolution will have 256 input features and 512 output features, resulting in 1,179,648 weights! So the first layers contain the vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
|
||||
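"\n",
"Here's a quick check of that arithmetic (ignoring the bias terms), counting the weights of those two convolutions directly:\n",
"\n",
"```python\n",
"from torch import nn\n",
"\n",
"first = nn.Conv2d(3, 32, 3)     # the stem's first convolution: 3 -> 32 channels, 3x3 kernel\n",
"last = nn.Conv2d(256, 512, 3)   # the body's last convolution: 256 -> 512 channels, 3x3 kernel\n",
"print(first.weight.numel())     # 864 = 3*32*3*3\n",
"print(last.weight.numel())      # 1179648 = 256*512*3*3\n",
"```\n",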
"\n",
|
||||
"A ResNet block takes more computation than a plain convolutional block, since (in the stride two case) a ResNet block has three convolutions and a pooling layer. That's why we want to have plain convolutions to start off our ResNet.\n",
|
||||
"A ResNet block takes more computation than a plain convolutional block, since (in the stride-2 case) a ResNet block has three convolutions and a pooling layer. That's why we want to have plain convolutions to start off our ResNet.\n",
|
||||
"\n",
|
||||
"We're now ready to show the implementation of a modern ResNet, with the \"bag of tricks\". The ResNet use four groups of ResNet blocks, with 64, 128, 256 then 512 filters. Each groups starts with a stride 2 block, except for the first one, since it's just after a `MaxPooling` layer."
|
||||
"We're now ready to show the implementation of a modern ResNet, with the \"bag of tricks.\" It uses four groups of ResNet blocks, with 64, 128, 256, then 512 filters. Each group starts with a stride-2 block, except for the first one, since it's just after a `MaxPooling` layer:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -831,9 +824,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `_make_layer` function is just there to create a series of `n_layers` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section. For now, it'll be `1`, so it doesn't do anything.)\n",
|
||||
"The `_make_layer` function is just there to create a series of `n_layers` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now; we'll discuss it in the next section. For now, it'll be `1`, so it doesn't do anything.)\n",
|
||||
"\n",
|
||||
"The various versions of the models (ResNet 18, 34, 50, etc) just change the number of blocks in each of those groups. This is the definition of a ResNet18:"
|
||||
"The various versions of the models (ResNet-18, -34, -50, etc.) just change the number of blocks in each of those groups. This is the definition of a ResNet-18:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -926,9 +919,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Even although we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem.\n",
|
||||
"Even though we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem.\n",
|
||||
"\n",
|
||||
"To make our model deeper without taking too much compute or memory, the ResNet paper introduced another kind of block for ResNets with a depth of 50 or more, using something called a bottleneck. "
|
||||
"To make our model deeper without taking too much compute or memory, we can use another kind of layer introduced by the ResNet paper for ResNets with a depth of 50 or more: the bottleneck layer. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -942,23 +935,23 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Instead of stacking two convolutions with a kernel size of 3, *bottleneck layers* use three different convolutions: two 1x1 (at the beginning and the end) and one 3x3, as shown in the right of <<resnet_compare>> the ResNet paper (using an example of 64 channel output, comparing to the regular ResBlock on the left)."
|
||||
"Instead of stacking two convolutions with a kernel size of 3, bottleneck layers use three different convolutions: two 1\\*1 (at the beginning and the end) and one 3\\*3, as shown on the right in <<resnet_compare>>."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Comparison of regular and bottleneck ResNet blocks\" width=\"550\" caption=\"Comparison of regular and bottleneck ResNet blocks\" id=\"resnet_compare\" src=\"images/att_00045.png\">"
|
||||
"<img alt=\"Comparison of regular and bottleneck ResNet blocks\" width=\"550\" caption=\"Comparison of regular and bottleneck ResNet blocks (courtesy of Kaiming He et al.)\" id=\"resnet_compare\" src=\"images/att_00045.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Why is that useful? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
|
||||
"Why is that useful? 1\\*1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see in the illustration, the number of filters in and out is 4 times higher (256 instead of 64) diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
|
||||
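"\n",
"To see the trade-off concretely, here is a rough weight count (a sketch that ignores batchnorm and bias parameters) comparing the regular block of two 3\\*3 convolutions on 64 channels with the bottleneck block working on 256 channels, as in <<resnet_compare>>:\n",
"\n",
"```python\n",
"regular = 2 * (64*64*3*3)                          # two 3x3 convs, 64 channels in and out\n",
"bottleneck = 256*64*1*1 + 64*64*3*3 + 64*256*1*1   # 1x1 down, 3x3 in the middle, 1x1 back up\n",
"print(regular, bottleneck)  # 73728 69632: similar cost, but 4x more channels in and out\n",
"```\n",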
"\n",
|
||||
"Let's try replacing our ResBlock with this bottleneck design:"
|
||||
"Let's try replacing our `ResBlock` with this bottleneck design:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -978,7 +971,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We'll use this to create a ResNet50, which uses this bottleneck block, and uses group sizes of `(3,4,6,3)`. We now need to pass `4` in to the `expansion` parameter of `ResNet`, since we need to start with four times less channels, and we'll end with four times more channels.\n",
|
||||
"We'll use this to create a ResNet-50 with group sizes of `(3,4,6,3)`. We now need to pass `4` in to the `expansion` parameter of `ResNet`, since we need to start with four times less channels and we'll end with four times more channels.\n",
|
||||
"\n",
|
||||
"Deeper networks like this don't generally show improvements when training for only 5 epochs, so we'll bump it up to 20 epochs this time to make the most of our bigger model. And to really get great results, let's use bigger images too:"
|
||||
]
|
||||
@ -996,7 +989,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We don't have to do anything to account for the larger 224 pixel images--thanks to our fully convolutional network, it just works. This is also why we were able to do *progressive resizing* earlier in the book--the models we used were fully convolutional, so we were even able to fine-tune models trained with different sizes."
|
||||
"We don't have to do anything to account for the larger 224-pixel images; thanks to our fully convolutional network, it just works. This is also why we were able to do *progressive resizing* earlier in the book—the models we used were fully convolutional, so we were even able to fine-tune models trained with different sizes. We can now train our model and see the effects:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1189,7 +1182,7 @@
|
||||
"source": [
|
||||
"We're getting a great result now! Try adding Mixup, and then training this for a hundred epochs while you go get lunch. You'll have yourself a very accurate image classifier, trained from scratch.\n",
|
||||
"\n",
|
||||
"The bottleneck design we've shown here is only used in ResNet50, 101, and 152 in all official models we've seen. ResNet18 and 34 use the non-bottleneck design seen in the previous section. However, we've noticed that the bottleneck layer generally works better even for the shallower networks. This just goes to show that the little details in papers tend to stick around for years, even if they're actually not quite the best design! Questioning assumptions and \"stuff everyone knows\" is always a good idea, because this is still a new field, and there's lots of details that aren't always done well."
|
||||
"The bottleneck design we've shown here is typically only used in ResNet-50, -101, and -152 models. ResNet-18 and -34 models usually use the non-bottleneck design seen in the previous section. However, we've noticed that the bottleneck layer generally works better even for the shallower networks. This just goes to show that the little details in papers tend to stick around for years, even if they're actually not quite the best design! Questioning assumptions and \"stuff everyone knows\" is always a good idea, because this is still a new field, and there are lots of details that aren't always done well."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1203,7 +1196,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You have know seen how the models we were using in computer vision since the first chapter are built, using skip connections to allow deeper models to be trained. Even if there has been a lot of research in better architectures, they all use one version or another of this trick, to make a direct path from the input to the end of the network. When using transfer learning, resnet is the pretrained model. In the next chapter, we will look at the final details of how the models we actually used were built from it."
|
||||
"You have now seen how the models we have been using for computer vision since the first chapter are built, using skip connections to allow deeper models to be trained. Even if there has been a lot of research into better architectures, they all use one version or another of this trick, to make a direct path from the input to the end of the network. When using transfer learning, the ResNet is the pretrained model. In the next chapter, we will look at the final details of how the models we actually used were built from it."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1217,27 +1210,28 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. How did we get to a single vector of activations in the convnets used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
|
||||
"1. How did we get to a single vector of activations in the CNNs used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
|
||||
"1. What do we do for Imagenette instead?\n",
|
||||
"1. What is adaptive pooling?\n",
|
||||
"1. What is average pooling?\n",
|
||||
"1. What is \"adaptive pooling\"?\n",
|
||||
"1. What is \"average pooling\"?\n",
|
||||
"1. Why do we need `Flatten` after an adaptive average pooling layer?\n",
|
||||
"1. What is a skip connection?\n",
|
||||
"1. What is a \"skip connection\"?\n",
|
||||
"1. Why do skip connections allow us to train deeper models?\n",
|
||||
"1. What does <<resnet_depth>> show? How did that lead to the idea of skip connections?\n",
|
||||
"1. What is an identity mapping?\n",
|
||||
"1. What is the basic equation for a ResNet block (ignoring batchnorm and relu layers)?\n",
|
||||
"1. What do ResNets have to do with \"residuals\"?\n",
|
||||
"1. How do we deal with the skip connection when there is a stride 2 convolution? How about when the number of filters changes?\n",
|
||||
"1. How can we express a 1x1 convolution in terms of a vector dot product?\n",
|
||||
"1. What is \"identity mapping\"?\n",
|
||||
"1. What is the basic equation for a ResNet block (ignoring batchnorm and ReLU layers)?\n",
|
||||
"1. What do ResNets have to do with residuals?\n",
|
||||
"1. How do we deal with the skip connection when there is a stride-2 convolution? How about when the number of filters changes?\n",
|
||||
"1. How can we express a 1\\*1 convolution in terms of a vector dot product?\n",
|
||||
"1. Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?\n",
|
||||
"1. What does the `noop` function return?\n",
|
||||
"1. Explain what is shown in <<resnet_surface>>.\n",
|
||||
"1. When is top-5 accuracy a better metric than top-1 accuracy?\n",
|
||||
"1. What is the stem of a CNN?\n",
|
||||
"1. Why use plain convs in the CNN stem, instead of ResNet blocks?\n",
|
||||
"1. What is the \"stem\" of a CNN?\n",
|
||||
"1. Why do we use plain convolutions in the CNN stem, instead of ResNet blocks?\n",
|
||||
"1. How does a bottleneck block differ from a plain ResNet block?\n",
|
||||
"1. Why is a bottleneck block faster?\n",
|
||||
"1. How do fully convolution nets (and nets with adaptive pooling in general) allow for progressive resizing?"
|
||||
"1. How do fully convolutional nets (and nets with adaptive pooling in general) allow for progressive resizing?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1251,9 +1245,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride 2 layers). How does it compare to a network without such a pooling layer?\n",
|
||||
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1x1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
|
||||
"1. Write a \"top 5 accuracy\" function using plain PyTorch or plain Python.\n",
|
||||
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride-2 layers). How does it compare to a network without such a pooling layer?\n",
|
||||
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1\\*1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
|
||||
"1. Write a \"top-5 accuracy\" function using plain PyTorch or plain Python.\n",
|
||||
"1. Train a model on Imagenette for more epochs, with and without label smoothing. Take a look at the Imagenette leaderboards and see how close you can get to the best results shown. Read the linked pages describing the leading approaches."
|
||||
]
|
||||
},
|
||||
|
@ -28,11 +28,11 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We are now in the exciting position that we can fully understand the entire architectures that we have been using for our state-of-the-art models for computer vision, natural language processing, and tabular analysis. In this chapter, we're going to fill in all the missing details on how fastai's application models work and show you how to build the models they use.\n",
|
||||
"We are now in the exciting position that we can fully understand the architectures that we have been using for our state-of-the-art models for computer vision, natural language processing, and tabular analysis. In this chapter, we're going to fill in all the missing details on how fastai's application models work and show you how to build the models they use.\n",
|
||||
"\n",
|
||||
"We will also go back to the custom data preprocessing pipeline we saw in <<chapter_midlevel_data>> for Siamese networks and show you how you can use the components in the fastai library to build custom pretrained models for new tasks.\n",
|
||||
"\n",
|
||||
"We will go voer each application in turn, starting with computer vision."
|
||||
"We'll start with computer vision."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -46,7 +46,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In computer vision, we used the functions `cnn_learner` and `unet_learner` to build our models, depending on the task. Let's see how they start from a pretrained ResNet to build the `Learner` objects we have used in part 1 and 2 of this book."
|
||||
"For computer vision application we use the functions `cnn_learner` and `unet_learner` to build our models, depending on the task. In this section we'll explore how to build the `Learner` objects we used in Parts 1 and 2 of this book."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -60,9 +60,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's take a look at what happens when we use the `cnn_learner` function. We pass it an architecture to use for the *body* of the network. Most of the time we use a resnet, which we already know how to create, so we don't need to delve into that any further. Pretrained weights are downloaded as required and loaded into the resnet.\n",
|
||||
"Let's take a look at what happens when we use the `cnn_learner` function. We begin by passing this function an architecture to use for the *body* of the network. Most of the time we use a ResNet, which you already know how to create, so we don't need to delve into that any further. Pretrained weights are downloaded as required and loaded into the ResNet.\n",
|
||||
"\n",
|
||||
"Then, for transfer learning, the network needs to be *cut*. This refers to slicing off the final layer, which is only responsible for ImageNet-specific categorisation. In fact, we do not only slice off this layer, but everything from the adaptive average pooling layer onwards. The reason for this will become clear in just a moment. Since different architectures might use different types of pooling layers, or even completely different kinds of *heads*, we don't just search for the adaptive pooling layer to decide where to cut the pretrained model. Instead, we have a dictionary of information that is used for each model to know where its body ends, and its head starts. We call this `model_meta` — here it is for resnet 50:"
|
||||
"Then, for transfer learning, the network needs to be *cut*. This refers to slicing off the final layer, which is only responsible for ImageNet-specific categorization. In fact, we do not slice off only this layer, but everything from the adaptive average pooling layer onwards. The reason for this will become clear in just a moment. Since different architectures might use different types of pooling layers, or even completely different kinds of *heads*, we don't just search for the adaptive pooling layer to decide where to cut the pretrained model. Instead, we have a dictionary of information that is used for each model to determine where its body ends, and its head starts. We call this `model_meta`—here it is for resnet-50:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -91,14 +91,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: Body and Head: The \"head\" of a neural net is the part that is specialized for a particular task. For a convnet, it's generally the part after the adaptive average pooling layer. The \"body\" is everything else, and includes the \"stem\" (which we learned about in <<chapter_resnet>>)."
|
||||
"> jargon: Body and Head: The \"head\" of a neural net is the part that is specialized for a particular task. For a CNN, it's generally the part after the adaptive average pooling layer. The \"body\" is everything else, and includes the \"stem\" (which we learned about in <<chapter_resnet>>)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If we take all of the layers prior to the cutpoint of `-2`, we get the part of the model which fastai will keep for transfer learning. Now, we put on our new head. This is created using the function create_head:"
|
||||
"If we take all of the layers prior to the cut point of `-2`, we get the part of the model that fastai will keep for transfer learning. Now, we put on our new head. This is created using the function `create_head`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -161,23 +161,23 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With this function you can choose how many additional linear layers are added to the end, how much dropout to use after each one, and what kind of pooling to use. By default, fastai will apply both average pooling, and max pooling, and will concatenate the two together (this is the `AdaptiveConcatPool2d` layer). This is not a particularly common approach, but it was developed independently at fastai and at other research labs in recent years, and tends to provide some small improvement over using just average pooling.\n",
|
||||
"With this function you can choose how many additional linear layers are added to the end, how much dropout to use after each one, and what kind of pooling to use. By default, fastai will apply both average pooling, and max pooling, and will concatenate the two together (this is the `AdaptiveConcatPool2d` layer). This is not a particularly common approach, but it was developed independently at fastai and other research labs in recent years, and tends to provide some small improvement over using just average pooling.\n",
|
||||
"\n",
|
||||
"Fastai is also a bit different to most libraries in adding two linear layers, rather than one, by default in the CNN head. The reason for this is that transfer learning can still be useful even, as we have seen, and transferring two very different domains to the pretrained model. However, just using a single linear layer is unlikely to be enough. So we have found that using two linear layers can allow transfer learning to be used more quickly and easily, in more situations."
|
||||
"fastai is a bit different from most libraries in that by default it adds two linear layers, rather than one, in the CNN head. The reason for this is that transfer learning can still be useful even, as we have seen, when transferring the pretrained model to very different domains. However, just using a single linear layer is unlikely to be enough in these cases; we have found that using two linear layers can allow transfer learning to be used more quickly and easily, in more situations."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> note: One parameter to create_head that is worth looking at is bn_final. Setting this to true will cause a batchnorm layer to be added as your final layer. This can be useful in helping your model to more easily ensure that it is scaled appropriately for your output activations. We haven't seen this approach published anywhere, as yet, but we have found that it works well in practice, wherever we have used it."
|
||||
"> note: One Last Batchnorm?: One parameter to `create_head` that is worth looking at is `bn_final`. Setting this to `true` will cause a batchnorm layer to be added as your final layer. This can be useful in helping your model scale appropriately for your output activations. We haven't seen this approach published anywhere as yet, but we have found that it works well in practice wherever we have used it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's now have a look at what `unet_learner` did in the segmentation problem we showed in <<chapter_intro>>."
|
||||
"Let's now take a look at what `unet_learner` did in the segmentation problem we showed in <<chapter_intro>>."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -191,51 +191,49 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One of the most interesting architectures in deep learning is the one that we used for segmentation in <<chapter_intro>>. Segmentation is a challenging task, because the output required is really an image, or a pixel grid, containing the predicted label for every pixel. There are other tasks which share a similar basic design, such as increasing the resolution of an image (*super resolution*), adding colour to a black-and-white image (*colorization*), or converting a photo into a synthetic painting (*style transfer*)--these tasks are covered by an online chapter of this book, so be sure to check it out after you've read this chapter. In each case, we are starting with an image, and converting it to some other image of the same dimensions or aspect ratio, but with the pixels converted in some way. We refer to these as *generative vision models*.\n",
|
||||
"One of the most interesting architectures in deep learning is the one that we used for segmentation in <<chapter_intro>>. Segmentation is a challenging task, because the output required is really an image, or a pixel grid, containing the predicted label for every pixel. There are other tasks that share a similar basic design, such as increasing the resolution of an image (*super-resolution*), adding color to a black-and-white image (*colorization*), or converting a photo into a synthetic painting (*style transfer*)—these tasks are covered by an [online](https://book.fast.ai/) chapter of this book, so be sure to check it out after you've read this chapter. In each case, we are starting with an image and converting it to some other image of the same dimensions or aspect ratio, but with the pixels altered in some way. We refer to these as *generative vision models*.\n",
|
||||
"\n",
|
||||
"The way we do this is to start with the exact same approach to developing a CNN head as we saw above. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. And then we replace that with our custom head which does the generative task.\n",
|
||||
"The way we do this is to start with the exact same approach to developing a CNN head as we saw in the previous problem. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. Then we replace those layers with our custom head, which does the generative task.\n",
|
||||
"\n",
|
||||
"There was a lot of handwaving in that last sentence! How on earth do we create a CNN head which generates an image? If we start with, say, a 224 pixel input image, then at the end of the resnet body we will have a 7x7 grid of convolutional activations. How can we convert that into a 224 pixel segmentation mask?\n",
|
||||
"There was a lot of handwaving in that last sentence! How on earth do we create a CNN head that generates an image? If we start with, say, a 224-pixel input image, then at the end of the ResNet body we will have a 7\\*7 grid of convolutional activations. How can we convert that into a 224-pixel segmentation mask?\n",
|
||||
"\n",
|
||||
"We will (naturally) do this with a neural network! So we need some kind of layer which can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7x7 grid with four pixels in a 2x2 square. Each of those four pixels would have the same value — this is known as nearest neighbour interpolation. PyTorch provides a layer which does this for us, so we could create a head which contains stride one convolutional layers (along with batchnorm and ReLU as usual) interspersed with 2x2 nearest neighbour interpolation layers. In fact, you could try this now! See if you can create a custom head designed like this, and see if it can complete the CamVid segmentation task. You should find that you get some reasonable results, although it won't be as good as our <<chapter_intro>> results.\n",
|
||||
"Naturally, we do this with a neural network! So we need some kind of layer that can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7\\*7 grid with four pixels in a 2\\*2 square. Each of those four pixels will have the same value—this is known as *nearest neighbor interpolation*. PyTorch provides a layer that does this for us, so one option is to create a head that contains stride-1 convolutional layers (along with batchnorm and ReLU layers as usual) interspersed with 2\\*2 nearest neighbor interpolation layers. In fact, you can try this now! See if you can create a custom head designed like this, and try it on the CamVid segmentation task. You should find that you get some reasonable results, although they won't be as good as our <<chapter_intro>> results.\n",
|
||||
"\n",
|
||||
"Another approach is to replace the nearest neighbour and convolution combination with a *transposed convolution* otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between every pixel in the input. This is easiest to see with a picture — <<transp_conv>> shows a diagram from the excellent convolutional arithmetic paper we have seen before, showing a 3x3 transposed convolution applied to a 3x3 image."
|
||||
"Another approach is to replace the nearest neighbor and convolution combination with a *transposed convolution*, otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between all the pixels in the input. This is easiest to see with a picture—<<transp_conv>> shows a diagram from the excellent [convolutional arithmetic paper](https://arxiv.org/abs/1603.07285) we discussed in <<chapter_convolutions>>, showing a 3\\*3 transposed convolution applied to a 3\\*3 image."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"A transposed convolution\" width=\"815\" caption=\"A transposed convolution\" id=\"transp_conv\" src=\"images/att_00051.png\">"
|
||||
"<img alt=\"A transposed convolution\" width=\"815\" caption=\"A transposed convolution (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"transp_conv\" src=\"images/att_00051.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you see, the result of this is to increase the size of the input. You can try this out now, by using fastai's ConvLayer class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.\n",
|
||||
"As you see, the result of this is to increase the size of the input. You can try this out now by using fastai's `ConvLayer` class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.\n",
|
||||
"\n",
|
||||
"Neither of these approaches, however, works really well. The problem is that our 7x7 grid simply doesn't have enough information to create a 224x224 pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use skip connections, like in a resnet, but skipping from the activations in the body of the resnet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This is known as a U-Net, and it was developed in the 2015 paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). Although the paper focussed on medical applications, the U-Net has revolutionized all kinds of generation vision models.\n",
|
||||
"\n",
|
||||
"<<unet>> shows the U-Net architecture (form the paper). "
|
||||
"Neither of these approaches, however, works really well. The problem is that our 7\\*7 grid simply doesn't have enough information to create a 224\\*224-pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use *skip connections*, like in a ResNet, but skipping from the activations in the body of the ResNet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This approach, illustrated in <<unet>>, was developed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the 2015 paper [\"U-Net: Convolutional Networks for Biomedical Image Segmentation\"](https://arxiv.org/abs/1505.04597). Although the paper focused on medical applications, the U-Net has revolutionized all kinds of generative vision models."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"The U-net architecture\" width=\"630\" caption=\"The U-net architecture\" id=\"unet\" src=\"images/att_00052.png\">"
|
||||
"<img alt=\"The U-Net architecture\" width=\"630\" caption=\"The U-Net architecture (courtesy of Olaf Ronneberger, Philipp Fischer, and Thomas Brox)\" id=\"unet\" src=\"images/att_00052.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2x2 max pooling instead of stride 2 convolutions, since this paper was written before ResNets came along) and it shows the transposed convolutional layers on the right (they're called \"up-conv\" in this picture). Then then extra skip connections are shown as grey arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-net\" when you see this picture!\n",
|
||||
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2\\*2 max pooling instead of stride-2 convolutions, since this paper was written before ResNets came along) and the transposed convolutional (\"up-conv\") layers on the right. Then extra skip connections are shown as gray arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-Net!\"\n",
|
||||
"\n",
|
||||
"With this architecture, the input to the transposed convolutions is not just the lower resolution grid in the preceding layer, but also the higher resolution grid in the resnet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class which auto-generates an architecture of the right size based on the data provided.\n",
|
||||
"With this architecture, the input to the transposed convolutions is not just the lower-resolution grid in the preceding layer, but also the higher-resolution grid in the ResNet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class that autogenerates an architecture of the right size based on the data provided.\n",
|
||||
"\n",
|
||||
"Let's focus now on an example where we leverage the fastai library to write a custom model:"
|
||||
"Let's focus now on an example where we leverage the fastai library to write a custom model."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -300,9 +298,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's go back to the input pipeline we set up in <<chapter_midlevel_data>> for a Siamese network. If your remember, it consisted of pair of images with the label being `True` or `False`, depending on if they were in the same class or not.\n",
|
||||
"Let's go back to the input pipeline we set up in <<chapter_midlevel_data>> for a Siamese network. If you remember, it consisted of pair of images with the label being `True` or `False`, depending on if they were in the same class or not.\n",
|
||||
"\n",
|
||||
"Using what we just saw, let's build a custom model for this task and train it. How? We will use a pretrained architecture and pass our two images throught it. Then we can concatenate the results and send them to a custom head that will return two predictions. In terms of modules, this looks like this:"
|
||||
"Using what we just saw, let's build a custom model for this task and train it. How? We will use a pretrained architecture and pass our two images through it. Then we can concatenate the results and send them to a custom head that will return two predictions. In terms of modules, this looks like this:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -324,7 +322,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To create our encoder, we just need to take a pretrained model and cut it, as we explained before. The function `create_body` does that for us, we just have to pass it the place we want to cut. If we remember our look in the dictionary of metadata for pretrained models, the cut value for a resnet is -2:"
|
||||
"To create our encoder, we just need to take a pretrained model and cut it, as we explained before. The function `create_body` does that for us; we just have to pass it the place where we want to cut. As we saw earlier, per the dictionary of metadata for pretrained models, the cut value for a resnet is `-2`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -340,7 +338,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Then we can create our head. A look at the encoder tells us the last layer has 512 features, so this head will need to receive `512*4`. Why 4? First we have to multiply by 2 because we have two images. Then we need a second multiplication by 2 because of our concat-pool trick."
|
||||
"Then we can create our head. A look at the encoder tells us the last layer has 512 features, so this head will need to receive `512*4`. Why 4? First we have to multiply by 2 because we have two images. Then we need a second multiplication by 2 because of our concat-pool trick. So we create the head as follows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -356,7 +354,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"With our encoder and head, we can now build our model."
|
||||
"With our encoder and head, we can now build our model:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -372,7 +370,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Before using `Learner`, we have two more things to define. First, we must define the loss function we want to use. It's regular cross entropy, but since our targets are booleans, we need to convert them to integers or PyTorch will throw an error."
|
||||
"Before using `Learner`, we have two more things to define. First, we must define the loss function we want to use. It's regular cross-entropy, but since our targets are Booleans, we need to convert them to integers or PyTorch will throw an error:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -389,9 +387,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"More importantly, to take full advantage of transfer learning, we have to define a custom *splitter*. A splitter is a function that tells the fastai library how to split the model in several parameter groups. This is what is used behind the scenes not only train the head of a model when we do transfer learning. \n",
|
||||
"More importantly, to take full advantage of transfer learning, we have to define a custom *splitter*. A splitter is a function that tells the fastai library how to split the model into parameter groups. These are used behind the scenes to train only the head of a model when we do transfer learning. \n",
|
||||
"\n",
|
||||
"Here we want two parameter groups: one for the encoder and one for the head. We can thus define the following splitter (`params` is jsut a function that returns all parameters of a given module):"
|
||||
"Here we want two parameter groups: one for the encoder and one for the head. We can thus define the following splitter (`params` is just a function that returns all parameters of a given module):"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -408,7 +406,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Then we can define our `Learner` by passing the data, model, loss function, spliiter and any metric we want. Since we are not using a convenience function from fastai for transfer learning (like `cnn_learner`), we have to call `learn.freeze` manually. This will make sure only the last parameter groups (in this case, the head) is trained. "
|
||||
"Then we can define our `Learner` by passing the data, model, loss function, splitter, and any metric we want. Since we are not using a convenience function from fastai for transfer learning (like `cnn_learner`), we have to call `learn.freeze` manually. This will make sure only the last parameter group (in this case, the head) is trained:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -495,7 +493,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Before unfreezing and training a bit more with discriminative learning rates..."
|
||||
"Before unfreezing and fine-tuning the whole model a bit more with discriminative learning rates (that is: a lower learning rate for the body and a higher one for the head):"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -565,14 +563,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"94.8% is very good when we remember a classifier trained the same way (with no data augmentation) had an arror rate of 7%."
|
||||
"94.8\\% is very good when we remember a classifier trained the same way (with no data augmentation) had an error rate of 7%."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now that we've seen how to create complete state of the art computer vision models, let's move on to NLP."
|
||||
"Now that we've seen how to create complete state-of-the-art computer vision models, let's move on to NLP."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -586,38 +584,31 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Converting an AWD-LSTM language model into a transfer learning classifier as we have done in <<chapter_nlp>> follows a very similar process to what we saw for `cnn_learner` in the first section of this chapter. We do not need a \"meta\" dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is to select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.\n",
|
||||
"Converting an AWD-LSTM language model into a transfer learning classifier, as we did in <<chapter_nlp>>, follows a very similar process to what we did with `cnn_learner` in the first section of this chapter. We do not need a \"meta\" dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.\n",
|
||||
"\n",
|
||||
"To create a classifier from this we use an approach described in the ULMFiT paper as \"BPTT for Text Classification (BPT3C)\". The paper describes this:"
|
||||
"To create a classifier from this we use an approach described in the [ULMFiT paper](https://arxiv.org/abs/1801.06146) as \"BPTT for Text Classification (BPT3C)\":"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> : In order to make fine-tuning a classifier for large documents feasible, we propose BPTT for Text Classification (BPT3C): We divide the document into fixed-length batches of size `b`. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences."
|
||||
"> : We divide the document into fixed-length batches of size *b*. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In practice, what this is saying is that the classifier contains a for loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models — but this time, we do not pool over CNN grid cells, but over RNN sequences.\n",
|
||||
"In other words, the classifier contains a `for` loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models—but this time, we do not pool over CNN grid cells, but over RNN sequences.\n",
|
||||
"\n",
|
||||
"For this for loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own label. However, it's very likely that those texts won't have the good taste of being all of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
|
||||
"For this `for` loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own labels. However, it's very likely that those texts won't all be of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
|
||||
"\n",
|
||||
"That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greater length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid having an extreme case where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation) we alter the randomness by making sure texts of comparable size are put together. It will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely random.\n",
|
||||
"That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greatest length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid extreme cases where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation), we alter the randomness by making sure texts of comparable size are put together. The texts will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely so.\n",
|
||||
"\n",
|
||||
"This is done automatically behind the scenes by the fastai library when creating our `DataLoaders`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The last application where we used fastai's model we haven't shown you yet is tabular."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
@ -629,9 +620,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Finally, we can look at `fastai.tabular` models. (We don't need to look at collaborative filtering separately, since we've already seen that these models are just tabular models, or use dot product, which we've implemented earlier from scratch.\n",
|
||||
"Finally, let's take a look at `fastai.tabular` models. (We don't need to look at collaborative filtering separately, since we've already seen that these models are just tabular models, or use the dot product approach, which we've implemented earlier from scratch.)\n",
|
||||
"\n",
|
||||
"Here is the forward method for `TabularModel`:\n",
|
||||
"Here is the `forward` method for `TabularModel`:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"if self.n_emb != 0:\n",
|
||||
@ -644,7 +635,7 @@
|
||||
"return self.layers(x)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"We won't show `__init__` here, since it's not that interesting, but will look at each line of code in turn in `forward`:"
|
||||
"We won't show `__init__` here, since it's not that interesting, but will look at each line of code in `forward`in turn. The first line:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -655,92 +646,90 @@
|
||||
"if self.n_emb != 0:\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"This is just testing whether there are any embeddings to deal with — we can skip this section if we only have continuous variables.\n",
|
||||
"is just testing whether there are any embeddings to deal with—we can skip this section if we only have continuous variables. `self.embeds` contains the embedding matrices, so this gets the activations of each:\n",
|
||||
" \n",
|
||||
"```python\n",
|
||||
" x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"`self.embeds` contains the embedding matrices, so this gets the activations of each…\n",
|
||||
"and concatenates them into a single tensor:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
" x = torch.cat(x, 1)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"…and concatenates them into a single tensor.\n",
|
||||
"Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
" x = self.emb_drop(x)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value.\n",
|
||||
"Now we test whether there are any continuous variables to deal with:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"if self.n_cont != 0:\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Now we test whether there are any continuous variables to deal with.\n",
|
||||
"They are passed through a batchnorm layer:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
" x_cont = self.bn_cont(x_cont)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"They are passed through a batchnorm layer…\n",
|
||||
"and concatenated with the embedding activations, if there were any:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
" x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"…and concatenated with the embedding activations, if there were any.\n",
|
||||
"Finally, this is passed through the linear layers (each of which includes batchnorm, if `use_bn` is `True`, and dropout, if `ps` is set to some value or list of values):\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"return self.layers(x)\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Finally, this is passed through the linear layers (each of which includes batchnorm, if `use_bn` is True, and dropout, if `ps` is set to some value or list of values).\n",
|
||||
"\n",
|
||||
"Congratulations! Now, you know every single piece of the architectures used in the fastai library!"
|
||||
"Congratulations! Now you know every single piece of the architectures used in the fastai library!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Wrapping up Architectures"
|
||||
"## Wrapping Up Architectures"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you can see, the details of deep learning architectures need not scare you now. You can look inside the code of fastai and PyTorch and see just what is going on. More importantly, try to understand why that is going on. Take a look at the papers that are being implemented in the code, and try to see how the code matches up to the algorithms that are described.\n",
|
||||
"As you can see, the details of deep learning architectures need not scare you now. You can look inside the code of fastai and PyTorch and see just what is going on. More importantly, try to understand *why* it's going on. Take a look at the papers that are being referenced in the code, and try to see how the code matches up to the algorithms that are described.\n",
|
||||
"\n",
|
||||
"Now that we have investigated all of the pieces of a model and the data that is passed into it, we can consider what this means for practical deep learning. If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. The reason that deep learning is not straightforward is because your data, memory, and time is limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.\n",
|
||||
"Now that we have investigated all of the pieces of a model and the data that is passed into it, we can consider what this means for practical deep learning. If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. But the reason that deep learning is not straightforward is because your data, memory, and time are typically limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.\n",
|
||||
"\n",
|
||||
"So step one is to get to the point that you can overfit. Then, the question is how to reduce that overfitting. <<reduce_overfit>> shows how we recommend prioritising the steps from there."
|
||||
"So, step one is to get to the point where you can overfit. Then the question is how to reduce that overfitting. <<reduce_overfit>> shows how we recommend prioritizing the steps from there."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img alt=\"Steps to reducing over-fitting\" width=\"400\" caption=\"Steps to reducing over-fitting\" id=\"reduce_overfit\" src=\"images/att_00047.png\">"
|
||||
"<img alt=\"Steps to reducing overfitting\" width=\"400\" caption=\"Steps to reducing overfitting\" id=\"reduce_overfit\" src=\"images/att_00047.png\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Many practitioners when faced with an overfitting model start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularisation. Using a smaller model should be absolutely the last step you take, unless your model is taking up too much time or memory. Reducing the size of your model as reducing the ability of your model to learn subtle relationships in your data.\n",
|
||||
"Many practitioners, when faced with an overfitting model, start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularization. Using a smaller model should be absolutely the last step you take, unless training your model is taking up too much time or memory. Reducing the size of your model reduces the ability of your model to learn subtle relationships in your data.\n",
|
||||
"\n",
|
||||
"Instead, your first step should be to seek to create more data. That could involve adding more labels to data that you already have in your organisation, finding additional tasks that your model could be asked to solve (or to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data via using more or different data augmentation. Thanks to the development of mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.\n",
|
||||
"Instead, your first step should be to seek to *create more data*. That could involve adding more labels to data that you already have, finding additional tasks that your model could be asked to solve (or, to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data by using more or different data augmentation techniques. Thanks to the development of Mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.\n",
|
||||
"\n",
|
||||
"Once you've got as much data as you think you can reasonably get a hold of, and are using it as effectively as possible by taking advantage of all of the labels that you can find, and all of the augmentation that make sense, if you are still overfitting and you should think about using more generalisable architectures. For instance, adding batch normalisation may improve generalisation.\n",
|
||||
"Once you've got as much data as you think you can reasonably get hold of, and are using it as effectively as possible by taking advantage of all the labels that you can find and doing all the augmentation that makes sense, if you are still overfitting you should think about using more generalizable architectures. For instance, adding batch normalization may improve generalization.\n",
|
||||
"\n",
|
||||
"If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularisation. Generally speaking, adding dropout to the last layer or two will do a good job of regularising your model. However, as we learnt from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help regularise even better. Generally speaking, a larger model with more regularisation is more flexible, and can therefore be more accurate than a smaller model with less regularisation.\n",
|
||||
"If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularization. Generally speaking, adding dropout to the last layer or two will do a good job of regularizing your model. However, as we learned from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help even more. Generally speaking, a larger model with more regularization is more flexible, and can therefore be more accurate than a smaller model with less regularization.\n",
|
||||
"\n",
|
||||
"Only after considering all of these options would be recommend that you try using smaller versions of your architectures."
|
||||
"Only after considering all of these options would we recommend that you try using a smaller version of your architecture."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -754,24 +743,24 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. What is the head of a neural net?\n",
|
||||
"1. What is the body of a neural net?\n",
|
||||
"1. What is the \"head\" of a neural net?\n",
|
||||
"1. What is the \"body\" of a neural net?\n",
|
||||
"1. What is \"cutting\" a neural net? Why do we need to do this for transfer learning?\n",
|
||||
"1. What is \"model_meta\"? Try printing it to see what's inside.\n",
|
||||
"1. What is `model_meta`? Try printing it to see what's inside.\n",
|
||||
"1. Read the source code for `create_head` and make sure you understand what each line does.\n",
|
||||
"1. Look at the output of create_head and make sure you understand why each layer is there, and how the create_head source created it.\n",
|
||||
"1. Figure out how to change the dropout, layer size, and number of layers created by create_cnn, and see if you can find values that result in better accuracy from the pet recognizer.\n",
|
||||
"1. What does AdaptiveConcatPool2d do?\n",
|
||||
"1. What is nearest neighbor interpolation? How can it be used to upsample convolutional activations?\n",
|
||||
"1. What is a transposed convolution? What is another name for it?\n",
|
||||
"1. Look at the output of `create_head` and make sure you understand why each layer is there, and how the `create_head` source created it.\n",
|
||||
"1. Figure out how to change the dropout, layer size, and number of layers created by `create_cnn`, and see if you can find values that result in better accuracy from the pet recognizer.\n",
|
||||
"1. What does `AdaptiveConcatPool2d` do?\n",
|
||||
"1. What is \"nearest neighbor interpolation\"? How can it be used to upsample convolutional activations?\n",
|
||||
"1. What is a \"transposed convolution\"? What is another name for it?\n",
|
||||
"1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.\n",
|
||||
"1. Draw the u-net architecture.\n",
|
||||
"1. What is BPTT for Text Classification (BPT3C)?\n",
|
||||
"1. Draw the U-Net architecture.\n",
|
||||
"1. What is \"BPTT for Text Classification\" (BPT3C)?\n",
|
||||
"1. How do we handle different length sequences in BPT3C?\n",
|
||||
"1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.\n",
|
||||
"1. How is `self.layers` defined in `TabularModel`?\n",
|
||||
"1. What are the five steps for preventing over-fitting?\n",
|
||||
"1. Why don't we reduce architecture complexity before trying other approaches to preventing over-fitting?"
|
||||
"1. Why don't we reduce architecture complexity before trying other approaches to preventing overfitting?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -786,10 +775,10 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.\n",
|
||||
"1. Try switching between AdaptiveConcatPool2d and AdaptiveAvgPool2d in a CNN head and see what difference it makes.\n",
|
||||
"1. Write your own custom splitter to create a separate parameter group for every resnet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
|
||||
"1. Read the online chapter about generative image models, and create your own colorizer, super resolution model, or style transfer model.\n",
|
||||
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on Camvid."
|
||||
"1. Try switching between `AdaptiveConcatPool2d` and `AdaptiveAvgPool2d` in a CNN head and see what difference it makes.\n",
|
||||
"1. Write your own custom splitter to create a separate parameter group for every ResNet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
|
||||
"1. Read the online chapter about generative image models, and create your own colorizer, super-resolution model, or style transfer model.\n",
|
||||
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on CamVid."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -1444,7 +1444,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Boot Strapping a Collaborative Filtering Model"
|
||||
"## Bootstrapping a Collaborative Filtering Model"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -843,7 +843,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Use the mid-level API to prepare the data in `DataLoaders` on the pets dataset. On the adult dataset (used in chapter 1).\n",
|
||||
"1. Use the mid-level API to prepare the data in `DataLoaders` on your own datasets. Try this with the Pet dataset and the Adult dataset from Chapter 1.\n",
|
||||
"1. Look at the Siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new type of items. Implement it in your own project."
|
||||
]
|
||||
},
@ -1815,14 +1815,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### A Note about Twitter"
"### A Note About Twitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Colour Images"
"## Color Images"
]
},
{
@ -2590,27 +2590,27 @@
"source": [
"1. What is a \"feature\"?\n",
"1. Write out the convolutional kernel matrix for a top edge detector.\n",
"1. Write out the mathematical operation applied by a 3 x 3 kernel to a single pixel in an image.\n",
"1. What is the value of a convolutional kernel apply to a 3 x 3 matrix of zeros?\n",
"1. What is padding?\n",
"1. What is stride?\n",
"1. Write out the mathematical operation applied by a 3\*3 kernel to a single pixel in an image.\n",
"1. What is the value of a convolutional kernel apply to a 3\*3 matrix of zeros?\n",
"1. What is \"padding\"?\n",
"1. What is \"stride\"?\n",
"1. Create a nested list comprehension to complete any task that you choose.\n",
"1. What are the shapes of the input and weight parameters to PyTorch's 2D convolution?\n",
"1. What is a channel?\n",
"1. What are the shapes of the `input` and `weight` parameters to PyTorch's 2D convolution?\n",
"1. What is a \"channel\"?\n",
"1. What is the relationship between a convolution and a matrix multiplication?\n",
"1. What is a convolutional neural network?\n",
"1. What is a \"convolutional neural network\"?\n",
"1. What is the benefit of refactoring parts of your neural network definition?\n",
"1. What is `Flatten`? Where does it need to be included in the MNIST CNN? Why?\n",
"1. What does \"NCHW\" mean?\n",
"1. Why does the third layer of the MNIST CNN have `7*7*(1168-16)` multiplications?\n",
"1. What is a receptive field?\n",
"1. What is a \"receptive field\"?\n",
"1. What is the size of the receptive field of an activation after two stride 2 convolutions? Why?\n",
"1. Run conv-example.xlsx yourself and experiment with \"trace precedents\".\n",
"1. Run *conv-example.xlsx* yourself and experiment with *trace precedents*.\n",
"1. Have a look at Jeremy or Sylvain's list of recent Twitter \"like\"s, and see if you find any interesting resources or ideas there.\n",
"1. How is a color image represented as a tensor?\n",
"1. How does a convolution work with a color input?\n",
"1. What method can we use to see that data in DataLoaders?\n",
"1. Why do we double the number of filters after each stride 2 conv?\n",
"1. What method can we use to see that data in `DataLoaders`?\n",
"1. Why do we double the number of filters after each stride-2 conv?\n",
"1. Why do we use a larger kernel in the first conv with MNIST (with `simple_cnn`)?\n",
"1. What information does `ActivationStats` save for each layer?\n",
"1. How can we access a learner's callback after training?\n",
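Several questions above touch on PyTorch's 2D convolution shape conventions (NCHW input, `weight` shaped `[out_channels, in_channels, kH, kW]`) and the effect of padding and stride; a quick sketch with made-up sizes:

```python
import torch
import torch.nn.functional as F

x = torch.randn(64, 3, 28, 28)   # NCHW: N=64 images, C=3 channels, H=W=28
w = torch.randn(16, 3, 3, 3)     # 16 output filters, 3 input channels, 3x3 kernel

out = F.conv2d(x, w, stride=1, padding=1)
print(out.shape)                 # torch.Size([64, 16, 28, 28]) -- padding=1 keeps H and W

out2 = F.conv2d(x, w, stride=2, padding=1)
print(out2.shape)                # torch.Size([64, 16, 14, 14]) -- stride 2 halves H and W
```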
@ -2621,7 +2621,7 @@
"1. What is 1cycle training?\n",
"1. What are the benefits of training with a high learning rate?\n",
"1. Why do we want to use a low learning rate at the end of training?\n",
"1. What is cyclical momentum?\n",
"1. What is \"cyclical momentum\"?\n",
"1. What callback tracks hyperparameter values during training (along with other information)?\n",
"1. What does one column of pixels in the `color_dim` plot represent?\n",
"1. What does \"bad training\" look like in `color_dim`? Why?\n",
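For the 1cycle and cyclical momentum questions, plain PyTorch's `OneCycleLR` scheduler can serve as a rough illustration of the schedule; this is not fastai's implementation, and the model and step count are placeholders:

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, total_steps=100,
    pct_start=0.25,        # fraction of steps spent increasing the learning rate
    base_momentum=0.85,    # momentum falls to this while the learning rate is high...
    max_momentum=0.95,     # ...and climbs back as the learning rate decays
)
for step in range(100):
    opt.step()             # in real training, loss.backward() would come before this
    sched.step()
```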
@ -2643,8 +2643,7 @@
"source": [
"1. What features other than edge detectors have been used in computer vision (especially before deep learning became popular)?\n",
"1. There are other normalization layers available in PyTorch. Try them out and see what works best. Learn about why other normalization layers have been developed, and how they differ from batch normalization.\n",
"1. Try moving the activation function after the batch normalization layer in `conv`. Does it make a difference? See what you can find out about what order is recommended, and why.\n",
"1. Batch normalization isn't defined for a batch size of one, since the standard deviation isn't defined for a single item. "
"1. Try moving the activation function after the batch normalization layer in `conv`. Does it make a difference? See what you can find out about what order is recommended, and why."
]
},
{
@ -16,7 +16,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Resnets"
"# ResNets"
]
},
{
@ -237,7 +237,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Skip-Connections"
"### Skip Connections"
]
},
{
@ -829,27 +829,28 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. How did we get to a single vector of activations in the convnets used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
"1. How did we get to a single vector of activations in the CNNs used for MNIST in previous chapters? Why isn't that suitable for Imagenette?\n",
"1. What do we do for Imagenette instead?\n",
"1. What is adaptive pooling?\n",
"1. What is average pooling?\n",
"1. What is \"adaptive pooling\"?\n",
"1. What is \"average pooling\"?\n",
"1. Why do we need `Flatten` after an adaptive average pooling layer?\n",
"1. What is a skip connection?\n",
"1. What is a \"skip connection\"?\n",
"1. Why do skip connections allow us to train deeper models?\n",
"1. What does <<resnet_depth>> show? How did that lead to the idea of skip connections?\n",
"1. What is an identity mapping?\n",
"1. What is the basic equation for a ResNet block (ignoring batchnorm and relu layers)?\n",
"1. What do ResNets have to do with \"residuals\"?\n",
"1. How do we deal with the skip connection when there is a stride 2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1x1 convolution in terms of a vector dot product?\n",
"1. What is \"identity mapping\"?\n",
"1. What is the basic equation for a ResNet block (ignoring batchnorm and ReLU layers)?\n",
"1. What do ResNets have to do with residuals?\n",
"1. How do we deal with the skip connection when there is a stride-2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1\*1 convolution in terms of a vector dot product?\n",
"1. Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?\n",
"1. What does the `noop` function return?\n",
"1. Explain what is shown in <<resnet_surface>>.\n",
"1. When is top-5 accuracy a better metric than top-1 accuracy?\n",
"1. What is the stem of a CNN?\n",
"1. Why use plain convs in the CNN stem, instead of ResNet blocks?\n",
"1. What is the \"stem\" of a CNN?\n",
"1. Why do we use plain convolutions in the CNN stem, instead of ResNet blocks?\n",
"1. How does a bottleneck block differ from a plain ResNet block?\n",
"1. Why is a bottleneck block faster?\n",
"1. How do fully convolution nets (and nets with adaptive pooling in general) allow for progressive resizing?"
"1. How do fully convolutional nets (and nets with adaptive pooling in general) allow for progressive resizing?"
]
},
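As a rough illustration of the skip-connection questions above, here is a plain-PyTorch ResNet-style block showing the `x + convs(x)` idea and one common way (not the only one, and differing in detail from fastai's `ResBlock`) to handle a stride-2 path or a change in the number of filters: average pooling plus a 1x1 convolution on the identity path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(ni, nf, stride=1, ks=3):
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks//2, bias=False),
        nn.BatchNorm2d(nf))

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        self.convs = nn.Sequential(
            conv_bn(ni, nf, stride), nn.ReLU(), conv_bn(nf, nf))
        # identity path: pool if the stride shrinks the grid, 1x1 conv if the channel count changes
        self.pool = nn.AvgPool2d(2, ceil_mode=True) if stride != 1 else nn.Identity()
        self.idconv = nn.Conv2d(ni, nf, 1) if ni != nf else nn.Identity()

    def forward(self, x):
        return F.relu(self.convs(x) + self.idconv(self.pool(x)))

x = torch.randn(8, 64, 28, 28)
print(ResBlock(64, 128, stride=2)(x).shape)   # torch.Size([8, 128, 14, 14])
```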
{
@ -863,9 +864,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride 2 layers). How does it compare to a network without such a pooling layer?\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1x1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. Write a \"top 5 accuracy\" function using plain PyTorch or plain Python.\n",
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride-2 layers). How does it compare to a network without such a pooling layer?\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1\*1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. Write a \"top-5 accuracy\" function using plain PyTorch or plain Python.\n",
"1. Train a model on Imagenette for more epochs, with and without label smoothing. Take a look at the Imagenette leaderboards and see how close you can get to the best results shown. Read the linked pages describing the leading approaches."
]
},
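One possible shape of the "top-5 accuracy" exercise in plain PyTorch, assuming `preds` holds raw per-class scores of shape `[n_items, n_classes]` and `targs` holds class indices:

```python
import torch

def top5_accuracy(preds, targs):
    top5 = preds.topk(5, dim=1).indices              # [n_items, 5] highest-scoring classes
    correct = (top5 == targs.unsqueeze(1)).any(dim=1)  # is the target anywhere in the top 5?
    return correct.float().mean()

preds = torch.randn(100, 10)
targs = torch.randint(0, 10, (100,))
print(top5_accuracy(preds, targs))
```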
@ -367,7 +367,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrapping up Architectures"
"## Wrapping Up Architectures"
]
},
{
@ -381,24 +381,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is the head of a neural net?\n",
"1. What is the body of a neural net?\n",
"1. What is the \"head\" of a neural net?\n",
"1. What is the \"body\" of a neural net?\n",
"1. What is \"cutting\" a neural net? Why do we need to do this for transfer learning?\n",
"1. What is \"model_meta\"? Try printing it to see what's inside.\n",
"1. What is `model_meta`? Try printing it to see what's inside.\n",
"1. Read the source code for `create_head` and make sure you understand what each line does.\n",
"1. Look at the output of create_head and make sure you understand why each layer is there, and how the create_head source created it.\n",
"1. Figure out how to change the dropout, layer size, and number of layers created by create_cnn, and see if you can find values that result in better accuracy from the pet recognizer.\n",
"1. What does AdaptiveConcatPool2d do?\n",
"1. What is nearest neighbor interpolation? How can it be used to upsample convolutional activations?\n",
"1. What is a transposed convolution? What is another name for it?\n",
"1. Look at the output of `create_head` and make sure you understand why each layer is there, and how the `create_head` source created it.\n",
"1. Figure out how to change the dropout, layer size, and number of layers created by `create_cnn`, and see if you can find values that result in better accuracy from the pet recognizer.\n",
"1. What does `AdaptiveConcatPool2d` do?\n",
"1. What is \"nearest neighbor interpolation\"? How can it be used to upsample convolutional activations?\n",
"1. What is a \"transposed convolution\"? What is another name for it?\n",
"1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.\n",
"1. Draw the u-net architecture.\n",
"1. What is BPTT for Text Classification (BPT3C)?\n",
"1. Draw the U-Net architecture.\n",
"1. What is \"BPTT for Text Classification\" (BPT3C)?\n",
"1. How do we handle different length sequences in BPT3C?\n",
"1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.\n",
"1. How is `self.layers` defined in `TabularModel`?\n",
"1. What are the five steps for preventing over-fitting?\n",
"1. Why don't we reduce architecture complexity before trying other approaches to preventing over-fitting?"
"1. Why don't we reduce architecture complexity before trying other approaches to preventing overfitting?"
]
},
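A few of the questions above have compact plain-PyTorch illustrations: roughly what `AdaptiveConcatPool2d` computes (concatenated adaptive average and max pooling), and two ways to upsample activations, nearest-neighbor interpolation and a transposed convolution. The shapes below are invented for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

acts = torch.randn(8, 512, 7, 7)                     # activations from a CNN body

# roughly what AdaptiveConcatPool2d does: concatenate avg-pool and max-pool along channels
pooled = torch.cat([F.adaptive_avg_pool2d(acts, 1),
                    F.adaptive_max_pool2d(acts, 1)], dim=1)
print(pooled.shape)                                  # torch.Size([8, 1024, 1, 1])

# nearest-neighbor interpolation doubles the grid size without any learned weights
up_nn = F.interpolate(acts, scale_factor=2, mode="nearest")
print(up_nn.shape)                                   # torch.Size([8, 512, 14, 14])

# a transposed convolution also upsamples, but learns its weights
up_tc = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)(acts)
print(up_tc.shape)                                   # torch.Size([8, 256, 14, 14])
```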
{
@ -413,10 +413,10 @@
"metadata": {},
"source": [
"1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.\n",
"1. Try switching between AdaptiveConcatPool2d and AdaptiveAvgPool2d in a CNN head and see what difference it makes.\n",
"1. Write your own custom splitter to create a separate parameter group for every resnet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
"1. Read the online chapter about generative image models, and create your own colorizer, super resolution model, or style transfer model.\n",
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on Camvid."
"1. Try switching between `AdaptiveConcatPool2d` and `AdaptiveAvgPool2d` in a CNN head and see what difference it makes.\n",
"1. Write your own custom splitter to create a separate parameter group for every ResNet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
"1. Read the online chapter about generative image models, and create your own colorizer, super-resolution model, or style transfer model.\n",
"1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on CamVid."
]
},
{