This commit is contained in:
Sylvain Gugger 2020-05-19 16:56:41 -07:00
parent 4b1345a068
commit cd1aa1f758
20 changed files with 497 additions and 510 deletions

View File

@ -2864,7 +2864,7 @@
"1. What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?\n",
"1. What do you need in order to train a model?\n",
"1. How could a feedback loop impact the rollout of a predictive policing model?\n",
"1. Do we always have to use 224\\*224-pixel images with the cat recognition model?\n",
"1. Do we always have to use 224×224-pixel images with the cat recognition model?\n",
"1. What is the difference between classification and regression?\n",
"1. What is a validation set? What is a test set? Why do we need them?\n",
"1. What will fastai do if you don't provide a validation set?\n",

View File

@ -1489,7 +1489,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Perhaps the most important attribute of a tensor is its *shape*. This tells you the length of each axis. In this case, we can see that we have 6,131 images, each of size 28\\*28 pixels. There is nothing specifically about this tensor that says that the first axis is the number of images, the second is the height, and the third is the width—the semantics of a tensor are entirely up to us, and how we construct it. As far as PyTorch is concerned, it is just a bunch of numbers in memory.\n",
"Perhaps the most important attribute of a tensor is its *shape*. This tells you the length of each axis. In this case, we can see that we have 6,131 images, each of size 28×28 pixels. There is nothing specifically about this tensor that says that the first axis is the number of images, the second is the height, and the third is the width—the semantics of a tensor are entirely up to us, and how we construct it. As far as PyTorch is concerned, it is just a bunch of numbers in memory.\n",
"\n",
"The *length* of a tensor's shape is its rank:"
]
@ -2093,7 +2093,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's good to get in the habit of checking shapes as you go. Here we see two tensors, one representing the 3s validation set of 1,010 images of size 28\\*28, and one representing the 7s validation set of 1,028 images of size 28\\*28.\n",
"It's good to get in the habit of checking shapes as you go. Here we see two tensors, one representing the 3s validation set of 1,010 images of size 28×28, and one representing the 7s validation set of 1,028 images of size 28×28.\n",
"\n",
"We ultimately want to write a function, `is_3`, that will decide if an arbitrary image is a 3 or a 7. It will do this by deciding which of our two \"ideal digits\" this arbitrary image is closer to. For that we need to define a notion of distance—that is, a function that calculates the distance between two images.\n",
"\n",
@ -2216,7 +2216,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We are calculating the difference between our \"ideal 3\" and each of the 1,010 3s in the validation set, for each of 28\\*28 images, resulting in the shape `[1010,28,28]`.\n",
"We are calculating the difference between our \"ideal 3\" and each of the 1,010 3s in the validation set, for each of 28×28 images, resulting in the shape `[1010,28,28]`.\n",
"\n",
"There are a couple of important points about how broadcasting is implemented, which make it valuable not just for expressivity but also for performance:\n",
"\n",
@ -5782,7 +5782,7 @@
"1. What is the difference between tensor rank and shape? How do you get the rank from the shape?\n",
"1. What are RMSE and L1 norm?\n",
"1. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?\n",
"1. Create a 3\\*3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.\n",
"1. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.\n",
"1. What is broadcasting?\n",
"1. Are metrics generally calculated using the training set, or the validation set? Why?\n",
"1. What is SGD?\n",

View File

@ -49,8 +49,8 @@
"When fast.ai first started there were three main datasets that people used for building and testing computer vision models:\n",
"\n",
"- ImageNet:: 1.3 million images of various sizes around 500 pixels across, in 1,000 categories, which took a few days to train\n",
"- MNIST:: 50,000 28\\*28-pixel grayscale handwritten digits\n",
"- CIFAR10:: 60,000 32\\*32-pixel color images in 10 classes\n",
"- MNIST:: 50,000 28×28-pixel grayscale handwritten digits\n",
"- CIFAR10:: 60,000 32×32-pixel color images in 10 classes\n",
"\n",
"The problem was that the smaller datasets didn't actually generalize effectively to the large ImageNet dataset. The approaches that worked well on ImageNet generally had to be developed and trained on ImageNet. This led to many people believing that only researchers with access to giant computing resources could effectively contribute to developing image classification algorithms.\n",
"\n",

View File

@ -1396,7 +1396,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The 8perplexity* metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.\n",
"The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The *perplexity* metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.\n",
"\n",
"Let's go back to the process diagram from the beginning of this chapter. The first arrow has been completed for us and made available as a pretrained model in fastai, and we've just built the `DataLoaders` and `Learner` for the second stage. Now we're ready to fine-tune our language model!"
]

View File

@ -65,7 +65,7 @@
"\n",
"It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n",
"\n",
"A convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3\\*3 matrix in the top right of <<basic_conv>>."
"A convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>."
]
},
{
@ -79,9 +79,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The 7\\*7 grid to the left is the *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel by each element of a 3\\*3 block of the image. The results of these multiplications are then added together. The diagram in <<basic_conv>> shows an example of applying a kernel to a single location in the image, the 3\\*3 block around cell 18.\n",
"The 7×7 grid to the left is the *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel by each element of a 3×3 block of the image. The results of these multiplications are then added together. The diagram in <<basic_conv>> shows an example of applying a kernel to a single location in the image, the 3×3 block around cell 18.\n",
"\n",
"Let's do this with code. First, we create a little 3\\*3 matrix like so:"
"Let's do this with code. First, we create a little 3×3 matrix like so:"
]
},
{
@ -148,7 +148,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we're going to take the top 3\\*3-pixel square of our image, and multiply each of those values by each item in our kernel. Then we'll add them up, like so:"
"Now we're going to take the top 3×3-pixel square of our image, and multiply each of those values by each item in our kernel. Then we'll add them up, like so:"
]
},
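A minimal sketch of that step (the image here is a random stand-in, and `apply_kernel` is a hypothetical helper, not necessarily the notebook's):

```python
import torch

# A top-edge kernel: -1s above, 1s below, as described in the text.
top_edge = torch.tensor([[-1.,-1.,-1.],
                         [ 0., 0., 0.],
                         [ 1., 1., 1.]])

img = torch.rand(28, 28)  # random stand-in for the digit image

# One convolution step: elementwise multiply the 3x3 block centered at
# (row, col) by the kernel, then sum the nine products.
def apply_kernel(row, col, kernel):
    return (img[row-1:row+2, col-1:col+2] * kernel).sum()

print(apply_kernel(5, 6, top_edge))
```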
{
@ -1323,9 +1323,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this little calculation is returning a high number where the 3\\*3-pixel square represents a top edge (i.e., where there are low values at the top of the square, and high values immediately underneath). That's because the `-1` values in our kernel have little impact in that case, but the `1` values have a lot.\n",
"As you can see, this little calculation is returning a high number where the 3×3-pixel square represents a top edge (i.e., where there are low values at the top of the square, and high values immediately underneath). That's because the `-1` values in our kernel have little impact in that case, but the `1` values have a lot.\n",
"\n",
"Let's look a tiny bit at the math. The filter will take any window of size 3\\*3 in our images, and if we name the pixel values like this:\n",
"Let's look a tiny bit at the math. The filter will take any window of size 3×3 in our images, and if we name the pixel values like this:\n",
"\n",
"$$\\begin{matrix} a1 & a2 & a3 \\\\ a4 & a5 & a6 \\\\ a7 & a8 & a9 \\end{matrix}$$\n",
"\n",
@ -1370,7 +1370,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But note that we can't apply it to the corner (e.g., location 0,0), since there isn't a complete 3\\*3 square there."
"But note that we can't apply it to the corner (e.g., location 0,0), since there isn't a complete 3×3 square there."
]
},
{
@ -1384,7 +1384,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can map `apply_kernel()` across the coordinate grid. That is, we'll be taking our 3\\*3 kernel, and applying it to each 3\\*3 section of our image. For instance, <<nopad_conv>> shows the positions a 3\\*3 kernel can be applied to in the first row of a 5\\*5 image."
"We can map `apply_kernel()` across the coordinate grid. That is, we'll be taking our 3×3 kernel, and applying it to each 3×3 section of our image. For instance, <<nopad_conv>> shows the positions a 3×3 kernel can be applied to in the first row of a 5×5 image."
]
},
{
@ -1504,21 +1504,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we mentioned before, a convolution is the operation of applying such a kernel over a grid in this way. In the paper [\"A Guide to Convolution Arithmetic for Deep Learning\"](https://arxiv.org/abs/1603.07285) there are many great diagrams showing how image kernels can be applied. Here's an example from the paper showing (at the bottom) a light blue 4\\*4 image, with a dark blue 3\\*3 kernel being applied, creating a 2\\*2 green output activation map at the top. "
"As we mentioned before, a convolution is the operation of applying such a kernel over a grid in this way. In the paper [\"A Guide to Convolution Arithmetic for Deep Learning\"](https://arxiv.org/abs/1603.07285) there are many great diagrams showing how image kernels can be applied. Here's an example from the paper showing (at the bottom) a light blue 4×4 image, with a dark blue 3×3 kernel being applied, creating a 2×2 green output activation map at the top. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Result of applying a 3\\*3 kernel to a 4\\*4 image\" width=\"782\" caption=\"Result of applying a 3\\*3 kernel to a 4\\*4 image (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_ex_four_conv\" src=\"images/att_00028.png\">"
"<img alt=\"Result of applying a 3×3 kernel to a 4×4 image\" width=\"782\" caption=\"Result of applying a 3×3 kernel to a 4×4 image (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_ex_four_conv\" src=\"images/att_00028.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the shape of the result. If the original image has a height of `h` and a width of `w`, how many 3\\*3 windows can we find? As you can see from the example, there are `h-2` by `w-2` windows, so the image we get has a result as a height of `h-2` and a width of `w-2`."
"Look at the shape of the result. If the original image has a height of `h` and a width of `w`, how many 3×3 windows can we find? As you can see from the example, there are `h-2` by `w-2` windows, so the image we get has a result as a height of `h-2` and a width of `w-2`."
]
},
{
@ -1633,7 +1633,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One batch contains 64 images, each of 1 channel, with 28\\*28 pixels. `F.conv2d` can handle multichannel (i.e., color) images too. A *channel* is a single basic color in an image—for regular full-color images there are three channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions `[channels, rows, columns]`.\n",
"One batch contains 64 images, each of 1 channel, with 28×28 pixels. `F.conv2d` can handle multichannel (i.e., color) images too. A *channel* is a single basic color in an image—for regular full-color images there are three channels, red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions `[channels, rows, columns]`.\n",
"\n",
"We'll see how to handle more than one channel later in this chapter. Kernels passed to `F.conv2d` need to be rank-4 tensors: `[channels_in, features_out, rows, columns]`. `edge_kernels` is currently missing one of these. We need to tell PyTorch that the number of input channels in the kernel is one, which we can do by inserting an axis of size one (this is known as a *unit axis*) in the first location, where the PyTorch docs show `in_channels` is expected. To insert a unit axis into a tensor, we use the `unsqueeze` method:"
]
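A quick sketch of that step, with a random stand-in for `edge_kernels`:

```python
import torch

# Random stand-in for the notebook's four stacked 3x3 edge kernels.
edge_kernels = torch.randn(4, 3, 3)

# F.conv2d weights are [features_out, channels_in, rows, cols], so the
# unit input-channel axis goes in position 1.
print(edge_kernels.unsqueeze(1).shape)  # torch.Size([4, 1, 3, 3])
```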
@ -1699,7 +1699,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shape shows we gave 64 images in the mini-batch, 4 kernels, and 26\\*26 edge maps (we started with 28\\*28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:"
"The output shape shows we gave 64 images in the mini-batch, 4 kernels, and 26×26 edge maps (we started with 28×28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:"
]
},
{
@ -1763,14 +1763,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"With a 5\\*5 input, 4\\*4 kernel, and 2 pixels of padding, we end up with a 6\\*6 activation map, as we can see in <<four_by_five_conv>>."
"With a 5×5 input, 4×4 kernel, and 2 pixels of padding, we end up with a 6×6 activation map, as we can see in <<four_by_five_conv>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"A 4\\*4 kernel with 5\\*5 input and 2 pixels of padding\" width=\"783\" caption=\"A 4\\*4 kernel with 5\\*5 input and 2 pixels of padding (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"four_by_five_conv\" src=\"images/att_00029.png\">"
"<img alt=\"A 4×4 kernel with 5×5 input and 2 pixels of padding\" width=\"783\" caption=\"A 4×4 kernel with 5×5 input and 2 pixels of padding (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"four_by_five_conv\" src=\"images/att_00029.png\">"
]
},
{
@ -1779,14 +1779,14 @@
"source": [
"If we add a kernel of size `ks` by `ks` (with `ks` an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even number for `ks` would require a different amount of padding on the top/bottom and left/right, but in practice we almost never use an even filter size.\n",
"\n",
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application, as in <<three_by_five_conv>>. This is known as a *stride-2* convolution. The most common kernel size in practice is 3\\*3, and the most common padding is 1. As you'll see, stride-2 convolutions are useful for decreasing the size of our outputs, and stride-1 convolutions are useful for adding layers without changing the output size."
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application, as in <<three_by_five_conv>>. This is known as a *stride-2* convolution. The most common kernel size in practice is 3×3, and the most common padding is 1. As you'll see, stride-2 convolutions are useful for decreasing the size of our outputs, and stride-1 convolutions are useful for adding layers without changing the output size."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"A 3\\*3 kernel with 5\\*5 input, stride-2 convolution, and 1 pixel of padding\" width=\"774\" caption=\"A 3\\*3 kernel with 5\\*5 input, stride-2 convolution, and 1 pixel of padding (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_by_five_conv\" src=\"images/att_00030.png\">"
"<img alt=\"A 3×3 kernel with 5×5 input, stride-2 convolution, and 1 pixel of padding\" width=\"774\" caption=\"A 3×3 kernel with 5×5 input, stride-2 convolution, and 1 pixel of padding (courtesy of Vincent Dumoulin and Francesco Visin)\" id=\"three_by_five_conv\" src=\"images/att_00030.png\">"
]
},
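The grid sizes quoted in these examples all follow from standard convolution arithmetic; a small helper (not part of the notebook) makes them easy to check:

```python
# Output grid size of a convolution over an n-by-n input.
def conv_out_size(n, ks, pad=0, stride=1):
    return (n + 2*pad - ks) // stride + 1

print(conv_out_size(4, 3))                    # 2: the 4x4 image, 3x3 kernel example
print(conv_out_size(5, 4, pad=2))             # 6: 5x5 input, 4x4 kernel, 2 pixels of padding
print(conv_out_size(5, 3, pad=1, stride=2))   # 3: the stride-2 diagram above
print(conv_out_size(28, 3, pad=1, stride=2))  # 14: stride 2 halves a 28x28 image
```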
{
@ -1816,7 +1816,7 @@
"source": [
"To explain the math behing convolutions, fast.ai student Matt Kleinsmith came up with the very clever idea of showing [CNNs from different viewpoints](https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c). In fact, it's so clever, and so helpful, we're going to show it here too!\n",
"\n",
"Here's our 3\\*3 pixel image, with each pixel labeled with a letter:"
"Here's our 3×3 pixel image, with each pixel labeled with a letter:"
]
},
{
@ -2017,7 +2017,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing to note here is that we didn't need to specify 28\\*28 as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel, so it needs to know how many pixels there are, but a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels and the kernel size, as we saw in the previous section.\n",
"One thing to note here is that we didn't need to specify 28×28 as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel, so it needs to know how many pixels there are, but a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels and the kernel size, as we saw in the previous section.\n",
"\n",
"Think about what the output shape is going to be, then let's try it and see:"
]
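A short sketch of this point: the weights of an `nn.Conv2d` depend only on the channel counts and the kernel size, so the same layer accepts any spatial size (the shapes here are hypothetical):

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
print(conv.weight.shape)                       # torch.Size([4, 1, 3, 3])

# The same layer runs unchanged on different image sizes.
print(conv(torch.randn(64, 1, 28, 28)).shape)  # torch.Size([64, 4, 26, 26])
print(conv(torch.randn(64, 1, 96, 96)).shape)  # torch.Size([64, 4, 94, 94])
```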
@ -2046,7 +2046,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is not something we can use to do classification, since we need a single output activation per image, not a 28\\*28 map of activations. One way to deal with this is to use enough stride-2 convolutions such that the final layer is size 1. That is, after one stride-2 convolution the size will be 14\\*14, after two it will be 7\\*7, then 4\\*4, 2\\*2, and finally size 1.\n",
"This is not something we can use to do classification, since we need a single output activation per image, not a 28×28 map of activations. One way to deal with this is to use enough stride-2 convolutions such that the final layer is size 1. That is, after one stride-2 convolution the size will be 14×14, after two it will be 7×7, then 4×4, 2×2, and finally size 1.\n",
"\n",
"Let's try that now. First, we'll define a function with the basic parameters we'll use in each convolution:"
]
@ -2325,7 +2325,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So we have 1 input channel, 4 output channels, and a 3\\*3 kernel. Let's check the weights of the first convolution:"
"So we have 1 input channel, 4 output channels, and a 3×3 kernel. Let's check the weights of the first convolution:"
]
},
{
@ -2418,7 +2418,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the cell with the green border is the cell we clicked on, and the blue highlighted cells are its *precedents*—that is, the cells used to calculate its value. These cells are the corresponding 3\\*3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *trace precedents* again, to see what cells are used to calculate these inputs. <<preced2>> shows what happens."
"Here, the cell with the green border is the cell we clicked on, and the blue highlighted cells are its *precedents*—that is, the cells used to calculate its value. These cells are the corresponding 3×3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Let's now click *trace precedents* again, to see what cells are used to calculate these inputs. <<preced2>> shows what happens."
]
},
{
@ -2432,7 +2432,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we have just two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7\\*7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7\\*7 area is the *receptive field* in the input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.\n",
"In this example, we have just two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7×7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7×7 area is the *receptive field* in the input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.\n",
"\n",
"As you see from this example, the deeper we are in the network (specifically, the more stride-2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer is. We now know that in the deeper layers of the network we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of saying the same thing we mentionedin the previous section: when we introduce a stride-2 conv in our network, we should also increase the number of channels."
]
@ -2826,9 +2826,9 @@
"\n",
"As we discussed, we generally want to double the number of filters each time we have a stride-2 layer. One way to increase the number of filters throughout our network is to double the number of activations in the first layerthen every layer after that will end up twice as big as in the previous version as well.\n",
"\n",
"But there is a subtle problem with this. Consider the kernel that is being applied to each pixel. By default, we use a 3\\*3-pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we will be using nine pixels to calculate eight numbers. That means it isn't really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.\n",
"But there is a subtle problem with this. Consider the kernel that is being applied to each pixel. By default, we use a 3×3-pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we will be using nine pixels to calculate eight numbers. That means it isn't really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.\n",
"\n",
"To fix this, we can use a larger kernel in the first layer. If we use a kernel of 5\\*5 pixels then there are 25 pixels being used at each kernel application. Creating eight filters from this will mean the neural net will have to find some useful features:"
"To fix this, we can use a larger kernel in the first layer. If we use a kernel of 5×5 pixels then there are 25 pixels being used at each kernel application. Creating eight filters from this will mean the neural net will have to find some useful features:"
]
},
{
@ -3674,8 +3674,8 @@
"source": [
"1. What is a \"feature\"?\n",
"1. Write out the convolutional kernel matrix for a top edge detector.\n",
"1. Write out the mathematical operation applied by a 3\\*3 kernel to a single pixel in an image.\n",
"1. What is the value of a convolutional kernel apply to a 3\\*3 matrix of zeros?\n",
"1. Write out the mathematical operation applied by a 3×3 kernel to a single pixel in an image.\n",
"1. What is the value of a convolutional kernel apply to a 3×3 matrix of zeros?\n",
"1. What is \"padding\"?\n",
"1. What is \"stride\"?\n",
"1. Create a nested list comprehension to complete any task that you choose.\n",

View File

@ -103,14 +103,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we looked at MNIST we were dealing with 28\\*28-pixel images. For Imagenette we are going to be training with 128\\*128-pixel images. Later, we would like to be able to use larger images as well—at least as big as 224\\*224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"When we looked at MNIST we were dealing with 28×28-pixel images. For Imagenette we are going to be training with 128×128-pixel images. Later, we would like to be able to use larger images as well—at least as big as 224×224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"\n",
"The approach we used was to ensure that there were enough stride-2convolutions such that the final layer would have a grid size of 1. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so, a matrix of activations for a mini-batch). We could do the same thing for Imagenette, but that's would cause two problems:\n",
"\n",
"- We'd need lots of stride-2 layers to make our grid 1\\*1 at the end—perhaps more than we would otherwise choose.\n",
"- We'd need lots of stride-2 layers to make our grid 1×1 at the end—perhaps more than we would otherwise choose.\n",
"- The model would not work on images of any size other than the size we originally trained on.\n",
"\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than 1\\*1. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always took. The most famous example is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only did it not work with images other than those of the same size used in the training set, but it required a lot of memory, because flattening out the convolutional layer resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than 1×1. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always took. The most famous example is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only did it not work with images other than those of the same size used in the training set, but it required a lot of memory, because flattening out the convolutional layer resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"\n",
"This problem was solved through the creation of *fully convolutional networks*. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:"
]
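The function itself falls outside this hunk; in essence it is just a mean over the two grid dimensions. A sketch:

```python
import torch

# Average over the two spatial dimensions of a [bs, ch, h, w] tensor.
def avg_pool(x): return x.mean((2, 3))

acts = torch.randn(64, 512, 4, 4)  # hypothetical final conv activations
print(avg_pool(acts).shape)        # torch.Size([64, 512])
```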
@ -170,9 +170,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height, and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing 1\\*1 dimension like we did in our previous model. \n",
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height, and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing 1×1 dimension like we did in our previous model. \n",
"\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size. For instance, max pooling layers of size 2, which were very popular in older CNNs, reduce the size of our image by half on each dimension by taking the maximum of each 2\\*2 window (with a stride of 2).\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size. For instance, max pooling layers of size 2, which were very popular in older CNNs, reduce the size of our image by half on each dimension by taking the maximum of each 2×2 window (with a stride of 2).\n",
"\n",
"As before, we can define a `Learner` with our custom model and then train it on the data we grabbed earlier:"
]
@ -429,7 +429,7 @@
"\n",
"The issue is that with a stride of, say, 2 on one of the convolutions, the grid size of the output activations will be half the size on each axis of the input. So then we can't add that back to `x` in `forward` because `x` and the output activations have different dimensions. The same basic issue occurs if `ni!=nf`: the shapes of the input and output connections won't allow us to add them together.\n",
"\n",
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer that takes 2\\*2 patches from the input and replaces them with their average.\n",
"To fix this, we need a way to change the shape of `x` to match the result of `self.convs`. Halving the grid size can be done using an average pooling layer with a stride of 2: that is, a layer that takes 2×2 patches from the input and replaces them with their average.\n",
"\n",
"Changing the number of channels can be done by using a convolution. We want this skip connection to be as close to an identity map as possible, however, which means making this convolution as simple as possible. The simplest possible convolution is one where the kernel size is 1. That means that the kernel is size `ni*nf*1*1`, so it's only doing a dot product over the channels of each input pixel—it's not combining across pixels at all. This kind of *1x1 convolution* is very widely used in modern CNNs, so take a moment to think about how it works."
]
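A minimal sketch of that identity path, assuming hypothetical channel counts `ni` and `nf`:

```python
import torch
from torch import nn

ni, nf = 64, 128
pool = nn.AvgPool2d(2, ceil_mode=True)     # average each 2x2 patch: halves the grid
idconv = nn.Conv2d(ni, nf, kernel_size=1)  # 1x1 conv: a dot product over channels only

x = torch.randn(32, ni, 8, 8)
print(idconv(pool(x)).shape)  # torch.Size([32, 128, 4, 4]), matching the conv path
```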
@ -786,9 +786,9 @@
"source": [
"The reason that we have a stem of plain convolutional layers, instead of ResNet blocks, is based on a very important insight about all deep convolutional neural networks: the vast majority of the computation occurs in the early layers. Therefore, we should keep the early layers as fast and simple as possible.\n",
"\n",
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128-pixel input image. If it is a stride-1 convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4\\*4 or even 2\\*2, so there are far fewer kernel applications to do.\n",
"To see why so much computation occurs in the early layers, consider the very first convolution on a 128-pixel input image. If it is a stride-1 convolution, then it will apply the kernel to every one of the 128×128 pixels. That's a lot of work! In the later layers, however, the grid size could be as small as 4×4 or even 2×2, so there are far fewer kernel applications to do.\n",
"\n",
"On the other hand, the first-layer convolution only has 3 input features and 32 output features. Since it is a 3\\*3 kernel, this is 3×32×3×3 = 864 parameters in the weights. But the last convolution will have 256 input features and 512 output features, resulting in 1,179,648 weights! So the first layers contain the vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
"On the other hand, the first-layer convolution only has 3 input features and 32 output features. Since it is a 3×3 kernel, this is 3×32×3×3 = 864 parameters in the weights. But the last convolution will have 256 input features and 512 output features, resulting in 1,179,648 weights! So the first layers contain the vast majority of the computation, but the last layers contain the vast majority of the parameters.\n",
"\n",
"A ResNet block takes more computation than a plain convolutional block, since (in the stride-2 case) a ResNet block has three convolutions and a pooling layer. That's why we want to have plain convolutions to start off our ResNet.\n",
"\n",
@ -935,7 +935,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of stacking two convolutions with a kernel size of 3, bottleneck layers use three different convolutions: two 1\\*1 (at the beginning and the end) and one 3\\*3, as shown on the right in <<resnet_compare>>."
"Instead of stacking two convolutions with a kernel size of 3, bottleneck layers use three different convolutions: two 1×1 (at the beginning and the end) and one 3×3, as shown on the right in <<resnet_compare>>."
]
},
{
@ -949,7 +949,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is that useful? 1\\*1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see in the illustration, the number of filters in and out is 4 times higher (256 instead of 64) diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"Why is that useful? 1×1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see in the illustration, the number of filters in and out is 4 times higher (256 instead of 64) diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"\n",
"Let's try replacing our `ResBlock` with this bottleneck design:"
]
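As a sketch of the bottleneck stack just described (batchnorm and ReLU omitted for brevity; this is not fastai's exact `ResBlock`):

```python
from torch import nn

# 1x1 to shrink the channels, 3x3 in the middle, 1x1 to expand back:
# here 256 -> 64 -> 64 -> 256.
def bottleneck_convs(ni, nf):
    return nn.Sequential(
        nn.Conv2d(ni, nf//4, kernel_size=1),
        nn.Conv2d(nf//4, nf//4, kernel_size=3, padding=1),
        nn.Conv2d(nf//4, nf, kernel_size=1))

blk = bottleneck_convs(256, 256)
```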
@ -1222,7 +1222,7 @@
"1. What is the basic equation for a ResNet block (ignoring batchnorm and ReLU layers)?\n",
"1. What do ResNets have to do with residuals?\n",
"1. How do we deal with the skip connection when there is a stride-2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1\\*1 convolution in terms of a vector dot product?\n",
"1. How can we express a 1×1 convolution in terms of a vector dot product?\n",
"1. Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?\n",
"1. What does the `noop` function return?\n",
"1. Explain what is shown in <<resnet_surface>>.\n",
@ -1246,7 +1246,7 @@
"metadata": {},
"source": [
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride-2 layers). How does it compare to a network without such a pooling layer?\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1\\*1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1×1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. Write a \"top-5 accuracy\" function using plain PyTorch or plain Python.\n",
"1. Train a model on Imagenette for more epochs, with and without label smoothing. Take a look at the Imagenette leaderboards and see how close you can get to the best results shown. Read the linked pages describing the leading approaches."
]

View File

@ -195,11 +195,11 @@
"\n",
"The way we do this is to start with the exact same approach to developing a CNN head as we saw in the previous problem. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. Then we replace those layers with our custom head, which does the generative task.\n",
"\n",
"There was a lot of handwaving in that last sentence! How on earth do we create a CNN head that generates an image? If we start with, say, a 224-pixel input image, then at the end of the ResNet body we will have a 7\\*7 grid of convolutional activations. How can we convert that into a 224-pixel segmentation mask?\n",
"There was a lot of handwaving in that last sentence! How on earth do we create a CNN head that generates an image? If we start with, say, a 224-pixel input image, then at the end of the ResNet body we will have a 7×7 grid of convolutional activations. How can we convert that into a 224-pixel segmentation mask?\n",
"\n",
"Naturally, we do this with a neural network! So we need some kind of layer that can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7\\*7 grid with four pixels in a 2\\*2 square. Each of those four pixels will have the same value—this is known as *nearest neighbor interpolation*. PyTorch provides a layer that does this for us, so one option is to create a head that contains stride-1 convolutional layers (along with batchnorm and ReLU layers as usual) interspersed with 2\\*2 nearest neighbor interpolation layers. In fact, you can try this now! See if you can create a custom head designed like this, and try it on the CamVid segmentation task. You should find that you get some reasonable results, although they won't be as good as our <<chapter_intro>> results.\n",
"Naturally, we do this with a neural network! So we need some kind of layer that can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7×7 grid with four pixels in a 2×2 square. Each of those four pixels will have the same value—this is known as *nearest neighbor interpolation*. PyTorch provides a layer that does this for us, so one option is to create a head that contains stride-1 convolutional layers (along with batchnorm and ReLU layers as usual) interspersed with 2×2 nearest neighbor interpolation layers. In fact, you can try this now! See if you can create a custom head designed like this, and try it on the CamVid segmentation task. You should find that you get some reasonable results, although they won't be as good as our <<chapter_intro>> results.\n",
"\n",
"Another approach is to replace the nearest neighbor and convolution combination with a *transposed convolution*, otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between all the pixels in the input. This is easiest to see with a picture—<<transp_conv>> shows a diagram from the excellent [convolutional arithmetic paper](https://arxiv.org/abs/1603.07285) we discussed in <<chapter_convolutions>>, showing a 3\\*3 transposed convolution applied to a 3\\*3 image."
"Another approach is to replace the nearest neighbor and convolution combination with a *transposed convolution*, otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between all the pixels in the input. This is easiest to see with a picture—<<transp_conv>> shows a diagram from the excellent [convolutional arithmetic paper](https://arxiv.org/abs/1603.07285) we discussed in <<chapter_convolutions>>, showing a 3×3 transposed convolution applied to a 3×3 image."
]
},
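A plain-PyTorch sketch of the upscaling effect (the text points to fastai's `ConvLayer` with `transpose=True`; `nn.ConvTranspose2d` is the underlying layer, and the shapes here are hypothetical):

```python
import torch
from torch import nn

upconv = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
x = torch.randn(1, 256, 7, 7)
print(upconv(x).shape)  # torch.Size([1, 128, 14, 14]): the grid size doubles
```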
{
@ -215,7 +215,7 @@
"source": [
"As you see, the result of this is to increase the size of the input. You can try this out now by using fastai's `ConvLayer` class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.\n",
"\n",
"Neither of these approaches, however, works really well. The problem is that our 7\\*7 grid simply doesn't have enough information to create a 224\\*224-pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use *skip connections*, like in a ResNet, but skipping from the activations in the body of the ResNet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This approach, illustrated in <<unet>>, was developed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the 2015 paper [\"U-Net: Convolutional Networks for Biomedical Image Segmentation\"](https://arxiv.org/abs/1505.04597). Although the paper focused on medical applications, the U-Net has revolutionized all kinds of generative vision models."
"Neither of these approaches, however, works really well. The problem is that our 7×7 grid simply doesn't have enough information to create a 224×224-pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use *skip connections*, like in a ResNet, but skipping from the activations in the body of the ResNet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This approach, illustrated in <<unet>>, was developed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the 2015 paper [\"U-Net: Convolutional Networks for Biomedical Image Segmentation\"](https://arxiv.org/abs/1505.04597). Although the paper focused on medical applications, the U-Net has revolutionized all kinds of generative vision models."
]
},
{
@ -229,7 +229,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2\\*2 max pooling instead of stride-2 convolutions, since this paper was written before ResNets came along) and the transposed convolutional (\"up-conv\") layers on the right. Then extra skip connections are shown as gray arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-Net!\"\n",
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2×2 max pooling instead of stride-2 convolutions, since this paper was written before ResNets came along) and the transposed convolutional (\"up-conv\") layers on the right. Then extra skip connections are shown as gray arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-Net!\"\n",
"\n",
"With this architecture, the input to the transposed convolutions is not just the lower-resolution grid in the preceding layer, but also the higher-resolution grid in the ResNet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class that autogenerates an architecture of the right size based on the data provided.\n",
"\n",

View File

@ -30,16 +30,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This chapter begins a journey where we will go from the very basics and dig inside what was hidden in the models we used in the previous chapters. We will be covering many of the same things we've seen before, but this time around we'll be looking much more closely at the implementation details, and much less closely at the practical issues of how and why things are as they are.\n",
"This chapter begins a journey where we will dig deep into the internals of the models we used in the previous chapters. We will be covering many of the same things we've seen before, but this time around we'll be looking much more closely at the implementation details, and much less closely at the practical issues of how and why things are as they are.\n",
"\n",
"We will build everything from scratch, only using basic indexing into a tensor. We write a neural net from the foundations, then we will implement our own backpropagation from scratch, so we'll know what is happening in PyTorch when we do `loss.backward()`. We'll also see how to extend PyTorch with custom *autograd* functions that allow you to specify your own forward and backward computations."
"We will build everything from scratch, only using basic indexing into a tensor. We]ll write a neural net from the ground up, then implement backpropagation manually, so we know exactly what's happening in PyTorch when we call `loss.backward`. We'll also see how to extend PyTorch with custom *autograd* functions that allow us to specify our own forward and backward computations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Neural Net Layer from Scratch"
"## Building a Neural Net Layer from Scratch"
]
},
{
@ -60,17 +60,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A neuron receives a given number of inputs and has an internal weight for each of them. It then sums those weighted inputs to produce an output and add an inner bias. In math, this can be written:\n",
"A neuron receives a given number of inputs and has an internal weight for each of them. It sums those weighted inputs to produce an output and adds an inner bias. In math, this can be written as:\n",
"\n",
"$$ out = \\sum_{i=1}^{n} x_{i} w_{i} + b$$\n",
"\n",
"if we name our inputs $(x_{1},\\dots,x_{n})$, our weights $(w_{1},\\dots,w_{n})$ and our bias $b$. In code this translates into:\n",
"if we name our inputs $(x_{1},\\dots,x_{n})$, our weights $(w_{1},\\dots,w_{n})$, and our bias $b$. In code this translates into:\n",
"\n",
"```python\n",
"output = sum([x*w for x,w in zip(inputs,weights)]) + bias\n",
"```\n",
"\n",
"This output is then fed into a non-linear function before being sent to another neuron called an *activation function*, and the most common function used in Deep Learning for this the *Rectified Linear Unit* or *ReLU*, which, as we've seen, is a fancy way of saying\n",
"This output is then fed into a nonlinear function called an *activation function* before being sent to another neuron. In deep learning the most common of these is the *rectified Linear unit*, or *ReLU*, which, as we've seen, is a fancy way of saying:\n",
"```python\n",
"def relu(x): return x if x >= 0 else 0\n",
"```"
@ -80,25 +80,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A Deep Learning model is then built by stacking a lot of those neurons in successive layers. We create a first layer with a certain number of neurons (usually called *hidden size*) and link all the inputs to each of those neurons. Such a layer is often called *fully connected layer* or a *dense layer* (for densely connected) or a *linear layer*. \n",
"A deep learning model is then built by stacking a lot of those neurons in successive layers. We create a first layer with a certain number of neurons (known as *hidden size*) and link all the inputs to each of those neurons. Such a layer is often called a *fully connected layer* or a *dense layer* (for densely connected), or a *linear layer*. \n",
"\n",
"If you have done a little bit of linear algebra, you may remember than when you have a lot of:\n",
"It requires to compute, for each `input` in our batch and each neuron with a give `weight`, the dot product:\n",
"\n",
"```python\n",
"sum([x*w for x,w in zip(input,weight)])\n",
"```\n",
"\n",
"...for each `input` in our batch and the `weight` of each neuron, it's the equivalent of one *matrix multiplication*. More precisely, if our inputs are in a matrix `x` which is `batch_size` by `n_inputs`, and if we have grouped the weights of our neurons in a matrix `w` which is `n_neurons` by `n_inputs` (each neuron must have the same number of weights as they have inputs) and all the biases in a vector `b` of size `n_neurons`, then the output of this fully connected layer is\n",
"If you have done a little bit of linear algebra, you may remember that having a lot of those dot products happens when you do a *matrix multiplication*. More precisely, if our inputs are in a matrix `x` with a size of `batch_size` by `n_inputs`, and if we have grouped the weights of our neurons in a matrix `w` of size `n_neurons` by `n_inputs` (each neuron must have the same number of weights as it has inputs) and all the biases in a vector `b` of size `n_neurons`, then the output of this fully connected layer is:\n",
"\n",
"```python\n",
"y = x @ w.t() + b\n",
"```\n",
"\n",
"where `@` represents the matrix product and `w.t()` is the transpose matrix of `w`. The output `y` is then of size `batch_size` by `n_neurons` and in position `(i,j)`, we have (for the mathy folks out there):\n",
"where `@` represents the matrix product and `w.t()` is the transpose matrix of `w`. The output `y` is then of size `batch_size` by `n_neurons`, and in position `(i,j)` we have (for the mathy folks out there):\n",
"\n",
"$$y_{i,j} = \\sum_{k=1}^{n} x_{i,k} w_{k,j} + b_{j}$$\n",
"\n",
"or in code:\n",
"Or in code:\n",
"\n",
"```python\n",
"y[i,j] = sum([a * b for a,b in zip(x[i,:],w[j,:])]) + b[j]\n",
@ -124,7 +124,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's write a function that computes the matrix product of two tensors, before we allow ourselves to use the PyTorch version of it. We will only use the indexing in PyTorch tensors."
"Let's write a function that computes the matrix product of two tensors, before we allow ourselves to use the PyTorch version of it. We will only use the indexing in PyTorch tensors:"
]
},
{
@ -141,7 +141,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll need three nested for loops: one for the row indices, one for the column indices and one for the inner sum. `ac`, `ar` stand for number of columns of `a`, number of rows of `a` respectively (same convention for `b`) and we make sure the matrix product is possible by checking that `a` has as many columns as `b` has rows."
"We'll need three nested `for` loops: one for the row indices, one for the column indices, and one for the inner sum. `ac` and `ar` stand for number of columns of `a` and number of rows of `a`, respectively (the same convention is followed for `b`), and we make sure calculating the matrix product is possible by checking that `a` has as many columns as `b` has rows:"
]
},
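The function body falls outside this hunk; a sketch consistent with the description:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape             # rows and columns of a
    br, bc = b.shape             # rows and columns of b
    assert ac == br              # inner dimensions must match
    c = torch.zeros(ar, bc)
    for i in range(ar):          # rows of the result
        for j in range(bc):      # columns of the result
            for k in range(ac):  # the inner sum
                c[i, j] += a[i, k] * b[k, j]
    return c
```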
{
@ -165,7 +165,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To test this out, we'll pretend (using random matrices) that we're working with a small batch of 5 MNIST images, flattened into `28*28` vectors, and a linear model to turn them into 10 activations:"
"To test this out, we'll pretend (using random matrices) that we're working with a small batch of 5 MNIST images, flattened into 28×28 vectors, with linear model to turn them into 10 activations:"
]
},
{
@ -182,7 +182,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's time our function, using the Jupyter \"magic\" `%time`:"
"Let's time our function, using the Jupyter \"magic\" command `%time`:"
]
},
{
@ -207,7 +207,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and how does that compare to PyTorch's builtin?"
"And see how that compares to PyTorch's built-in `@`:"
]
},
{
@ -231,11 +231,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, in Python three nested loops is a very bad idea! Python is a slow language, and this isn't going to be very efficient. We see here that PyTorch is around 100,000 times faster than Python--and that's before we even start using the GPU!\n",
"As we can see, in Python three nested loops is a very bad idea! Python is a slow language, and this isn't going to be very efficient. We see here that PyTorch is around 100,000 times faster than Pythonand that's before we even start using the GPU!\n",
"\n",
"Where does this difference come from? That's because PyTorch didn't write its matrix multiplication in Python but in C++ to make it fast. In general, whenever we do some computations on tensors, we will need to *vectorize* them so that we can take advantage of the speed of PyTorch, usually by using two techniques: elementwise arithmetic and broadcasting. \n",
"\n",
"We will show how to do this on our example of matrix multiplication."
"Where does this difference come from? PyTorch didn't write its matrix multiplication in Python, but rather in C++ to make it fast. In general, whenever we do computations on tensors we will need to *vectorize* them so that we can take advantage of the speed of PyTorch, usually by using two techniques: elementwise arithmetic and broadcasting."
]
},
{
@ -249,7 +247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"All the basic operators (+,-,\\*,/,>,<,==) can be applied element-wise. That means if we write `a+b` for two tensors `a` and `b` that have the same shape, we will get a tensor with the sums of one element of `a` with one element of `b."
"All the basic operators (`+`, `-`, `*`, `/`, `>`, `<`, `==`) can be applied elementwise. That means if we write `a+b` for two tensors `a` and `b` that have the same shape, we will get a tensor composed of the sums the elements of `a` and `b`:"
]
},
{
@ -278,7 +276,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The booleans operators will return an array of booleans:"
"The Booleans operators will return an array of Booleans:"
]
},
{
@ -305,7 +303,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to know if every element of `a` is less than the corresponding element in `b`, or if two tensors are equals, we need to combine those elementwise operations with `torch.all`."
"If we want to know if every element of `a` is less than the corresponding element in `b`, or if two tensors are equal, we need to combine those elementwise operations with `torch.all`:"
]
},
{
@ -332,7 +330,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that reduction operations (that returns only one element) like `all()`, `sum()` or `mean()` return tensors with only one element calles rank-0 tensors. If you want to convert it to a plain Python boolean or number, you need to call `.item()`."
"Reduction operations like `all()`, `sum()` and `mean()` return tensors with only one element, called rank-0 tensors. If you want to convert this to a plain Python Boolean or number, you need to call `.item()`:"
]
},
{
@ -359,7 +357,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The elementwise operations work on tensors of any ranks, as long as they have the same shape."
"The elementwise operations work on tensors of any rank, as long as they have the same shape:"
]
},
{
@ -389,7 +387,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"However you can't have element-wise operations of tensors that don't have the same shape (unless they are broadcastable, see below)."
"However you can't perform elementwise operations on tensors that don't have the same shape (unless they are broadcastable, as discussed in the next section):"
]
},
{
@ -418,11 +416,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"With element-wise arithmetic, we can remove one of our three nested loops: we can multiply the tensors that correspond to the `i`-th row of `a` and the `j`-th column of `b` before summing all the elements, which will speed up things because the inner loop will now be executed by PyTorch at C speed. \n",
"With elementwise arithmetic, we can remove one of our three nested loops: we can multiply the tensors that correspond to the `i`-th row of `a` and the `j`-th column of `b` before summing all the elements, which will speed things up because the inner loop will now be executed by PyTorch at C speed. \n",
"\n",
"To access one row/column, we can simply write `a[i,:]` or `b[:,j]`. The column means take everything in that dimension. We could restrict and only take a slice on this particular dimension by passing a range like `1:5` instead of just `:`. In that case, we would take the elements in column 1 to 4 (the last part is always excluded). \n",
"To access one column or row, we can simply write `a[i,:]` or `b[:,j]`. The `:` means take everything in that dimension. We could restrict this and take only a slice of that particular dimension by passing a range, like `1:5`, instead of just `:`. In that case, we would take the elements in columns or rows 1 to 4 (the second number is noninclusive). \n",
"\n",
"One simplification is that we can always omit trailing columns, so `a[i,:]` can be abbreviated to `a[i]`. With all of that, we can write a new version of our matrix multiplication:"
"One simplification is that we can always omit a trailing colon, so `a[i,:]` can be abbreviated to `a[i]`. With all of that in mind, we can write a new version of our matrix multiplication:"
]
},
{
@ -462,7 +460,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We are already ~700 times faster, just by removing that inner for loop! And that is just the beginning. By combining this with broadcasting, we can remove another loop and get an even more important speed-up."
"We're already ~700 times faster, just by removing that inner `for` loop! And that's just the beginning—with broadcasting we can remove another loop and get an even more important speed up."
]
},
{
@ -476,9 +474,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we discussed in <<chapter_mnist_basics>>, broadcasting is a term introduced by the [numpy library](https://docs.scipy.org/doc/) that describes how tensor of different ranks are treated during arithmetic operations. For instance, it's obvious there is no way to add a 3 by 3 matrix with a 4 by 5 matrix, but what if we want to add one scalar (which can be represented as a 1 by 1 tensor) with a matrix? Or a vector of size 3 with a 3 by 4 matrix? In both cases, we can find a way to make sense of what the operation could be.\n",
"As we discussed in <<chapter_mnist_basics>>, broadcasting is a term introduced by the [NumPy library](https://docs.scipy.org/doc/) that describes how tensors of different ranks are treated during arithmetic operations. For instance, it's obvious there is no way to add a 3×3 matrix with a 4×5 matrix, but what if we want to add one scalar (which can be represented as a 1×1 tensor) with a matrix? Or a vector of size 3 with a 3×4 matrix? In both cases, we can find a way to make sense of this operation.\n",
"\n",
"Broadcasting gives specific rules to codify when shapes are compatible when trying to do an element-wise operation, and how the tensor of the smaller shape is expanded to match the tensor of the bigger shape. It's essential to master those rules if you want to be able to write code that executes quickly. In this section, we'll expand our previous treatment of broadcasting to understand these rules."
"Broadcasting gives specific rules to codify when shapes are compatible when trying to do an elementwise operation, and how the tensor of the smaller shape is expanded to match the tensor of the bigger shape. It's essential to master those rules if you want to be able to write code that executes quickly. In this section, we'll expand our previous treatment of broadcasting to understand these rules."
]
},
{
@ -492,7 +490,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Broadcasting with a scalar is the easiest broadcating: when we have a tensor `a` and a scalar, we just imagine a tensor of the same shape as `a` filled with that scalar and perform the operation."
"Broadcasting with a scalar is the easiest type of broadcating. When we have a tensor `a` and a scalar, we just imagine a tensor of the same shape as `a` filled with that scalar and perform the operation:"
]
},
{
@ -520,7 +518,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"How are we able to do this comparison? 0 is being *broadcast* to have the same dimensions as `a`. Note that this is done without creating a tensor full of zeros in memory (that would be very inefficient). \n",
"How are we able to do this comparison? `0` is being *broadcast* to have the same dimensions as `a`. Note that this is done without creating a tensor full of zeros in memory (that would be very inefficient). \n",
"\n",
"This is very useful if you want to normalize your dataset by subtracting the mean (a scalar) from the entire data set (a matrix) and dividing by the standard deviation (another scalar):"
]
@ -552,7 +550,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you could have different means for each row of the matrix, in which case you would need to broadcast a vector to a matrix."
"What if have different means for each row of the matrix? in that case you will need to broadcast a vector to a matrix."
]
},
{
@ -566,7 +564,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also broadcast a vector to a matrix:"
"We can broadcast a vector to a matrix as follows:"
]
},
{
@ -617,7 +615,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the elements of `c` are expanded to make three rows that match, and this way the operation is possible. Again, behind the scenes PyTorch doesn't create three copies of `c` in memory. This is done by the `expand_as` method behind the scenes:"
"Here the elements of `c` are expanded to make three rows that match, making the operation possible. Again, PyTorch doesn't actually create three copies of `c` in memory. This is done by the `expand_as` method behind the scenes:"
]
},
{
@ -677,7 +675,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Even if it has officially 9 elements, the memory used is only 3 scalars. It's possible with a clever trick by giving a *stride* of 0 on that dimension (which means that when it looks for the next row by adding the stride, it doesn't move)."
"Even though the tensor officially has nine elements, only three scalars are stored in memory. This is possible thanks to the clever trick of giving that dimension a *stride* of 0 (which means that when PyTorch looks for the next row by adding the stride, it doesn't move):"
]
},
{
@ -704,7 +702,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `m` is of size 3 by 3, there were two ways to do broadcasting. The fact it was done on the last dimension is a convention that comes from the rules of broadcasting and has nothing to do with the way we ordered our tensors:"
"Since `m` is of size 3×3, there are two ways to do broadcasting. The fact it was done on the last dimension is a convention that comes from the rules of broadcasting and has nothing to do with the way we ordered our tensors. If instead we do this, we get the same result:"
]
},
{
@ -733,7 +731,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We get the same result. In fact it's only possible to broadcast a vector of size `n` with a matrix of size `m` by `n`:"
"In fact, it's only possible to broadcast a vector of size `n` with a matrix of size `m` by `n`:"
]
},
{
@ -793,7 +791,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to broadcast in the other dimension, we have to change the shape of our vector to make it a 3 by 1 matrix. This is done with the `unsqueeze` method in PyTorch."
"If we want to broadcast in the other dimension, we have to change the shape of our vector to make it a 3×1 matrix. This is done with the `unsqueeze` method in PyTorch:"
]
},
{
@ -823,7 +821,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And this time, `c` is expanded on the columns side."
"This time, `c` is expanded on the column side:"
]
},
{
@ -852,7 +850,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Like before the corresponding storage contains only three scalars."
"Like before, only three scalars are stored in memory:"
]
},
{
@ -883,7 +881,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And the expanded tensor has the right shape by giving it a stride of 0 on the column dimension."
"And the expanded tensor has the right shape because the column dimension has a stride of 0:"
]
},
{
@ -910,7 +908,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The way broadcasting works is that if we need to add dimensions, the default is to add them at the beginning. When we were broadcasting before, it was doing `c.unsqueeze(0)` behind the scenes."
"With broadcasting, by default if we need to add dimensions, they are added at the beginning. When we were broadcasting before, Pytorch was doing `c.unsqueeze(0)` behind the scenes:"
]
},
{
@ -938,7 +936,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `unsqueeze` command can be replaced by `None` indexing."
"The `unsqueeze` command can be replaced by `None` indexing:"
]
},
{
@ -965,7 +963,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always omit traiiling columns, and `...` means all preceding dimensions:"
"You can always omit trailing colons, and `...` means all preceding dimensions:"
]
},
{
@ -992,7 +990,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"With this, we can remove another for loop in our matrix multiplication function: instead of multiplying `a[i]` with `b[:,j]`, we can multiply `a[i]` with the whole matrix `b` using broadcasting, then sum all the results."
"With this, we can remove another `for` loop in our matrix multiplication function. Now, instead of multiplying `a[i]` with `b[:,j]`, we can multiply `a[i]` with the whole matrix `b` using broadcasting, then sum the results:"
]
},
{
@ -1033,26 +1031,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now 3,700 times faster than our first implementation! Now let's discuss the exact rules of broadcasting."
"We're now 3,700 times faster than our first implementation! Before we move on, let's discuss the rules of broadcasting in a little more detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Broadcasting Rules"
"#### Broadcasting rules"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When operating on two tensors, PyTorch compares their shapes element-wise. It starts with the *trailing dimensions*, and works its way backward, adding 1 when it meets empty dimensions. Two dimensions are *compatible* when\n",
"When operating on two tensors, PyTorch compares their shapes elementwise. It starts with the *trailing dimensions* and works its way backward, adding 1 when it meets empty dimensions. Two dimensions are *compatible* when one of the following is true:\n",
"\n",
"- they are equal, or\n",
"- one of them is 1, in which case that dimension is broadcasted to make it the same size\n",
"- They are equal.\n",
"- One of them is 1, in which case that dimension is broadcast to make it the same as the other.\n",
"\n",
"Arrays do not need to have the same number of dimensions. For example, if you have a `256*256*3` array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:\n",
"Arrays do not need to have the same number of dimensions. For example, if you have a 256×256×3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with three values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:\n",
"\n",
"```\n",
"Image (3d tensor): 256 x 256 x 3\n",
@ -1060,7 +1058,7 @@
"Result (3d tensor): 256 x 256 x 3\n",
"```\n",
" \n",
"However, a 2d tensor of size 256 x 256 isn't compatible with our image.\n",
"However, a 2D tensor of size 256×256 isn't compatible with our image:\n",
"\n",
"```\n",
"Image (3d tensor): 256 x 256 x 3\n",
@ -1068,7 +1066,7 @@
"Error\n",
"```\n",
"\n",
"In the first examples we had with a `3x3` matrix and vector of size `3`, broadcast is done on the rows:\n",
"In our earlier examples we had with a 3×3 matrix and a vector of size 3, broadcasting was done on the rows:\n",
"\n",
"```\n",
"Matrix (2d tensor): 3 x 3\n",
@ -1076,14 +1074,14 @@
"Result (2d tensor): 3 x 3\n",
"```\n",
"\n",
"As a little exercise around those rules, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size `64 x 3 x 256 x 256` with vectors of three elements (one for the mean and one for the standard deviation)."
"As an exercise, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size `64 x 3 x 256 x 256` with vectors of three elements (one for the mean and one for the standard deviation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another useful thing for tensor manipulations is the use of Einstein summations."
"Another useful wat of simplifying tensor manipulations is the use of Einstein summations convention."
]
},
{
@ -1097,21 +1095,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Before using the PyTorch operation @ or `torch.matmul`, there is a last way we can implement this matrix multiplication: einstein summation (einsum). This is a compact representation for combining products and sums in a general way. We write an equation like this\n",
"Before using the PyTorch operation `@` or `torch.matmul`, there is one last way we can implement matrix multiplication: Einstein summation (`einsum`). This is a compact representation for combining products and sums in a general way. We write an equation like this:\n",
"\n",
"```\n",
"ik,kj -> ij\n",
"```\n",
"\n",
"The left hand side represents the operands dimensions, separated by commas. Here we have two tensors taht each have two dimensions (i,k and k,j). The right hand side represents the result dimensions, so here we have a tensor with two dimensions i,j. \n",
"The lefthand side represents the operands dimensions, separated by commas. Here we have two tensors that each have two dimensions (`i,k` and `k,j`). The righthand side represents the result dimensions, so here we have a tensor with two dimensions `i,j`. \n",
"\n",
"There are essentially three rules of Einstein summation notation, namely:\n",
"The rules of Einstein summation notation are as follows:\n",
"\n",
"1. Repeated indices are implicitly summed over.\n",
"1. Each index can appear at most twice in any term.\n",
"1. Each term must contain identical non-repeated indices.\n",
"1. Each term must contain identical nonrepeated indices.\n",
"\n",
"So in the example above, since `k` is repeated, we sum over that index. In the end the above formula represents the matrix obtained when we put in (i,j) the sum of all the coefficients (i,k) in the first tensor multiplied by the coefficients (k,j) in the second tensor... which is the matrix product!"
"So in our example, since `k` is repeated, we sum over that index. In the end the formula represents the matrix obtained when we put in `(i,j)` the sum of all the coefficients `(i,k)` in the first tensor multiplied by the coefficients `(k,j)` in the second tensor... which is the matrix product! Here is how we can code this in PyTorch:"
]
},
{
@ -1127,23 +1125,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Einstein summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have only one member in the left hand side. For instance\n",
"Einstein summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have just one member on the lefthand side. For instance, this:\n",
"\n",
"```python\n",
"torch.einsum('ij->ji', a)\n",
"```\n",
"\n",
"returns the transpose of the matrix `a`. You can also have three or more members:\n",
"returns the transpose of the matrix `a`. You can also have three or more members. This:\n",
"\n",
"```python\n",
"torch.einsum('bi,ij,bj->b', a, b, c)\n",
"```\n",
"\n",
"will return a vector of size `b` where the `k`-th coordinate is the sum of the `a[k,i] b[i,j] c[k,j]`. This notation is getting really convenient when you have more dimensions because of batches, for instance if you have two batches of matrices and want compute the matrix product per batch, you would go: \n",
"will return a vector of size `b` where the `k`-th coordinate is the sum of `a[k,i] b[i,j] c[k,j]`. This notation is particularly convenient when you have more dimensions because of batches. For example, if you have two batches of matrices and want compute the matrix product per batch, you would could this: \n",
"\n",
"```python\n",
"torch.einsum('bik,bkj->bij', a, b)\n",
"```"
"```\n",
"\n",
"Let's go back to our new `matmul` implementation using `einsum` and look at its speed:"
]
},
{
@ -1167,14 +1167,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we see, not only is it practical, but it's *very* fast. `einsum` is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA. (But it's generally not as fast as carefully optimized CUDA code, as you see in the matrix multiplication example)."
"As you can see, not only is it practical, but it's *very* fast. `einsum` is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA. (But it's generally not as fast as carefully optimized CUDA code, as you see from the results in \"Matrix Multiplication from Scratch\".)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we know how to implement a matrix multiplication from scratch, we are ready to build our neural net, specifically its forward and backward passes, just using matrix multiplications."
"Now that we know how to implement a matrix multiplication from scratch, we are ready to build our neural net—specifically its forward and backward passes—using just matrix multiplications."
]
},
{
@ -1188,7 +1188,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw in <<chapter_mnist_basics>>, to train it, we will need to compute all the gradients of a given a loss with respect to its parameters, which is known as the *backward pass*. The *forward pass* is computing the output of the model on a given input, which is just based on the matrix products we saw. As we define our first neural net, we will also delve in the problem of properly initializing the weights, which is crucial to make training start properly."
"As we saw in <<chapter_mnist_basics>>, to train a model, we will need to compute all the gradients of a given a loss with respect to its parameters, which is known as the *backward pass*. The *forward pass* is where we compute the output of the model on a given input, based on the matrix products. As we define our first neural net, we will also delve into the problem of properly initializing the weights, which is crucial for making training start properly."
]
},
{
@ -1202,7 +1202,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will take the example of a two-layer neural net first. As we saw, one layer can be expressed as `y = x @ w + b` with `x` out inputs, `y` our outputs, `w` the weights of the layer (which is of size number of inputs by neuron of neurons if we don't transpose like before) and `b` is the bias vector. "
"We will take the example of a two-layer neural net first. As we've seen, one layer can be expressed as `y = x @ w + b`, with `x` our inputs, `y` our outputs, `w` the weights of the layer (which is of size number of inputs by number of neurons if we don't transpose like before), and `b` is the bias vector:"
]
},
{
@ -1218,9 +1218,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can stack two layers on top of the other, but since mathematically, the composition of two linear operations is another linear operation, this only makes sense if we put something non-linear in the middle called an activation function. The activation function most popularly used is a ReLU, which, as we saw, is just the maximum of `x` and `0`. \n",
"We can stack the second layer on top of the first, but since mathematically the composition of two linear operations is another linear operation, this only makes sense if we put something nonlinear in the middle, called an activation function. As mentioned at the beginning of the chapter, in deep learning applications the activation function most commonly used is a ReLU, which returns the maximum of `x` and `0`. \n",
"\n",
"We won't actually train our model in this chapter so we use random tensors for our inputs and targets. Let's say our inputs are 200 vectors of size 100, which we group in one batch, and our targets are 200 random floats."
"We won't actually train our model in this chapter, so we'll use random tensors for our inputs and targets. Let's say our inputs are 200 vectors of size 100, which we group into one batch, and our targets are 200 random floats:"
]
},
{
@ -1237,7 +1237,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For our two-layers model we will need two weight matrices and two bias vectors. Let's say we have a hidden size of 50 and the output size is 1 (for one of our input, the corresponding output is one float in this toy example). We initialize the weights randomly and the bias at zero. "
"For our two-layer model we will need two weight matrices and two bias vectors. Let's say we have a hidden size of 50 and the output size is 1 (for one of our inputs, the corresponding output is one float in this toy example). We initialize the weights randomly and the bias at zero:"
]
},
{
@ -1284,9 +1284,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this formula works with our batch of inputs, and returns a batch of hidden state: `l1` is a matrix of 200 (our batch size) by 50 (our hidden size).\n",
"Note that this formula works with our batch of inputs, and returns a batch of hidden state: `l1` is a matrix of size 200 (our batch size) by 50 (our hidden size).\n",
"\n",
"There is a problem with the way our model was initialized however. To understand it, we need to look at the mean and standard deviation (std) of `l1`."
"There is a problem with the way our model was initialized, however. To understand it, we need to look at the mean and standard deviation (std) of `l1`:"
]
},
{
@ -1313,9 +1313,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The mean is close to zero, which is understandable since both our input and weight matrix have a mean close to zero. However the standard deviation, which represents how far away our activation go from the mean, went from 1 to 10. This is a really big problem because that's with just one layer. Modern neural nets can have hundred of layers, so if each of them multiply the scale of our activations by 10, by the end of the last layer we won't have numbers representable by a computer.\n",
"The mean is close to zero, which is understandable since both our input and weight matrices have means close to zero. But the standard deviation, which represents how far away our activations go from the mean, went from 1 to 10. This is a really big problem because that's with just one layer. Modern neural nets can have hundred of layers, so if each of them multiplies the scale of our activations by 10, by the end of the last layer we won't have numbers representable by a computer.\n",
"\n",
"Indeed, if we make just 50 multiplications between x and random matrices of size 100 x 100:"
"Indeed, if we make just 50 multiplications between `x` and random matrices of size 100×100, we'll have:"
]
},
{
@ -1348,7 +1348,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is nans everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights. But if we use too small weights we will have the opposite problem: the scale of our activations will get from 1 to 0.1 and after 100 layers, we'll be left with zeros everywhere."
"The result is `nan`s everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights? But if we use too small weights, we will have the opposite problem—the scale of our activations will go from 1 to 0.1, and after 100 layers we'll be left with zeros everywhere:"
]
},
{
@ -1381,7 +1381,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So we have to scale our weights matrices exactly right so that the standard deviation of our activations stays at 1. We can compute the exact value mathematically, and this has been done by Xavier Glorot and Yoshua Bengio in [Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). The right scale for a given layer is $1/\\sqrt{n_{in}}$, where $n_{in}$ represents the number of inputs.\n",
"So we have to scale our weight matrices exactly right so that the standard deviation of our activations stays at 1. We can compute the exact value to use mathematically, as illustrated by Xavier Glorot and Yoshua Bengio in [\"Understanding the Difficulty of Training Deep Feedforward Neural Networks\"](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). The right scale for a given layer is $1/\\sqrt{n_{in}}$, where $n_{in}$ represents the number of inputs.\n",
"\n",
"In our case, if we have 100 inputs, we should scale our weight matrices by 0.1:"
]
@ -1416,7 +1416,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally some numbers that are neither zeros nor infinity! Notice how stable the scale of our activations is, even after those 50 fake layers:"
"Finally some numbers that are neither zeros nor `nan`s! Notice how stable the scale of our activations is, even after those 50 fake layers:"
]
},
{
@ -1443,7 +1443,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can play a little bit with the values of the scale and notice that even a slight variation from 0.1 will get you either to very small or very large numbers, so initializing the weights properly is extremely important. Let's go back to our neural net. Since we messed a bit with our inputs we need to redefine them:"
"If you play a little bit with the value for scale you'll notice that even a slight variation from 0.1 will get you either to very small or very large numbers, so initializing the weights properly is extremely important. \n",
"\n",
"Let's go back to our neural net. Since we messed a bit with our inputs, we need to redefine them:"
]
},
{
@ -1460,7 +1462,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"and for our weights, we use the right scale, which is known as *Xavier initialization* (or *Glorot initialization*)."
"And for our weights, we'll use the right scale, which is known as *Xavier initialization* (or *Glorot initialization*):"
]
},
{
@ -1480,7 +1482,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now if compute the result of the first layer, we can check the mean and standard deviation are under control:"
"Now if we compute the result of the first layer, we can check that the mean and standard deviation are under control:"
]
},
{
@ -1508,7 +1510,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Very good, now we need to go through a relu, so let's define one. A relu removes the negatives and replace them by 0, which is another way of saying it clamps our tensor at 0."
"Very good. Now we need to go through a ReLU, so let's define one. A ReLU removes the negatives and replaces them with zeros, which is another way of saying it clamps our tensor at zero:"
]
},
{
@ -1524,7 +1526,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's make our activations go through a relu"
"We pass our activations through this:"
]
},
{
@ -1552,7 +1554,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And we're back to square one: the mean of our activation has gone to 0.4 (which is understandable since we removed the negatives) and the std went down to 0.5-0.6. So like before, after a few layers we will probably get to 0:"
"And we're back to square one: the mean of our activations has gone to 0.4 (which is understandable since we removed the negatives) and the std went down to 0.58. So like before, after a few layers we will probably wind up with zeros:"
]
},
{
@ -1585,7 +1587,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So our initialization wasn't right. This is because at the time the previous article was written, the popular activation in a neural net was the hyperbolic tangent (which is the one they use) and that initialization doesn't account for our ReLU. Fortunately someone else has done the math for us and computed the right scale we should use. Kaiming He et al. in [Delving Deep into Rectifiers: Surpassing Human-Level Performance](https://arxiv.org/abs/1502.01852) (which we've seen before--it's the article that introduced the ResNet) show we should use the following scale instead: $\\sqrt{2 / n_{in}}$ where $n_{in}$ is the number of inputs of our model."
"This means our initialization wasn't right. Why? At the time Glorot and Bengio wrote their article, the popular activation in a neural net was the hyperbolic tangent (tanh, which is the one they used), and that initialization doesn't account for our ReLU. Fortunately, someone else has done the math for us and computed the right scale for us to use. In [\"Delving Deep into Rectifiers: Surpassing Human-Level Performance\"](https://arxiv.org/abs/1502.01852) (which we've seen before—it's the article that introduced the ResNet), Kaiming He et al. show that we should use the following scale instead: $\\sqrt{2 / n_{in}}$, where $n_{in}$ is the number of inputs of our model. Let's see what this gives us:"
]
},
{
@ -1618,7 +1620,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And indeed if we use it we can check our numbers aren't all zeroed this time. So let's go back to the definition of our neural net and use this initialization (which is named *Kaiming initialization* or *He initialization*)."
"That's better: our numbers aren't all zeroed this time. So let's go back to the definition of our neural net and use this initialization (which is named *Kaiming initialization* or *He initialization*):"
]
},
{
@ -1647,7 +1649,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now after going through the first linear layer and relu, let's look at the scale of our activations:"
"Let's look at the scale of our activations after going through the first linear layer and ReLU:"
]
},
{
@ -1676,7 +1678,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that our weights are properly initialized, we can define our whole model:"
"Much better! Now that our weights are properly initialized, we can define our whole model:"
]
},
{
@ -1696,9 +1698,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the forward pass, now all there is left to do is to compare our output to the labels we have (random numbers, in this example) with a loss function. In this case, we will use the mean squared error. (It's a toy problem in any case and this is the easiest loss function to use for what is next, computing the gradients.)\n",
"This is the forward pass. Now all that's left to do is to compare our output to the labels we have (random numbers, in this example) with a loss function. In this case, we will use the mean squared error. (It's a toy problem, and this is the easiest loss function to use for what is next, computing the gradients.)\n",
"\n",
"The only subtlety is that our output and target don't have exactly the same shape: after going though the model, we get an output like this."
"The only subtlety is that our outputs and targets don't have exactly the same shape—after going though the model, we get an output like this:"
]
},
{
@ -1726,7 +1728,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To get rid of this trailing 1 dimension, we use the `squeeze` function."
"To get rid of this trailing 1 dimension, we use the `squeeze` function:"
]
},
{
@ -1742,7 +1744,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we are ready to compute our loss."
"And now we are ready to compute our loss:"
]
},
{
@ -1758,23 +1760,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"That is all for the forward pass, let now look at the gradients."
"That's all for the forward pass—let's now look at the gradients."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gradients and Backward Pass"
"### Gradients and the Backward Pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've seen that PyTorch computes all the gradient we need with a magic call to `loss.backward()` but how is it done behind the scenes?\n",
"We've seen that PyTorch computes all the gradient we need with a magic call to `loss.backward`, but let's explore what's happening behind the scenes.\n",
"\n",
"Now comes the part where we need to compute the gradients of the loss with respect to all the weights of our model, so all the floats in `w1`, `b1`, `w2` and `b2`. For this, we will need a bit of math, specifically the chain rule. If you don't remember it from high school, this is the rule of calculus that guides how we can compute the derivative of a composed function:\n",
"Now comes the part where we need to compute the gradients of the loss with respect to all the weights of our model, so all the floats in `w1`, `b1`, `w2`, and `b2`. For this, we will need a bit of math—specifically the *chain rule*. This is the rule of calculus that guides how we can compute the derivative of a composed function:\n",
"\n",
"$$(g \\circ f)'(x) = g'(f(x)) f'(x)$$"
]
@ -1790,22 +1792,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our loss is a big composition of different functions: mean-squared error (which is in turn the composition of a mean and a power of two), the second linear layer, a relu and the first linear layer. For instance, we want the gradients of the loss with respect to `b2` and our loss is defined by\n",
"Our loss is a big composition of different functions: mean squared error (which is in turn the composition of a mean and a power of two), the second linear layer, a ReLU and the first linear layer. For instance, if we want the gradients of the loss with respect to `b2` and our loss is defined by:\n",
"\n",
"```\n",
"loss = mse(out,y) = mse(lin(l2, w2, b2), y)\n",
"```\n",
"\n",
"The chain rule tells us that we have\n",
"The chain rule tells us that we have:\n",
"$$\\frac{\\text{d} loss}{\\text{d} b_{2}} = \\frac{\\text{d} loss}{\\text{d} out} \\times \\frac{\\text{d} out}{\\text{d} b_{2}} = \\frac{\\text{d}}{\\text{d} out} mse(out, y) \\times \\frac{\\text{d}}{\\text{d} b_{2}} lin(l_{2}, w_{2}, b_{2})$$\n",
"\n",
"To compute the gradients of the loss with respect to $b_{2}$, we first need the gradients of the loss with respect to our output $out$. It's the same if we want the gradients of the loss with respect to $w_{2}$. Then, to get the gradients of the loss with respect to $b_{1}$ or $w_{1}$, we will need the gradients of the loss with respect to $l_{1}$, which in turn requires the gradients of the loss with respect to $l_{2}$, which will need the gradients of the loss with respect to $out$.\n",
"\n",
"So to compute all the gradients we need for the update, we need to begin from the output of the model and work our way *backward*, one layer after the other, which is why this step is known as *backpropagation*. We can automate it by having each function we implemented (`relu`, `mse`, `lin`) provide its backward step, that is how to derive the gradients of the loss with respect to the input(s) from the gradient of the loss with respect to the output.\n",
"So to compute all the gradients we need for the update, we need to begin from the output of the model and work our way *backward*, one layer after the otherwhich is why this step is known as *backpropagation*. We can automate it by having each function we implemented (`relu`, `mse`, `lin`) provide its backward step: that is, how to derive the gradients of the loss with respect to the input(s) from the gradients of the loss with respect to the output.\n",
"\n",
"Here we populate those gradients in an attribute of each tensor, a bit like PyTorch does with `.grad`. \n",
"\n",
"The first are the gradients of the loss with respect to the output of our model (which is the input of the loss function). We have to undo the squeeze we did in `mse` then we use the formula that gives us the derivative of $x^{2}$: $2x$. The derivative of the mean is just 1/n where n is the number of elements in our input."
"The first are the gradients of the loss with respect to the output of our model (which is the input of the loss function). We undo the `squeeze` we did in `mse`, then we use the formula that gives us the derivative of $x^{2}$: $2x$. The derivative of the mean is just $1/n$ where $n$ is the number of elements in our input:"
]
},
{
@ -1823,7 +1825,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For the gradients of the relu and our linear layer, we use the gradients of the loss with respect to the output (in `out.g`) and apply the chain rule to compute the gradients of the loss with respect to the output (in `inp.g`). The chain rule tells us that `inp.g = relu'(inp) * out.g`. The derivative of relu is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:"
"For the gradients of the ReLU and our linear layer, we use the gradients of the loss with respect to the output (in `out.g`) and apply the chain rule to compute the gradients of the loss with respect to the output (in `inp.g`). The chain rule tells us that `inp.g = relu'(inp) * out.g`. The derivative of `relu` is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:"
]
},
{
@ -1841,7 +1843,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The scheme is the same to compute the gradients of the loss with respect to the inputs, weights and bias in the linear layer. We won't linger on the mathematical formulas that define them since they're not important for our purposes--but do check out Khan Academy's excellent calculus lessons if you're interested in this topic."
"The scheme is the same to compute the gradients of the loss with respect to the inputs, weights, and bias in the linear layer:"
]
},
{
@ -1857,6 +1859,13 @@
" b.g = out.g.sum(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We won't linger on the mathematical formulas that define them since they're not important for our purposes, but do check out Khan Academy's excellent calculus lessons if you're interested in this topic."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1868,7 +1877,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"An extremely useful library for working with calculus is *SymPy*. SymPy is a library for symbolic computation, which is defined in the SymPy documentation:"
"SymPy is a library for symbolic computation that is extremely useful library when working with calculus. Per the [documentation](https://docs.sympy.org/latest/tutorial/intro.html):"
]
},
{
@ -1882,7 +1891,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To do symbolic computation, first define a *symbol*, and then do a computation, like so:"
"To do symbolic computation, we first define a *symbol*, and then do a computation, like so:"
]
},
{
@ -1914,7 +1923,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, SymPy has taken the derivative of `x**2` for us! SymPy can take the derivative of complicated compound expressions, and can also simplify and factor equations, and much more. There's really not much reason for anyone to do calculus manually nowadays--for calculating gradients, PyTorch does it for us, and for showing the equation, SymPy does it for us!"
"Here, SymPy has taken the derivative of `x**2` for us! It can take the derivative of complicated compound expressions, simplify and factor equations, and much more. There's really not much reason for anyone to do calculus manually nowadaysfor calculating gradients, PyTorch does it for us, and for showing the equations, SymPy does it for us!"
]
},
{
@ -1928,7 +1937,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have have defined those functions we can use them to write the backward pass. Since each gradient is automatically populated in the right tensor, we don't need to store the results of those `_grad` functions anywhere, we just need to execute them in the reverse order as the forward pass, to make sure that in each function, `out.g` exists."
"Once we have have defined those functions, we can use them to write the backward pass. Since each gradient is automatically populated in the right tensor, we don't need to store the results of those `_grad` functions anywhere—we just need to execute them in the reverse order of the forward pass, to make sure that in each function `out.g` exists:"
]
},
{
@ -1956,28 +1965,28 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we can access to the gradients of our model parameters in `w1.g`, `b1.g`, `w2.g`, `b2.g`."
"And now we can access the gradients of our model parameters in `w1.g`, `b1.g`, `w2.g`, and `b2.g`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have sucessfuly defined our model, now let's make it a bit more like a PyTorch module."
"We have sucessfuly defined our modelnow let's make it a bit more like a PyTorch module."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Refactor the Model"
"### Refactoring the Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three functions we used have two associated functions: a forward pass and a backward pass. Instead of writing them separately, we can create a class to wrap them together. That class can also store the inputs and outputs for the backward pass, this way we will just have to call `backward()`."
"The three functions we used have two associated functions: a forward pass and a backward pass. Instead of writing them separately, we can create a class to wrap them together. That class can also store the inputs and outputs for the backward pass. This way, we will just have to call `backward`:"
]
},
{
@ -1999,7 +2008,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `__call__` name is a magic name in PyThon that will make our class callable. This what will be executed when we type `y = Relu()(x)`. We can do the same for our linear layer and the MSE loss."
"`__call__` is a magic name in Python that will make our class callable. This is what will be executed when we type `y = Relu()(x)`. We can do the same for our linear layer and the MSE loss:"
]
},
{
@ -2044,7 +2053,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can put everything in a model that we initiate with our tensors `w1`, `b1`, `w2`, `b2`."
"Then we can put everything in a model that we initiate with our tensors `w1`, `b1`, `w2`, `b2`:"
]
},
{
@ -2071,7 +2080,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"What is really nice about this refactoring and registering things as layers of our model is that the forward and backward pass are now really easy to write. If we want to instantiate our model, we just need to write:"
"What is really nice about this refactoring and registering things as layers of our model is that the forward and backward passes are now really easy to write. If we want to instantiate our model, we just need to write:"
]
},
{
@ -2087,7 +2096,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The forward pass would then be executed with:"
"The forward pass can then be executed with:"
]
},
{
@ -2126,7 +2135,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The three classes we wrote for `Lin`, `Mse` and `Relu` have a lot in common, so we could make them all inherit from the same basic class."
"The `Lin`, `Mse` and `Relu` classes we wrote have a lot in common, so we could make them all inherit from the same base class:"
]
},
{
@ -2150,7 +2159,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we just need to implement `forward` and `bwd` in each of our subclass."
"Then we just need to implement `forward` and `bwd` in each of our subclasses:"
]
},
{
@ -2197,9 +2206,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then our model can be the same as before. This is getting closer and closer to what PyTorch does. Each basic function we need to differentiate is written as a `torch.autograd.Function` object that has a `forward` and a `backward` method. PyTorch will then keep trace of any computation we do to be able to properly run the backward pass unless we set the `requires_grad` attribute of our tensors to `False`.\n",
"The rest of our model can be the same as before. This is getting closer and closer to what PyTorch does. Each basic function we need to differentiate is written as a `torch.autograd.Function` object that has a `forward` and a `backward` method. PyTorch will then keep trace of any computation we do to be able to properly run the backward pass, unless we set the `requires_grad` attribute of our tensors to `False`.\n",
"\n",
"Writing one is (almost) as easy as we had before. The difference is that we choose what to save and what to put in a context variable (so that we make sure we don't save anything we don't need) and that we return the gradients in the `backward` pass. It's very rare to have to write your own `Function` but if you ever need something exotic or want to mess with the gradients of a regular function, here is how we write one:"
"Writing one of these is (almost) as easy as writing our original classes. The difference is that we choose what to save and what to put in a context variable (so that we make sure we don't save anything we don't need), and we return the gradients in the `backward` pass. It's very rare to have to write your own `Function` but if you ever need something exotic or want to mess with the gradients of a regular function, here is how to write one:"
]
},
{
@ -2227,12 +2236,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then the structure used to build a more complex model that takes advantage of those functions is a `torch.nn.Module`. This is the base structure for all models and all the neural nets you have seen up until now where from that class. It mostly helps to register all the trainable parameters, which as we've seen can be used in the training loop.\n",
"The structure used to build a more complex model that takes advantage of those `Function`s is a `torch.nn.Module`. This is the base structure for all models, and all the neural nets you have seen up until now were from that class. It mostly helps to register all the trainable parameters, which as we've seen can be used in the training loop.\n",
"\n",
"To implement a `nn.Module` you just need to\n",
"To implement an `nn.Module` you just need to:\n",
"\n",
"- Make sure the superclass `__init__` is called first when you initiliaze it,\n",
"- Define any parameter of the model as attributes with `nn.Parameter`,\n",
"- Make sure the superclass `__init__` is called first when you initiliaze it.\n",
"- Define any parameters of the model as attributes with `nn.Parameter`.\n",
"- Define a `forward` function that returns the output of your model.\n",
"\n",
"As an example, here is the linear layer from scratch:"
@ -2292,7 +2301,7 @@
"\n",
"Note that in PyTorch, the weights are stored as an `n_out x n_in` matrix, which is why we have the transpose in the forward pass.\n",
"\n",
"By using the linear layer from PyTorch (which uses the Kaiming initialization as well), the model we have seen during this chapter can be written like this:"
"By using the linear layer from PyTorch (which uses the Kaiming initialization as well), the model we have been building up during this chapter can be written like this:"
]
},
{
@ -2315,7 +2324,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"fastai provides its own variant of `Module` which is identical to `nn.Module`, but doesn't require you to call `super().__init__()` (it does that for you automatically):"
"fastai provides its own variant of `Module` that is identical to `nn.Module`, but doesn't require you to call `super().__init__()` (it does that for you automatically):"
]
},
{
@ -2337,7 +2346,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the last chapter, we will start from such a model and see how we build a training loop from scratch and refactor it to what we've been using in previous chapters."
"In the last chapter, we will start from such a model and see how to build a training loop from scratch and refactor it to what we've been using in previous chapters."
]
},
{
@ -2351,14 +2360,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have looked at the foundations of deep learning, beginning with matrix multiplication and implement the forward and backward passes of a neural net from scratch. We thenr efactored it to show how PyTorch works beneath the hood.\n",
"In this chapter we explored the foundations of deep learning, beginning with matrix multiplication and moving on to implementing the forward and backward passes of a neural net from scratch. We then refactored our code to show how PyTorch works beneath the hood.\n",
"\n",
"Here are a few things to remember:\n",
"\n",
"- A neural net is basically a bunch of matrix multiplications with non-linearities in-between.\n",
"- Python is slow so to write fast code we have to vectorize it and take advantage of element-wise arithmetic or broadcasting.\n",
"- Two tensors are broadcastable if the dimensions starting from the end and going backward match (they are the same or one of them is 1). To make tensors broadcastable, we may need to add dimensions of size 1 with `unsqueeze` or a `None` index.\n",
"- Properly initializing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU non-linearities.\n",
"- A neural net is basically a bunch of matrix multiplications with nonlinearities in between.\n",
"- Python is slow, so to write fast code we have to vectorize it and take advantage of techniques such as elementwise arithmetic and broadcasting.\n",
"- Two tensors are broadcastable if the dimensions starting from the end and going backward match (if they are the same, or one of them is 1). To make tensors broadcastable, we may need to add dimensions of size 1 with `unsqueeze` or a `None` index.\n",
"- Properly initializing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU nonlinearities.\n",
"- The backward pass is the chain rule applied multiple times, computing the gradients from the output of our model and going back, one layer at a time.\n",
"- When subclassing `nn.Module` (if not using fastai's `Module`) we have to call the superclass `__init__` method in our `__init__` method and we have to define a `forward` function that takes an input and returns the desired result."
]
@ -2377,41 +2386,41 @@
"1. Write the Python code to implement a single neuron.\n",
"1. Write the Python code to implement ReLU.\n",
"1. Write the Python code for a dense layer in terms of matrix multiplication.\n",
"1. Write the Python code for a dense layer in plain Python (that is with list comprehensions and functionality built into Python).\n",
"1. What is the hidden size of a layer?\n",
"1. What does the `t` method to in PyTorch?\n",
"1. Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python).\n",
"1. What is the \"hidden size\" of a layer?\n",
"1. What does the `t` method do in PyTorch?\n",
"1. Why is matrix multiplication written in plain Python very slow?\n",
"1. In matmul, why is `ac==br`?\n",
"1. In Jupyter notebook, how do you measure the time taken for a single cell to execute?\n",
"1. What is elementwise arithmetic?\n",
"1. In `matmul`, why is `ac==br`?\n",
"1. In Jupyter Notebook, how do you measure the time taken for a single cell to execute?\n",
"1. What is \"elementwise arithmetic\"?\n",
"1. Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`.\n",
"1. What is a rank-0 tensor? How do you convert it to a plain Python data type?\n",
"1. What does this return, and why?: `tensor([1,2]) + tensor([1])`\n",
"1. What does this return, and why?: `tensor([1,2]) + tensor([1,2,3])`\n",
"1. How does elementwise arithmetic help us speed up matmul?\n",
"1. What does this return, and why? `tensor([1,2]) + tensor([1])`\n",
"1. What does this return, and why? `tensor([1,2]) + tensor([1,2,3])`\n",
"1. How does elementwise arithmetic help us speed up `matmul`?\n",
"1. What are the broadcasting rules?\n",
"1. What is `expand_as`? Show an example of how it can be used to match the results of broadcasting.\n",
"1. How does `unsqueeze` help us to solve certain broadcasting problems?\n",
"1. How can you use indexing to do the same operation as `unsqueeze`?\n",
"1. How can we use indexing to do the same operation as `unsqueeze`?\n",
"1. How do we show the actual contents of the memory used for a tensor?\n",
"1. When adding a vector of size 3 to a matrix of size 3 x 3, are the elements of the vector added to each row, or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)\n",
"1. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)\n",
"1. Do broadcasting and `expand_as` result in increased memory use? Why or why not?\n",
"1. Implement matmul using Einstein summation.\n",
"1. Implement `matmul` using Einstein summation.\n",
"1. What does a repeated index letter represent on the left-hand side of einsum?\n",
"1. What are the three rules of Einstein summation notation? Why?\n",
"1. What is the forward pass, and the backward pass, of a neural network?\n",
"1. What are the forward pass and backward pass of a neural network?\n",
"1. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?\n",
"1. What is the downside of having activations with a standard deviation too far away from one?\n",
"1. How can weight initialisation help avoid this problem?\n",
"1. What is the formula to initialise weights such that we get a standard deviation of one, for a plain linear layer; for a linear layer followed by ReLU?\n",
"1. What is the downside of having activations with a standard deviation too far away from 1?\n",
"1. How can weight initialization help avoid this problem?\n",
"1. What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU?\n",
"1. Why do we sometimes have to use the `squeeze` method in loss functions?\n",
"1. What does the argument to the squeeze method do? Why might it be important to include this argument, even though PyTorch does not require it?\n",
"1. What is the chain rule? Show the equation in either of the two forms shown in this chapter.\n",
"1. What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it?\n",
"1. What is the \"chain rule\"? Show the equation in either of the two forms presented in this chapter.\n",
"1. Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule.\n",
"1. What is the gradient of relu? Show in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.)\n",
"1. What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.)\n",
"1. In what order do we need to call the `*_grad` functions in the backward pass? Why?\n",
"1. What is `__call__`?\n",
"1. What methods do we need to implement when writing a `torch.autograd.Function`?\n",
"1. What methods must we implement when writing a `torch.autograd.Function`?\n",
"1. Write `nn.Linear` from scratch, and test it works.\n",
"1. What is the difference between `nn.Module` and fastai's `Module`?"
]
@ -2427,10 +2436,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Implement relu as a `torch.autograd.Function` and train a model with it.\n",
"1. If you are mathematically inclined, find out what the gradients of a linear layer are in maths notation. Map that to the implementation we saw in this chapter.\n",
"1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2d convolution function, and train a CNN that uses it.\n",
"1. Implement all what is in this chapter using numpy instead of PyTorch. "
"1. Implement ReLU as a `torch.autograd.Function` and train a model with it.\n",
"1. If you are mathematically inclined, find out what the gradients of a linear layer are in mathematical notation. Map that to the implementation we saw in this chapter.\n",
"1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2D convolution function. Then train a CNN that uses it.\n",
"1. Implement everything in this chapter using NumPy instead of PyTorch. "
]
},
{

File diff suppressed because one or more lines are too long

View File

@ -14,23 +14,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# fastai Learner from Scratch"
"# A fastai Learner from Scratch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This final chapter (other than the conclusion, and the online chapters) is going to look a bit different. We will have far more code, and far less prose than previous chapters. We will introduce new Python keywords and libraries without discussing them. This chapter is meant to be the start of a significant research project for you. You see, we are going to implement many of the key pieces of the fastai and PyTorch APIs from scratch, building on nothing other than the components that we developed in <<chapter_foundations>>! The key goal here is to end up with our own `Learner` class, and some callbacks--enough to be able to train a model on Imagenette, including examples of each of the key techniques we've studied. On the way to building `Learner`, we will be creating `Module`, `Parameter`, and even our own parallel `DataLoader`… and much more.\n",
"This final chapter (other than the conclusion and the online chapters) is going to look a bit different. It contains far more code and far less prose than the previous chapters. We will introduce new Python keywords and libraries without discussing them. This chapter is meant to be the start of a significant research project for you. You see, we are going to implement many of the key pieces of the fastai and PyTorch APIs from scratch, building on nothing other than the components that we developed in <<chapter_foundations>>! The key goal here is to end up with your own `Learner` class, and some callbacks—enough to be able to train a model on Imagenette, including examples of each of the key techniques we've studied. On the way to building `Learner`, we will create our own version of `Module`, `Parameter`, and parallel `DataLoader` so you have a very good idea of what those PyTorch classes do.\n",
"\n",
"The end of chapter questionnaire is particularly important for this chapter. This is where we will be getting you started on the many interesting directions that you could take, using this chapter as your starting out point. What we are really saying is: follow through with this chapter on your computer, not on paper, and do lots of experiments, web searches, and whatever else you need to understand what's going on. You've built up the skills and expertise to do this in the rest of this book, so we think you are going to go great!"
"The end-of-chapter questionnaire is particularly important for this chapter. This is where we will be pointing you in the many interesting directions that you could take, using this chapter as your starting point. We suggest that you follow t along with this chapter on your computer, and do lots of experiments, web searches, and whatever else you need to understand what's going on. You've built up the skills and expertise to do this in the rest of this book, so we think you are going to do great!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First let's start with gathering (manually) the data."
"Let's begin by gathering (manually) some data."
]
},
{
@ -88,7 +88,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...but you could do the same thing using just Python's standard library, with `glob`:"
"Or we could do the same thing using just Python's standard library, with `glob`:"
]
},
{
@ -169,7 +169,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"That's going to be the basis of our independent variable. For our dependent variable, we can use `Path.parent` from pathlib. First we'll need our vocab:"
"That's going to be the basis of our independent variable. For our dependent variable, we can use `Path.parent` from `pathlib`. First we'll need our vocab:"
]
},
{
@ -246,7 +246,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A `Dataset` in PyTorch can be anything which supports indexing (`__getitem__`) and `len`:"
"A `Dataset` in PyTorch can be anything that supports indexing (`__getitem__`) and `len`:"
]
},
{
@ -348,7 +348,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see, our dataset is returning the independent and dependent variable as a tuple, which is just what we need. We'll need to be able to collate these into a mini-batch. Generally this is done with `torch.stack`, which is what we'll use here:"
"As you see, our dataset is returning the independent and dependent variables as a tuple, which is just what we need. We'll need to be able to collate these into a mini-batch. Generally this is done with `torch.stack`, which is what we'll use here:"
]
},
{
@ -394,7 +394,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a dataset and a collation function, we're ready to create `DataLoader`. We'll add two more things here: optional `shuffle` for the training set, and `ProcessPoolExecutor` to do our preprocessing in parallel. A parallel data loader is very important, because opening and decoding a jpeg image is a slow process. One CPU core is not enough to decode images fast enough to keep a modern GPU busy."
"Now that we have a dataset and a collation function, we're ready to create `DataLoader`. We'll add two more things here: an optional `shuffle` for the training set, and a `ProcessPoolExecutor` to do our preprocessing in parallel. A parallel data loader is very important, because opening and decoding a JPEG image is a slow process. One CPU core is not enough to decode images fast enough to keep a modern GPU busy. Here's our `DataLoader` class:"
]
},
{
@ -421,7 +421,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it out with our training and validation datasets."
"Let's try it out with our training and validation datasets:"
]
},
{
@ -452,7 +452,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The speed of this data loader is not much slower than PyTorch's, but it's far simpler. So if you're debugging a complex data loading process, don't be afraid to try doing things manually to help see exactly what's going on.\n",
"This data loader is not much slower than PyTorch's, but it's far simpler. So if you're debugging a complex data loading process, don't be afraid to try doing things manually to help you see exactly what's going on.\n",
"\n",
"For normalization, we'll need image statistics. Generally it's fine to calculate these on a single training mini-batch, since precision isn't needed here:"
]
@ -482,7 +482,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our `Normalize` class just needs to store these stats and apply them. To see why the `to_device` is needed, try commenting it out, and see what happens later in this notebook."
"Our `Normalize` class just needs to store these stats and apply them (to see why the `to_device` is needed, try commenting it out, and see what happens later in this notebook):"
]
},
{
@ -541,7 +541,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here `tfm_x` isn't just applying `Normalize`, but is also permuting the axis order from `NHWC` to `NCHW` (see <<chapter_convolutions>> if you need a reminder what these acronyms refer to). PIL uses `HWC` axis order, which we can't use with PyTorch, hence the need for this `permute`."
"Here `tfm_x` isn't just applying `Normalize`, but is also permuting the axis order from `NHWC` to `NCHW` (see <<chapter_convolutions>> if you need a reminder of what these acronyms refer to). PIL uses `HWC` axis order, which we can't use with PyTorch, hence the need for this `permute`."
]
},
{
@ -562,7 +562,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To create a model, we'll need `Module`. To create `Module`, we'll need `Parameter`, so let's start there. Recall that in <<chapter_collab>> we said that `Parameter` \"this class doesn't actually add any functionality (other than automatically calling `requires_grad_()` for us). It's only used as a 'marker' to show what to include in `parameters()`\". Here's a definition which does exactly that:"
"To create a model, we'll need `Module`. To create `Module`, we'll need `Parameter`, so let's start there. Recall that in <<chapter_collab>> we said that the `Parameter` class \"doesn't actually add any functionality (other than automatically calling `requires_grad_` for us). It's only used as a \"marker\" to show what to include in `parameters`.\" Here's a definition which does exactly that:"
]
},
{
@ -580,7 +580,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The implementation here is a bit awkward: we have to define the special `__new__` Python method, and use the internal PyTorch method `_make_subclass` because, as at the time of writing, PyTorch doesn't otherwise work correctly with this kind of subclassing, or provide an officially supported API to do this. This may be fixed by the time you read this book, so look on the book website to see if there are updated details.\n",
"The implementation here is a bit awkward: we have to define the special `__new__` Python method and use the internal PyTorch method `_make_subclass` because, as at the time of writing, PyTorch doesn't otherwise work correctly with this kind of subclassing or provide an officially supported API to do this. This may have been fixed by the time you read this, so look on the book's website to see if there are updated details.\n",
"\n",
"Our `Parameter` now behaves just like a tensor, as we wanted:"
]
@ -653,17 +653,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The key functionality is in the definition of `parameters()`:\n",
"The key functionality is in the definition of `parameters`:\n",
"\n",
" self.params + sum([m.parameters() for m in self.children], [])\n",
"```python\n",
"self.params + sum([m.parameters() for m in self.children], [])\n",
"```\n",
"\n",
"This means that we can ask any `Module` for its parameters, and it will return them, including all children modules (recursively). But how does it know what its parameters are? It's thanks to implementing Python's special `__setattr__` method, which is called for us any time Python sets an attribute on a class. Our implementation includes this line:\n",
"This means that we can ask any `Module` for its parameters, and it will return them, including for all its child modules (recursively). But how does it know what its parameters are? It's thanks to implementing Python's special `__setattr__` method, which is called for us any time Python sets an attribute on a class. Our implementation includes this line:\n",
"\n",
" if isinstance(v,Parameter): self.register_parameters(v)\n",
"```python\n",
"if isinstance(v,Parameter): self.register_parameters(v)\n",
"```\n",
"\n",
"As you see, this is where we use our new `Parameter` class as a \"marker\"--anything of this class is added to our `params`.\n",
"As you see, this is where we use our new `Parameter` class as a \"marker\"anything of this class is added to our `params`.\n",
"\n",
"Python's `__call__` allows us to define what happens when our object is treated as a function; we just call `forward` (which doesn't exist here, so it'll need to be added by subclasses). Before we do, we'll call a `hook`, if it's defined. Now you can see that PyTorch hooks aren't doing anything fancy at all--they're just calling any hooks have been registered.\n",
"Python's `__call__` allows us to define what happens when our object is treated as a function; we just call `forward` (which doesn't exist here, so it'll need to be added by subclasses). Before we do, we'll call a hook, if it's defined. Now you can see that PyTorch hooks aren't doing anything fancy at allthey're just calling any hooks have been registered.\n",
"\n",
"Other than these pieces of functionality, our `Module` also provides `cuda` and `training` attributes, which we'll use shortly.\n",
"\n",
@ -695,7 +699,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're not implementing `F.conv2d` from scratch, since you should have already done that (using `unfold`) in the questionnaire in <<chapter_foundations>>, but instead we're just creating a small class that wraps it up, along with bias, and weight initialization. Let's check that it works correctly with `Module.parameters()`:"
"We're not implementing `F.conv2d` from scratch, since you should have already done that (using `unfold`) in the questionnaire in <<chapter_foundations>>. Instead, we're just creating a small class that wraps it up along with bias and weight initialization. Let's check that it works correctly with `Module.parameters`:"
]
},
{
@ -723,7 +727,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and that we can call it (which will result in `forward` being called):"
"And that we can call it (which will result in `forward` being called):"
]
},
{
@ -775,7 +779,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and test it works:"
"and test it works:"
]
},
{
@ -804,7 +808,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also create a testing module to check that if we include multiple parameters as attributes, then they are all correctly registered:"
"Let's also create a testing module to check that if we include multiple parameters as attributes, they are all correctly registered:"
]
},
{
@ -851,7 +855,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We should also find that calling `cuda()` on this class puts all these parameters on the GPU:"
"We should also find that calling `cuda` on this class puts all these parameters on the GPU:"
]
},
{
@ -917,21 +921,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `forward` method here just calls each layer in turn. Note that we have to use the `register_modules` method we defined in `Module`, since otherwise the contents of `layers` won't appear in `parameters()`."
"The `forward` method here just calls each layer in turn. Note that we have to use the `register_modules` method we defined in `Module`, since otherwise the contents of `layers` won't appear in `parameters`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> important: Remember that we're not using any PyTorch functionality for modules here; we're defining everything ourselves. So if you're not sure what `register_modules` does, or why it's needed, have another look at our code for `Module` above to see what we wrote!"
"> important: All The Code is Here: Remember that we're not using any PyTorch functionality for modules here; we're defining everything ourselves. So if you're not sure what `register_modules` does, or why it's needed, have another look at our code for `Module` to see what we wrote!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a simplified `AdaptivePool` which only handles pooling to a `1x1` output, and flattens it as well, by just using `mean`:"
"We can create a simplified `AdaptivePool` that only handles pooling to a 1×1 output, and flattens it as well, by just using `mean`:"
]
},
{
@ -1000,7 +1004,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can try adding a hook. Note that we've only left room for one hook in `Module`; you could make it a list, or else use something like `Pipeline` to run a few as a single function."
"Now we can try adding a hook. Note that we've only left room for one hook in `Module`; you could make it a list, or use something like `Pipeline` to run a few as a single function:"
]
},
{
@ -1071,7 +1075,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Well actually, there's no `log` here, since we're using the same definition as PyTorch. That means we need to put the `log` together with `softmax`:"
"Well actually, there's no log here, since we're using the same definition as PyTorch. That means we need to put the log together with softmax:"
]
},
{
@ -1100,7 +1104,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Combining these together give us our cross entropy loss:"
"Combining these gives us our cross-entropy loss:"
]
},
{
@ -1128,7 +1132,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the formula \n",
"Note that the formula:\n",
"\n",
"$$\\log \\left ( \\frac{a}{b} \\right ) = \\log(a) - \\log(b)$$ \n",
"\n",
@ -1160,11 +1164,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, there is a way to compute the log of the sum of exponentials in a more stable way, called the [LogSumExp trick](https://en.wikipedia.org/wiki/LogSumExp). The idea is to use the following formula:\n",
"Then, there is a more stable way to compute the log of the sum of exponentials, called the [LogSumExp](https://en.wikipedia.org/wiki/LogSumExp) trick. The idea is to use the following formula:\n",
"\n",
"$$\\log \\left ( \\sum_{j=1}^{n} e^{x_{j}} \\right ) = \\log \\left ( e^{a} \\sum_{j=1}^{n} e^{x_{j}-a} \\right ) = a + \\log \\left ( \\sum_{j=1}^{n} e^{x_{j}-a} \\right )$$\n",
"\n",
"where a is the maximum of the $x_{j}$.\n",
"where $a$ is the maximum of $x_{j}$.\n",
"\n",
"\n",
"Here's the same thing in code:"
@ -1227,7 +1231,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...so we can use it for our `log_softmax` function:"
"so we can use it for our `log_softmax` function:"
]
},
{
@ -1243,7 +1247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...which gives the same result as before:"
"Which gives the same result as before:"
]
},
{
@ -1270,7 +1274,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use these to create `cross_entropy`:"
"We can use these to create `cross_entropy`:"
]
},
{
@ -1300,7 +1304,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have data, model, and loss function; only one thing left before we can fit a model, and that's an optimizer! Here's SGD:"
"We have data, a model, and a loss function; we only need one more thing we can fit a model, and that's an optimizer! Here's SGD:"
]
},
{
@ -1321,7 +1325,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we've seen in this book, life is easier with a `Learner`, which needs to know our training and validation sets, which means we need `DataLoaders` to store them. We don't need any other functionality--just a place to store them and access them:"
"As we've seen in this book, life is easier with a `Learner`. The `Learner` class needs to know our training and validation sets, which means we need `DataLoaders` to store them. We don't need any other functionality, just a place to store them and access them:"
]
},
{
@ -1340,7 +1344,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we're ready to create our `Learner` class, which you can see in <<class_learner>>."
"Now we're ready to create our `Learner` class:"
]
},
{
@ -1391,17 +1395,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is our largest class we've created in the book, but each method is quite small, so by looking at each in turn we'll be able to following what's going on.\n",
"This is the largest class we've created in the book, but each method is quite small, so by looking at each in turn you should be able to follow what's going on.\n",
"\n",
"The main method we'll be calling is `fit`. This loops with:\n",
"\n",
" for self.epoch in range(n_epochs)\n",
"```python\n",
"for self.epoch in range(n_epochs)\n",
"```\n",
"\n",
"...and at each epoch calls `self.one_epoch` for each of `train=True` and then `train=False`. Then `self.one_epoch` calls `self.one_batch` for each batch in `dls.train` or `dls.valid` as appropriate (after wrapping the `DataLoader` in `fastprogress.progress_bar`. Finally, `self.one_batch` follows the usual set of steps to fit one mini-batch that we've seen through this book.\n",
"and at each epoch calls `self.one_epoch` for each of `train=True` and then `train=False`. Then `self.one_epoch` calls `self.one_batch` for each batch in `dls.train` or `dls.valid`, as appropriate (after wrapping the `DataLoader` in `fastprogress.progress_bar`. Finally, `self.one_batch` follows the usual set of steps to fit one mini-batch that we've seen throughout this book.\n",
"\n",
"Before and after each step, `Learner` calls `self(...)`, which calls `__call__` (which is standard Python functionality). `__call__` uses `getattr(cb,name)` on each callback in `self.cbs`, which is a Python builtin function which returns the attribute (a method, in this case) with the requested name. So, for instance, `self('before_fit')` will call `cb.before_fit()` for each callback where that method is defined.\n",
"Before and after each step, `Learner` calls `self`, which calls `__call__` (which is standard Python functionality). `__call__` uses `getattr(cb,name)` on each callback in `self.cbs`, which is a Python built-in function that returns the attribute (a method, in this case) with the requested name. So, for instance, `self('before_fit')` will call `cb.before_fit()` for each callback where that method is defined.\n",
"\n",
"So we can see that `Learner` is really just using our standard training loop, except that it's also calling callbacks at appropriate times. So let's define some callbacks!"
"As you can see, `Learner` is really just using our standard training loop, except that it's also calling callbacks at appropriate times. So let's define some callbacks!"
]
},
{
@ -1417,9 +1423,11 @@
"source": [
"In `Learner.__init__` we have:\n",
"\n",
" for cb in cbs: cb.learner = self\n",
"```python\n",
"for cb in cbs: cb.learner = self\n",
"```\n",
"\n",
"In other words, every callback knows what learner it is used in. This is critical, since otherwise a callback can't get information from the learner, or change things in the learner. Since getting information from the learner is so common, we make that easier by definined `Callback` as a subclass of `GetAttr`, with a default attribute of `learner`. `GetAttr` is a class in fastai which implements Python's standard `__getattr__` and `__dir__` methods for you, such such any time you try to access an attribute that doesn't exist, it passes the request along to whatever you have defined as `_default`."
"In other words, every callback knows what learner it is used in. This is critical, since otherwise a callback can't get information from the learner, or change things in the learner. Because getting information from the learner is so common, we make that easier by defining `Callback` as a subclass of `GetAttr`, with a default attribute of `learner`:"
]
},
{
@ -1431,6 +1439,13 @@
"class Callback(GetAttr): _default='learner'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`GetAttr` is a fastai class that implements Python's standard `__getattr__` and `__dir__` methods for you, such such any time you try to access an attribute that doesn't exist, it passes the request along to whatever you have defined as `_default`."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1458,7 +1473,7 @@
"source": [
"In `SetupLearnerCB` we also move each mini-batch to the GPU, by calling `to_device(self.batch)` (we could also have used the longer `to_device(self.learner.batch)`. Note however that in the line `self.learner.batch = tfm_x(xb),yb` we can't remove `.learner`, because here we're *setting* the attribute, not getting it.\n",
"\n",
"Before we try our `Learner` out, let's create a callback to track and print progress, otherwise we won't really know if it's working properly:"
"Before we try our `Learner` out, let's create a callback to track and print progress. Otherwise we won't really know if it's working properly:"
]
},
{
@ -1555,7 +1570,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we're going to get good results, we'll want an LR Finder and one-cycle training. These are both *annealing* callbacks--that is, they are gradually changing hyperparameters as we train. Here's `LRFinder`:"
"If we're going to get good results, we'll want an LR finder and 1cycle training. These are both *annealing* callbacks—that is, they are gradually changing hyperparameters as we train. Here's `LRFinder`:"
]
},
{
@ -1584,7 +1599,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows how we're using `CancelFitException`, which is itself an empty class, only used to signify the type of exception. You can see in `Learner` that this exception is caught. (You should add and test `CancelBatchException`, `CancelEpochException`, etc yourself.) Let's try it out, by adding it to our list of callbacks:"
"This shows how we're using `CancelFitException`, which is itself an empty class, only used to signify the type of exception. You can see in `Learner` that this exception is caught. (You should add and test `CancelBatchException`, `CancelEpochException`, etc. yourself.) Let's try it out, by adding it to our list of callbacks:"
]
},
{
@ -1666,7 +1681,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and have a look at the results:"
"And take a look at the results:"
]
},
{
@ -1730,7 +1745,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll try an LR of `0.1`:"
"We'll try an LR of 0.1:"
]
},
{
@ -1747,7 +1762,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's fit for a while and see how it looks (we won't show all the output in the book--try it in the notebook to see the results):"
"Let's fit for a while and see how it looks (we won't show all the output in the booktry it in the notebook to see the results):"
]
},
{
@ -1764,7 +1779,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we'll check that the learning rate followed the schedule we defined (as you see, we're not using cosine annealing here)."
"Finally, we'll check that the learning rate followed the schedule we defined (as you see, we're not using cosine annealing here):"
]
},
{
@ -1800,7 +1815,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have looked at how the key concepts of the fastai library are implemented by re-implementing them in this chapter. Since it's mostly full of code, you should definitely try to experiment with it by looking at the corresponding notebook on the book website. As a next step, be sure to check the intermediate and advanced tutorials in the fastai documentation, to learn how to customize every bit of the library now that you know the way it's built."
"We have explored the key concepts of the fastai library are implemented by re-implementing them in this chapter. Since it's mostly full of code, you should definitely try to experiment with it by looking at the corresponding notebook on the book's website. Now that you know how it's built, as a next step be sure to check out the intermediate and advanced tutorials in the fastai documentation to learn how to customize every bit of the libraryt."
]
},
{
@ -1814,37 +1829,37 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For the questions here that ask you to explain what some function or class is, you should also complete your own code experiments."
"> tip: Experiments: For the questions here that ask you to explain what some function or class is, you should also complete your own code experiments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is glob?\n",
"1. What is `glob`?\n",
"1. How do you open an image with the Python imaging library?\n",
"1. What does L.map do?\n",
"1. What does Self do?\n",
"1. What is L.val2idx?\n",
"1. What methods do you need to implement to create your own Dataset?\n",
"1. What does `L.map` do?\n",
"1. What does `Self` do?\n",
"1. What is `L.val2idx`?\n",
"1. What methods do you need to implement to create your own `Dataset`?\n",
"1. Why do we call `convert` when we open an image from Imagenette?\n",
"1. What does `~` do? How is it useful for splitting training and validation sets?\n",
"1. Which of these classes does `~` work with: `L`, `Tensor`, numpy array, Python `list`, pandas `DataFrame`?\n",
"1. What is ProcessPoolExecutor?\n",
"1. Does `~` work with the `L` or `Tensor` classes? What about NumPy arrays, Python lists, or pandas DataFrames?\n",
"1. What is `ProcessPoolExecutor`?\n",
"1. How does `L.range(self.ds)` work?\n",
"1. What is `__iter__`?\n",
"1. What is `first`?\n",
"1. What is `permute`? Why is it needed?\n",
"1. What is a recursive function? How does it help us define the `parameters` method?\n",
"1. Write a recursive function which returns the first 20 items of the Fibonacci sequence.\n",
"1. Write a recursive function that returns the first 20 items of the Fibonacci sequence.\n",
"1. What is `super`?\n",
"1. Why do subclasses of Module need to override `forward` instead of defining `__call__`?\n",
"1. In `ConvLayer` why does `init` depend on `act`?\n",
"1. Why do subclasses of `Module` need to override `forward` instead of defining `__call__`?\n",
"1. In `ConvLayer`, why does `init` depend on `act`?\n",
"1. Why does `Sequential` need to call `register_modules`?\n",
"1. Write a hook that prints the shape of every layers activations.\n",
"1. What is LogSumExp?\n",
"1. Why is log_softmax useful?\n",
"1. What is GetAttr? How is it helpful for callbacks?\n",
"1. Write a hook that prints the shape of every layer's activations.\n",
"1. What is \"LogSumExp\"?\n",
"1. Why is `log_softmax` useful?\n",
"1. What is `GetAttr`? How is it helpful for callbacks?\n",
"1. Reimplement one of the callbacks in this chapter without inheriting from `Callback` or `GetAttr`.\n",
"1. What does `Learner.__call__` do?\n",
"1. What is `getattr`? (Note the case difference to `GetAttr`!)\n",
@ -1852,7 +1867,7 @@
"1. Why do we check for `model.training` in `one_batch`?\n",
"1. What is `store_attr`?\n",
"1. What is the purpose of `TrackResults.before_epoch`?\n",
"1. What does `model.cuda()` do? How does it work?\n",
"1. What does `model.cuda` do? How does it work?\n",
"1. Why do we need to check `model.training` in `LRFinder` and `OneCycle`?\n",
"1. Use cosine annealing in `OneCycle`."
]
@ -1868,15 +1883,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write `resnet18` from scratch (refer to <<chapter_resnet>> as needed), and train it with the Learner in this chapter.\n",
"1. Implement a batchnorm layer from scratch and use it in your resnet18.\n",
"1. Write a mixup callback for use in this chapter.\n",
"1. Add momentum to `SGD`.\n",
"1. Write `resnet18` from scratch (refer to <<chapter_resnet>> as needed), and train it with the `Learner` in this chapter.\n",
"1. Implement a batchnorm layer from scratch and use it in your `resnet18`.\n",
"1. Write a Mixup callback for use in this chapter.\n",
"1. Add momentum to SGD.\n",
"1. Pick a few features that you're interested in from fastai (or any other library) and implement them in this chapter.\n",
"1. Pick a research paper that's not yet implemented in fastai or PyTorch and implement it in this chapter.\n",
" - Port it over to fastai.\n",
" - Submit a PR to fastai, or create your own extension module and release it. \n",
" - Hint: you may find it helpful to use [nbdev](https://nbdev.fast.ai/) to create and deploy your package."
" - Submit a pull request to fastai, or create your own extension module and release it. \n",
" - Hint: you may find it helpful to use [`nbdev`](https://nbdev.fast.ai/) to create and deploy your package."
]
},
{

View File

@ -18,9 +18,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You've made it! If you have worked through all of the notebooks to this point, then you have joined a small, but growing group of people that are able to harness the power of deep learning to solve real problems. You may not feel that way; in fact you probably do not feel that way. We have seen again and again that students that complete the fast.ai courses dramatically underestimate how effective they are as deep learning practitioners. We've also seen that these people are often underestimated by those that have come out of a classic academic background. So for you to rise above your own expectations and the expectations of others what you do next, after closing this book, is even more important than what you've done to get to this point.\n",
"Congratulations! You've made it! If you have worked through all of the notebooks to this point, then you have joined the small, but growing group of people that are able to harness the power of deep learning to solve real problems. You may not feel that way yet—in fact you probably don't. We have seen again and again that students that complete the fast.ai courses dramatically underestimate how effective they are as deep learning practitioners. We've also seen that these people are often underestimated by others with a classic academic background. So if you are to rise above your own expectations and the expectations of others, what you do next, after closing this book, is even more important than what you've done to get to this point.\n",
"\n",
"The most important thing is to keep the momentum going. In fact, as you know from your study of optimisers, momentum is something which can build upon itself! So think about what it is you can do now to maintain and accelerate your deep learning journey. <<do_next>> can give you a few ideas."
"The most important thing is to keep the momentum going. In fact, as you know from your study of optimizers, momentum is something that can build upon itself! So think about what you can do now to maintain and accelerate your deep learning journey. <<do_next>> can give you a few ideas."
]
},
{
@ -34,23 +34,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've talked a lot in this book about the value of writing, whether it be code or prose. But perhaps you haven't quite written as much as you had hoped so far. That's okay! Now is a great chance to turn that around. You have a lot to say, at this point. Perhaps you have tried some experiments on a dataset which other people don't seem to have looked at in quite the same way — so tell the world about it! Or perhaps you are just curious to try out some ideas that you had been thinking about while you were reading; now is a great chance to turn those ideas into code.\n",
"We've talked a lot in this book about the value of writing, whether it be code or prose. But perhaps you haven't quite written as much as you had hoped so far. That's okay! Now is a great chance to turn that around. You have a lot to say, at this point. Perhaps you have tried some experiments on a dataset that other people don't seem to have looked at in quite the same way. Tell the world about it! Or perhaps thinking about trying out some ideas that occurred to you while you were reading—now is a great time to turn those ideas into code.\n",
"\n",
"One fairly low-key place for your writing is the fast.ai forums at forums.fast.ai. You will find that the community there is very supportive and helpful, so please do drop by and let us know what you've been up to. Or see if you can answer a few questions for those folks who are earlier in their journey than you.\n",
"If you'd like to share your ideas, one fairly low-key place to do so is the [fast.ai forums](https://forums.fast.ai/). You will find that the community there is very supportive and helpful, so please do drop by and let us know what you've been up to. Or see if you can answer a few questions for those folks who are earlier in their journey than you.\n",
"\n",
"And if you do have some success, big or small, in your deep learning journey, be sure to let us know! It's especially helpful if you post about it on the forums, because for others to learn about the successes of other students can be extremely motivating.\n",
"And if you do have some successes, big or small, in your deep learning journey, be sure to let us know! It's especially helpful if you post about them on the forums, because learning about the successes of other students can be extremely motivating.\n",
"\n",
"Perhaps the most important approach for many people to stay connected with their learning journey is to build a community around it. For instance, you could try to set up a small deep learning Meetup in your local neighbourhood, or a study group, or even offer to do a talk at a local meet up about what you've learned so far, or some particular aspect that interested you. It is okay that you are not the world's leading expert just yet the important thing to remember is that you now know about plenty of stuff that other people don't, so they are very likely to appreciate your perspective.\n",
"Perhaps the most important approach for many people to stay connected with their learning journey is to build a community around it. For instance, you could try to set up a small deep learning meetup in your local neighborhood, or a study group, or even offer to do a talk at a local meetup about what you've learned so far or some particular aspect that interested you. It's okay that you are not the world's leading expert just yet—the important thing to remember is that you now know about plenty of stuff that other people don't, so they are very likely to appreciate your perspective.\n",
"\n",
"Another community event which many people find useful is a regular book club or paper reading club. You might find that there are some in your neighbourhood already, or otherwise you could try to get one started yourself. Even if there is just one other person doing it with you, it will help give you the support and encouragement to get going.\n",
"Another community event which many people find useful is a regular book club or paper reading club. You might find that there are some in your neighbourhood already, and if not you could try to get one started yourself. Even if there is just one other person doing it with you, it will help give you the support and encouragement to get going.\n",
"\n",
"If you are not in a geography where it's easy to get together with like-minded folks in person, drop by the forums, because there are lots of people always starting up virtual study groups. These generally involve a bunch of people getting together over video chat once every week or so, and discussing some deep learning topic.\n",
"If you are not in a geography where it's easy to get together with like-minded folks in person, drop by the forums, because there are always people starting up virtual study groups. These generally involve a bunch of folks getting together over video chat once a week or so to discuss some deep learning topic.\n",
"\n",
"Hopefully, by this point, you have a few little projects that you put together, and experiments that you've run. Our recommendation is generally to pick one of these and make it as good as you can. Really polish it up into the best piece of work that you can something you are really proud of. This will force you to go much deeper into a topic, which will really test out your understanding, and give you the opportunity to see what you can do when you really put your mind to it.\n",
"Hopefully, by this point, you have a few little projects that you've put together and experiments that you've run. Our recommendation for the next step is to pick one of these and make it as good as you can. Really polish it up into the best piece of work that you can—something you are really proud of. This will force you to go much deeper into a topic, which will really test your understanding and give you the opportunity to see what you can do when you really put your mind to it.\n",
"\n",
"Also, you may want to take a look at the fast.ai free online course which covers the same material as this book. Sometimes, seeing the same material in two different ways, can really help to crystallise the ideas. In fact, human learning researchers have found that this is one of the best ways to learn material — to see the same thing from different angles, described in different ways.\n",
"Also, you may want to take a look at the fast.ai free online course that covers the same material as this book. Sometimes, seeing the same material in two different ways can really help to crystallize the ideas. In fact, human learning researchers have found that one of the best ways to learn material is to see the same thing from different angles, described in different ways.\n",
"\n",
"Your final mission, should you choose to accept it, is to take this book, and give it to somebody that you know — and let somebody else start their way down their own deep learning journey!"
"Your final mission, should you choose to accept it, is to take this book and give it to somebody that you know—and get somebody else starte on their own deep learning journey!"
]
},
{

View File

@ -27,6 +27,15 @@
"# Creating a Blog"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, when it comes to blogging, it seems like you have to make a difficult decision: either use a platform that makes it easy but subjects you and your readers to advertisements, paywalls, and fees, or spend hours setting up your own hosting service and weeks learning about all kinds of intricate details. Perhaps the biggest benefit to the \"do-it-yourself\" approach is that you really own your own posts, rather than being at the whim of a service provider and their decisions about how to monetize your content in the future.\n",
"\n",
"It turns out, however, that you can have the best of both worlds! "
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -38,11 +47,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, when it comes to blogging, it seems like you have to make a difficult decision: either use a platform that makes it easy, but subjects you and your readers to advertisements, pay walls, and fees, or spend hours setting up your own hosting and weeks learning about all kinds of intricate details. Perhaps the biggest benefit to the \"do-it-yourself\" approach is that you really owning your own posts, rather than being at the whim of a service provider, and their decisions about how to monetize your content in the future.\n",
"A great solution is to host your blog on a platform called [GitHub Pages](https://pages.github.com/), which is free, has no ads or pay wall, and makes your data available in a standard way such that you can at any time move your blog to another host. But all the approaches Ive seen to using GitHub Pages have required knowledge of the command line and arcane tools that only software developers are likely to be familiar with. For instance, GitHub's [own documentation](https://help.github.com/en/github/working-with-github-pages/creating-a-github-pages-site-with-jekyll) on setting up a blog includes a long list of instructions that involve installing the Ruby programming language, using the `git` command-line tool, copying over version numbers, and more—17 steps in total!\n",
"\n",
"It turns out, however, that you can have the best of both worlds! You can host on a platform called [GitHub Pages](https://pages.github.com/), which is free, has no ads or pay wall, and makes your data available in a standard way such that you can at any time move your blog to another host. But all the approaches Ive seen to using GitHub Pages have required knowledge of the command line and arcane tools that only software developers are likely to be familiar with. For instance, GitHub's [own documentation](https://help.github.com/en/github/working-with-github-pages/creating-a-github-pages-site-with-jekyll) on setting up a blog requires installing the Ruby programming language, using the git command line tool, copying over version numbers, and more. 17 steps in total!\n",
"\n",
"Weve curated an easy approach, which allows you to use an **entirely browser-based interface** for all your blogging needs. You will be up and running with your new blog within about five minutes. It doesnt cost anything, and you can easily add your own custom domain to it if you wish to. Heres how to do it, using a template we've created called **fast\\_template**. (NB: be sure to check the [book website](https://book.fast.ai) for the latest blog recommendations, since new tools are always coming out; for instance, we're currently working with GitHub on creating a new tool called \"fastpages\" which is a more advanced version of `fast_template` that's particularly designed for people using Jupyter Notebooks)."
"To cut down the hassle, weve created an easy approach that allows you to use an *entirely browser-based interface* for all your blogging needs. You will be up and running with your new blog within about five minutes. It doesnt cost anything, and you can easily add your own custom domain to it if you wish to. In this section, we'll explain how to do it, using a template we've created called *fast\\_template*. (NB: be sure to check the [book's website](https://book.fast.ai) for the latest blog recommendations, since new tools are always coming out; for instance, we're currently working with GitHub on creating a new tool called `fastpages` that is a more advanced version of `fast_template` particularly designed for people using Jupyter notebooks)."
]
},
{
@ -56,81 +63,59 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Youll need an account on GitHub. So, head over there now, and create an account if you dont have one already. Make sure that you are logged in. Normally, GitHub is used by software developers for writing code, and they use a sophisticated command line tool to work with it. But I'm going to show you an approach that doesn't use the command line at all!\n",
"Youll need an account on GitHub, so head over there now and create an account if you dont have one already. Make sure that you are logged in. Normally, GitHub is used by software developers for writing code, and they use a sophisticated command-line tool to work with it—but we're going to show you an approach that doesn't use the command line at all!\n",
"\n",
"To get started, click on this link: [https://github.com/fastai/fast_template/generate](https://github.com/fastai/fast_template/generate) . This will allow you to create a place to store your blog, called a \"*repository*\". You will see the following screen; you have to enter your repository name using the **exact form you see below**, that is, the username you used at GitHub followed by `.github.io`.\n",
"To get started, point your browser to [https://github.com/fastai/fast_template/generate](https://github.com/fastai/fast_template/generate) (you need to be logged in to GitHub for the link to work). This will allow you to create a place to store your blog, called a *repository*. You will a screen like the one in <<githup_repo>>. Note that you have to enter your repository name using the *exact* format shown here—that is, your GitHub username followed by `.github.io`.\n",
"\n",
"<img width=\"440\" src=\"images/fast_template/image1.png\" id=\"githup_repo\" caption=\"Creating your repository\" alt=\"Screebshot of the GitHub page for creating a new repository\">\n",
"\n",
"> Important: Note that if you don't use username.github.io as the name, it won't work!\n",
"Once youve entered that, and any description you like, click \"Create repository from template.\" You have the choice to make the repository \"private,\" but since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won't be a problem for you.\n",
"\n",
"Once youve entered that, and any description you like, click on \"create repository from template\". You have the choice to make the repository \"private\" but since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won't be a problem for you.\n",
"\n",
"Now, let's set up your homepage!"
"Now, let's set up your home page!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting up Your Homepage"
"### Setting Up Your Home Page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When readers first arrive at your blog the first thing that they will see is the content of a file called \"index.md\". This is a [markdown](https://guides.github.com/features/mastering-markdown/) file. Markdown is a powerful yet simple way of creating formatted text, such as bullet points, italics, hyperlinks, and so forth. It is very widely used, including all the formatting in Jupyter notebooks, nearly every part of the GitHub site, and many other places all over the Internet. To create markdown text, you can just type in plain regular English. But then you can add some special characters to add special behavior. For instance, if you type a `*` character around a word or phrase then that will put it in *italics*. Lets try it now.\n",
"When readers arrive at your blog the first thing that they will see is the content of a file called *index.md*. This is a [markdown](https://guides.github.com/features/mastering-markdown/) file. Markdown is a powerful yet simple way of creating formatted text, such as bullet points, italics, hyperlinks, and so forth. It is very widely used, including for all the formatting in Jupyter notebooks, nearly every part of the GitHub site, and many other places all over the internet. To create markdown text, you can just type in plain English, then add some special characters to add special behavior. For instance, if you type a `*` character before and after a word or phrase, that will put it in *italics*. Lets try it now.\n",
"\n",
"To open the file, click its file name in GitHub.\n",
"To open the file, click its filename in GitHub. To edit it, click on the pencil icon at the far right hand side of the screen as shown in <<fastpage_edit>>.\n",
"\n",
"<img width=\"140\" src=\"images/fast_template/image2.png\" alt=\"Screenshot showing a click on the index.md file\">\n",
"\n",
"To edit it, click on the pencil icon at the far right hand side of the\n",
"screen.\n",
"\n",
"<img width=\"800\" src=\"images/fast_template/image3.png\" alt=\"Screenshot showing where to click to edit the file\">"
"<img width=\"800\" src=\"images/fast_template/image3.png\" alt=\"Screenshot showing where to click to edit the file\" id=\"fastpage_edit\" caption=\"Click Edit this file\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add, edit, or replace the texts that you see. Click on the\n",
"\"preview changes\" button to see how well your markdown text will look\n",
"on your blog. Lines that you have added or changed will appear with a\n",
"green bar on the left-hand side.\n",
"You can add to, edit, or replace the texts that you see. Click \"Preview changes\" (<<fastpage_preview>>) to see what your markdown text will look like in your blog. Lines that you have added or changed will appear with a green bar on the lefthand side.\n",
"\n",
"<img width=\"350\" src=\"images/fast_template/image4.png\" alt=\"Screenshot showing where to click to preview changes\">\n",
"<img width=\"350\" src=\"images/fast_template/image4.png\" alt=\"Screenshot showing where to click to preview changes\" id=\"fastpage_preview\" caption=\"Preview changes to catch any mistake\">\n",
"\n",
"To save your changes to your blog, you must scroll to the bottom and\n",
"click on the \"commit changes\" green button. On GitHub, to \"commit\"\n",
"something means to save it to the GitHub server.\n",
"To save your changes, scroll to the bottom of the page and click \"Commit changes,\" as shown in <<fastpage_commit>>. On GitHub, to \"commit\" something means to save it to the GitHub server.\n",
"\n",
"<img width=\"600\" src=\"images/fast_template/image5.png\" alt=\"Screenshot showing where to click to commit the changes\">"
"<img width=\"600\" src=\"images/fast_template/image5.png\" alt=\"Screenshot showing where to click to commit the changes\" id=\"fastpage_commit\" caption=\"Commit your changes to save them\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you should configure your blogs settings. To do so, click on the\n",
"file called \"\\_config.yml\", and then click on the edit button like you\n",
"did for the index file above. Change the title, description, and GitHub\n",
"username values. You need to leave the names before the colons in place\n",
"and type your new values in after the colon and space on each line. You\n",
"can also add to your email and Twitter username if you wish — but note\n",
"that these will appear on your public blog if you do fill them in here.\n",
"Next, you should configure your blogs settings. To do so, click on the file called *\\_config.yml*, then click the edit button like you did for the index file. Change the title, description, and GitHub username values (see <<github_config>>. You need to leave the names before the colons in place, and type your new values in after the colon (and a space) on each line. You can also add to your email address and Twitter username if you wish, but note that these will appear on your public blog if you fill them in here.\n",
"\n",
"<img width=\"800\" src=\"images/fast_template/image6.png\" id=\"github_config\" caption=\"Fill the config file\" alt=\"Screenshot showing the config file and how to fill it\">\n",
"<img width=\"800\" src=\"images/fast_template/image6.png\" id=\"github_config\" caption=\"Fill in the config file\" alt=\"Screenshot showing the config file and how to fill it in\">\n",
"\n",
"After youre done, commit your changes just like you did with the index\n",
"file before. Then wait about a minute, whilst GitHub processes your new\n",
"blog. Then you will be able to go to your blog in your web browser, by\n",
"opening the URL: username.github.io (replace \"username\" with your\n",
"GitHub username). You should see your blog!\n",
"After youre done, commit your changes just like you did with the index file, then wait a minute or so while GitHub processes your new blog. Point your web browser to *&lt;username> .github.io* (replacing *&lt;username>* with your GitHub username). You should see your blog, which will look something like <<github_blog>>.\n",
"\n",
"<img width=\"540\" src=\"images/fast_template/image7.png\" id=\"github_blog\" caption=\"Your blog is onlyine!\" alt=\"Screenshot showing the website username.github.io\">"
"<img width=\"540\" src=\"images/fast_template/image7.png\" id=\"github_blog\" caption=\"Your blog is online!\" alt=\"Screenshot showing the website username.github.io\">"
]
},
{
@ -144,76 +129,55 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now youre ready to create your first post. All your posts will go in\n",
"the \"\\_posts\" folder. Click on that now, and then click on the \"create\n",
"file\" button. You need to be careful to name your file in the following\n",
"format: \"year-month-day-name.md\", where year is a four-digit number, and\n",
"month and day are two-digit numbers. \"Name\" can be anything you want,\n",
"that will help you remember what this post was about. The \"md\" extension\n",
"is for markdown documents.\n",
"Now youre ready to create your first post. All your posts will go in the *\\_posts* folder. Click on that now, and then click the \"Create file\" button. You need to be careful to name your file using the format *&lt;year>-&lt;month>-&lt;day>-&lt;name>.md*, as shwon in <<fastpage_name>>, where *&lt;year>* is a four-digit number, and *&lt;month>* and *&lt;day>* are two-digit numbers. *&lt;name>* can be anything you want that will help you remember what this post was about. The *.md* extension is for markdown documents.\n",
"\n",
"<img width=\"440\" src=\"images/fast_template/image8.png\" alt=\"Screenshot showing the right syntax to create a new blog post\">\n",
"<img width=\"440\" src=\"images/fast_template/image8.png\" alt=\"Screenshot showing the right syntax to create a new blog post\" id=\"fastpage_name\" caption=\"Naming your posts\">\n",
"\n",
"You can then type the contents of your first post. The only rule is that\n",
"the first line of your post must be a markdown heading. This is created\n",
"by putting `# ` at the start of a line (that creates a level 1\n",
"heading, which you should just use once at the start of your document;\n",
"you create level 2 headings using `## `, level 3 with `###`, and so forth.)\n",
"You can then type the contents of your first post. The only rule is that the first line of your post must be a markdown heading. This is created by putting `# ` at the start of a line, as shown in <<fastpage_title>> (that creates a level-1 heading, which you should just use once at the start of your document; you can create level-2 headings using `## `, level 3 with `###`, and so forth).\n",
"\n",
"<img width=\"300\" src=\"images/fast_template/image9.png\" alt=\"Screenshot showing the start of a blog post\">"
"<img width=\"300\" src=\"images/fast_template/image9.png\" alt=\"Screenshot showing the start of a blog post\" id=\"fastpage_title\" caption=\"Markdown syntax for a title\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As before, you can click on the \"preview\" button to see how your\n",
"markdown formatting will look.\n",
"As before, you can click the \"Preview\" button to see how your markdown formatting will look (<<fastpage_preview1>>).\n",
"\n",
"<img width=\"400\" src=\"images/fast_template/image10.png\" alt=\"Screenshot showing the same blog post interpreted in HTML\">\n",
"<img width=\"400\" src=\"images/fast_template/image10.png\" alt=\"Screenshot showing the same blog post interpreted in HTML\" id=\"fastpage_preview1\" caption=\"What the previous mardown syntax will look like on your blog\">\n",
"\n",
"And you will need to click the \"commit new file\" button to save it to\n",
"GitHub.\n",
"And you will need to click the \"Commit new file\" button to save it to GitHub, as shown in <<fastpage_commit1>>.\n",
"\n",
"<img width=\"700\" src=\"images/fast_template/image11.png\" alt=\"Screenshot showing where to click to commit the new file\">"
"<img width=\"700\" src=\"images/fast_template/image11.png\" alt=\"Screenshot showing where to click to commit the new file\" id=\"fastpage_commit1\" caption=\"Commit your changes to save them\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Have a look at your blog homepage again, and you will see that this post\n",
"has now appeared! (Remember that you will need to wait a minute or so\n",
"for GitHub to process it.)\n",
"Have a look at your blog home page again, and you will see that this post has now appeared--<<fastpage_live>> shows the result with the sample pose we just added. (Remember that you will need to wait a minute or so for GitHub to process the request before the file shows up.)\n",
"\n",
"<img width=\"500\" src=\"images/fast_template/image12.png\" alt=\"Screenshot showing the first post on the blog website\">\n",
"<img width=\"500\" src=\"images/fast_template/image12.png\" alt=\"Screenshot showing the first post on the blog website\" id=\"fastpage_live\" caption=\"Your first post is live!\">\n",
"\n",
"Youll also see that we provided a sample blog post, which you can go\n",
"ahead and delete now. Go to your posts folder, as before, and click on\n",
"\"2020-01-14-welcome.md\". Then click on the trash icon on the far\n",
"right.\n",
"You may have noticed that we provided a sample blog post, which you can go ahead and delete now. Go to your *\\_posts* folder, as before, and click on *2020-01-14-welcome.md*. Then click the trash icon on the far right, as shown in <<fastpage_delete>>.\n",
"\n",
"<img width=\"400\" src=\"images/fast_template/image13.png\" alt=\"Screenshot showing how to delete the mock post\">"
"<img width=\"400\" src=\"images/fast_template/image13.png\" alt=\"Screenshot showing how to delete the mock post\" id=\"fastpage_delete\" caption=\"Delete the sample blog post\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In GitHub, nothing actually changes until you commit— including deleting\n",
"a file! So, after you click the trash icon, scroll down to the bottom\n",
"and commit your changes.\n",
"In GitHub, nothing actually changes until you commit—including when you delete a file! So, after you click the trash icon, scroll down to the bottom of the page and commit your changes.\n",
"\n",
"You can include images in your posts by adding a line of markdown like\n",
"the following:\n",
"\n",
" ![Image description](images/filename.jpg)\n",
"\n",
"For this to work, you will need to put the image inside your \"images\"\n",
"folder. To do this, click on the images folder to go into it in GitHub,\n",
"and then click the \"upload files\" button.\n",
"For this to work, you will need to put the image inside your *images* folder. To do this, click the *images* folder, them click \"Upload files\" button (<<fastpage_upload>>).\n",
"\n",
"<img width=\"400\" src=\"images/fast_template/image14.png\" alt=\"Screenshot showing how to upload new files\">"
"<img width=\"400\" src=\"images/fast_template/image14.png\" alt=\"Screenshot showing how to upload new files\" id=\"fastpage_upload\" caption=\"Upload a file from your computer\">"
]
},
{
@ -234,47 +198,47 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Theres lots of reasons you might want to copy your blog content from GitHub to your computer. Perhaps you want to read or edit your posts offline. Or maybe youd like a backup in case something happens to your GitHub repository.\n",
"There are lots of reasons you might want to copy your blog content from GitHub to your computer--you might want to be able to read or edit your posts offline, or maybe youd like a backup in case something happens to your GitHub repository.\n",
"\n",
"GitHub does more than just let you copy your repository to your computer; it lets you *synchronize* it with your computer. So, you can make changes on GitHub, and theyll copy over to your computer, and you can make changes on your computer, and theyll copy over to GitHub. You can even let other people access and modify your blog, and their changes and your changes will be automatically combined together next time you sync.\n",
"GitHub does more than just let you copy your repository to your computer; it lets you *synchronize* it with your computer. That means you can make changes on GitHub, and theyll copy over to your computer, and you can make changes on your computer, and theyll copy over to GitHub. You can even let other people access and modify your blog, and their changes and your changes will be automatically combined together the next time you sync.\n",
"\n",
"To make this work, you have to install an application called [GitHub Desktop](https://desktop.github.com/) to your computer. It runs on Mac, Windows, and Linux. Follow the directions at the link to install it, then when you run it itll ask you to login to GitHub, and then to select your repository to sync; click \"Clone a repository from the Internet\".\n",
"To make this work, you have to install an application called [GitHub Desktop](https://desktop.github.com/) on your computer. It runs on Mac, Windows, and Linux. Follow the directions to install it, and when you run it itll ask you to log in to GitHub and select the repository to sync. Click \"Clone a repository from the Internet,\" as shown in <<fastpage_clone>>.\n",
"\n",
"<img src=\"images/gitblog/image1.png\" width=\"400\" alt=\"A screenshot showing how to clone your repository\">"
"<img src=\"images/gitblog/image1.png\" width=\"400\" alt=\"A screenshot showing how to clone your repository\" id=\"fastpage_clone\" caption=\"Clone your repository on GitHub Desktop\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once GitHub has finished syncing your repo, youll be able to click \"View the files of your repository in Finder\" (or Explorer), and youll see the local copy of your blog! Try editing one of the files on your computer. Then return to GitHub Desktop, and youll see the \"Sync\" button is waiting for you to press it. When you click it, your changes will be copied over to GitHub, where youll see them reflected on the web site.\n",
"Once GitHub has finished syncing your repo, youll be able to click \"View the files of your repository in Explorer\" (or Finder), as shown in <<fastpage_explorer>> and youll see the local copy of your blog! Try editing one of the files on your computer. Then return to GitHub Desktop, and youll see the \"Sync\" button is waiting for you to press it. When you click it your changes will be copied over to GitHub, where youll see them reflected on the website.\n",
"\n",
"<img src=\"images/gitblog/image2.png\" width=\"600\" alt=\"A screenshot showing the cloned repository\">"
"<img src=\"images/gitblog/image2.png\" width=\"600\" alt=\"A screenshot showing the cloned repository\" id=\"fastpage_explorer\" caption=\"Viewing your files locally\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you haven't used git before, GitHub Desktop and a blog is a great way to get started. As you'll discover, it's a fundamental tool used by most data scientists. Another tool that we hope you now love too is Jupyter Notebooks. And there is a way to write your blog directly with it!"
"If you haven't used `git` before, GitHub Desktop is a great way to get started. As you'll discover, it's a fundamental tool used by most data scientists. Another tool that we hope you now love is Jupyter Notebooks--and there's a way to write your blog directly with that too!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jupyter for Blogging"
"## Jupyter for Blogging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write blog posts using Jupyter Notebooks! Your markdown cells, code cells, and all outputs will appear in your exported blog post. The best way to do this may have changed by the time you are reading this book, so be sure to check out the [book website](https://book.fast.ai) for the latest information. As we write this, the easiest way to create a blog from notebooks is to use [fastpages](http://fastpages.fast.ai/), which is a more advanced version of `fast_template`. \n",
"You can also write blog posts using Jupyter notebooks. Your markdown cells, code cells, and all the outputs will appear in your exported blog post. The best way to do this may have changed by the time you are reading this book, so be sure to check out the [book's website](https://book.fast.ai) for the latest information. As we write this, the easiest way to create a blog from notebooks is to use [`fastpages`](http://fastpages.fast.ai/), which is a more advanced version of `fast_template`. \n",
"\n",
"To blog with a notebook, just pop it in the `_notebooks` folder in your blog repo, and it will appear in your blog. When you write your notebook, write whatever you want your audience to see. Since most writing platforms make it much harder to include code and outputs, many of us are in a habit of including less real examples than we should. So try to get into a new habit of including lots of examples as you write.\n",
"To blog with a notebook, just pop it in the *\\_notebooks* folder in your blog repo, and it will appear in your list of blog posts. When you write your notebook, write whatever you want your audience to see. Since most writing platforms make it hard to include code and outputs, many of us are in the habit of including fewer real examples than we should. This is a great way to instead get into the habit of including lots of examples as you write.\n",
"\n",
"Often you'll want to hide boilerplate such as import statements. Add `#hide` to the top of any cell to make it not show up in output. Jupyter displays the result of the last line of a cell, so there's no need to include `print()`. (And including extra code that isn't needed means there's more cognitive overhead for the reader; so don't include code that you don't really need!)"
"Often, you'll want to hide boilerplate such as import statements. You can add `#hide` to the top of any cell to make it not show up in output. Jupyter displays the result of the last line of a cell, so there's no need to include `print()`. (Including extra code that isn't needed means there's more cognitive overhead for the reader; so don't include code that you don't really need!)"
]
},
{

View File

@ -1513,7 +1513,7 @@
"1. What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?\n",
"1. What do you need in order to train a model?\n",
"1. How could a feedback loop impact the rollout of a predictive policing model?\n",
"1. Do we always have to use 224\\*224-pixel images with the cat recognition model?\n",
"1. Do we always have to use 224×224-pixel images with the cat recognition model?\n",
"1. What is the difference between classification and regression?\n",
"1. What is a validation set? What is a test set? Why do we need them?\n",
"1. What will fastai do if you don't provide a validation set?\n",

View File

@ -4232,7 +4232,7 @@
"1. What is the difference between tensor rank and shape? How do you get the rank from the shape?\n",
"1. What are RMSE and L1 norm?\n",
"1. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?\n",
"1. Create a 3\\*3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.\n",
"1. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.\n",
"1. What is broadcasting?\n",
"1. Are metrics generally calculated using the training set, or the validation set? Why?\n",
"1. What is SGD?\n",

View File

@ -2590,8 +2590,8 @@
"source": [
"1. What is a \"feature\"?\n",
"1. Write out the convolutional kernel matrix for a top edge detector.\n",
"1. Write out the mathematical operation applied by a 3\\*3 kernel to a single pixel in an image.\n",
"1. What is the value of a convolutional kernel apply to a 3\\*3 matrix of zeros?\n",
"1. Write out the mathematical operation applied by a 3×3 kernel to a single pixel in an image.\n",
"1. What is the value of a convolutional kernel apply to a 3×3 matrix of zeros?\n",
"1. What is \"padding\"?\n",
"1. What is \"stride\"?\n",
"1. Create a nested list comprehension to complete any task that you choose.\n",

View File

@ -841,7 +841,7 @@
"1. What is the basic equation for a ResNet block (ignoring batchnorm and ReLU layers)?\n",
"1. What do ResNets have to do with residuals?\n",
"1. How do we deal with the skip connection when there is a stride-2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1\\*1 convolution in terms of a vector dot product?\n",
"1. How can we express a 1×1 convolution in terms of a vector dot product?\n",
"1. Create a `1x1 convolution` with `F.conv2d` or `nn.Conv2d` and apply it to an image. What happens to the `shape` of the image?\n",
"1. What does the `noop` function return?\n",
"1. Explain what is shown in <<resnet_surface>>.\n",
@ -865,7 +865,7 @@
"metadata": {},
"source": [
"1. Try creating a fully convolutional net with adaptive average pooling for MNIST (note that you'll need fewer stride-2 layers). How does it compare to a network without such a pooling layer?\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1\\*1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. In <<chapter_foundations>> we introduce *Einstein summation notation*. Skip ahead to see how this works, and then write an implementation of the 1×1 convolution operation using `torch.einsum`. Compare it to the same operation using `torch.conv2d`.\n",
"1. Write a \"top-5 accuracy\" function using plain PyTorch or plain Python.\n",
"1. Train a model on Imagenette for more epochs, with and without label smoothing. Take a look at the Imagenette leaderboards and see how close you can get to the best results shown. Read the linked pages describing the leading approaches."
]

View File

@ -23,7 +23,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Neural Net Layer from Scratch"
"## Building a Neural Net Layer from Scratch"
]
},
{
@ -710,7 +710,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Broadcasting Rules"
"#### Broadcasting rules"
]
},
{
@ -1156,7 +1156,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gradients and Backward Pass"
"### Gradients and the Backward Pass"
]
},
{
@ -1258,7 +1258,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Refactor the Model"
"### Refactoring the Model"
]
},
{
@ -1537,41 +1537,41 @@
"1. Write the Python code to implement a single neuron.\n",
"1. Write the Python code to implement ReLU.\n",
"1. Write the Python code for a dense layer in terms of matrix multiplication.\n",
"1. Write the Python code for a dense layer in plain Python (that is with list comprehensions and functionality built into Python).\n",
"1. What is the hidden size of a layer?\n",
"1. What does the `t` method to in PyTorch?\n",
"1. Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python).\n",
"1. What is the \"hidden size\" of a layer?\n",
"1. What does the `t` method do in PyTorch?\n",
"1. Why is matrix multiplication written in plain Python very slow?\n",
"1. In matmul, why is `ac==br`?\n",
"1. In Jupyter notebook, how do you measure the time taken for a single cell to execute?\n",
"1. What is elementwise arithmetic?\n",
"1. In `matmul`, why is `ac==br`?\n",
"1. In Jupyter Notebook, how do you measure the time taken for a single cell to execute?\n",
"1. What is \"elementwise arithmetic\"?\n",
"1. Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`.\n",
"1. What is a rank-0 tensor? How do you convert it to a plain Python data type?\n",
"1. What does this return, and why?: `tensor([1,2]) + tensor([1])`\n",
"1. What does this return, and why?: `tensor([1,2]) + tensor([1,2,3])`\n",
"1. How does elementwise arithmetic help us speed up matmul?\n",
"1. What does this return, and why? `tensor([1,2]) + tensor([1])`\n",
"1. What does this return, and why? `tensor([1,2]) + tensor([1,2,3])`\n",
"1. How does elementwise arithmetic help us speed up `matmul`?\n",
"1. What are the broadcasting rules?\n",
"1. What is `expand_as`? Show an example of how it can be used to match the results of broadcasting.\n",
"1. How does `unsqueeze` help us to solve certain broadcasting problems?\n",
"1. How can you use indexing to do the same operation as `unsqueeze`?\n",
"1. How can we use indexing to do the same operation as `unsqueeze`?\n",
"1. How do we show the actual contents of the memory used for a tensor?\n",
"1. When adding a vector of size 3 to a matrix of size 3 x 3, are the elements of the vector added to each row, or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)\n",
"1. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)\n",
"1. Do broadcasting and `expand_as` result in increased memory use? Why or why not?\n",
"1. Implement matmul using Einstein summation.\n",
"1. Implement `matmul` using Einstein summation.\n",
"1. What does a repeated index letter represent on the left-hand side of einsum?\n",
"1. What are the three rules of Einstein summation notation? Why?\n",
"1. What is the forward pass, and the backward pass, of a neural network?\n",
"1. What are the forward pass and backward pass of a neural network?\n",
"1. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?\n",
"1. What is the downside of having activations with a standard deviation too far away from one?\n",
"1. How can weight initialisation help avoid this problem?\n",
"1. What is the formula to initialise weights such that we get a standard deviation of one, for a plain linear layer; for a linear layer followed by ReLU?\n",
"1. What is the downside of having activations with a standard deviation too far away from 1?\n",
"1. How can weight initialization help avoid this problem?\n",
"1. What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU?\n",
"1. Why do we sometimes have to use the `squeeze` method in loss functions?\n",
"1. What does the argument to the squeeze method do? Why might it be important to include this argument, even though PyTorch does not require it?\n",
"1. What is the chain rule? Show the equation in either of the two forms shown in this chapter.\n",
"1. What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it?\n",
"1. What is the \"chain rule\"? Show the equation in either of the two forms presented in this chapter.\n",
"1. Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule.\n",
"1. What is the gradient of relu? Show in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.)\n",
"1. What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory—try to figure it using your knowledge of the shape of the function.)\n",
"1. In what order do we need to call the `*_grad` functions in the backward pass? Why?\n",
"1. What is `__call__`?\n",
"1. What methods do we need to implement when writing a `torch.autograd.Function`?\n",
"1. What methods must we implement when writing a `torch.autograd.Function`?\n",
"1. Write `nn.Linear` from scratch, and test it works.\n",
"1. What is the difference between `nn.Module` and fastai's `Module`?"
]
@ -1587,10 +1587,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Implement relu as a `torch.autograd.Function` and train a model with it.\n",
"1. If you are mathematically inclined, find out what the gradients of a linear layer are in maths notation. Map that to the implementation we saw in this chapter.\n",
"1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2d convolution function, and train a CNN that uses it.\n",
"1. Implement all what is in this chapter using numpy instead of PyTorch. "
"1. Implement ReLU as a `torch.autograd.Function` and train a model with it.\n",
"1. If you are mathematically inclined, find out what the gradients of a linear layer are in mathematical notation. Map that to the implementation we saw in this chapter.\n",
"1. Learn about the `unfold` method in PyTorch, and use it along with matrix multiplication to implement your own 2D convolution function. Then train a CNN that uses it.\n",
"1. Implement everything in this chapter using NumPy instead of PyTorch. "
]
},
{

File diff suppressed because one or more lines are too long

View File

@ -14,7 +14,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# fastai Learner from Scratch"
"# A fastai Learner from Scratch"
]
},
{
@ -1288,37 +1288,37 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For the questions here that ask you to explain what some function or class is, you should also complete your own code experiments."
"> tip: Experiments: For the questions here that ask you to explain what some function or class is, you should also complete your own code experiments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is glob?\n",
"1. What is `glob`?\n",
"1. How do you open an image with the Python imaging library?\n",
"1. What does L.map do?\n",
"1. What does Self do?\n",
"1. What is L.val2idx?\n",
"1. What methods do you need to implement to create your own Dataset?\n",
"1. What does `L.map` do?\n",
"1. What does `Self` do?\n",
"1. What is `L.val2idx`?\n",
"1. What methods do you need to implement to create your own `Dataset`?\n",
"1. Why do we call `convert` when we open an image from Imagenette?\n",
"1. What does `~` do? How is it useful for splitting training and validation sets?\n",
"1. Which of these classes does `~` work with: `L`, `Tensor`, numpy array, Python `list`, pandas `DataFrame`?\n",
"1. What is ProcessPoolExecutor?\n",
"1. Does `~` work with the `L` or `Tensor` classes? What about NumPy arrays, Python lists, or pandas DataFrames?\n",
"1. What is `ProcessPoolExecutor`?\n",
"1. How does `L.range(self.ds)` work?\n",
"1. What is `__iter__`?\n",
"1. What is `first`?\n",
"1. What is `permute`? Why is it needed?\n",
"1. What is a recursive function? How does it help us define the `parameters` method?\n",
"1. Write a recursive function which returns the first 20 items of the Fibonacci sequence.\n",
"1. Write a recursive function that returns the first 20 items of the Fibonacci sequence.\n",
"1. What is `super`?\n",
"1. Why do subclasses of Module need to override `forward` instead of defining `__call__`?\n",
"1. In `ConvLayer` why does `init` depend on `act`?\n",
"1. Why do subclasses of `Module` need to override `forward` instead of defining `__call__`?\n",
"1. In `ConvLayer`, why does `init` depend on `act`?\n",
"1. Why does `Sequential` need to call `register_modules`?\n",
"1. Write a hook that prints the shape of every layers activations.\n",
"1. What is LogSumExp?\n",
"1. Why is log_softmax useful?\n",
"1. What is GetAttr? How is it helpful for callbacks?\n",
"1. Write a hook that prints the shape of every layer's activations.\n",
"1. What is \"LogSumExp\"?\n",
"1. Why is `log_softmax` useful?\n",
"1. What is `GetAttr`? How is it helpful for callbacks?\n",
"1. Reimplement one of the callbacks in this chapter without inheriting from `Callback` or `GetAttr`.\n",
"1. What does `Learner.__call__` do?\n",
"1. What is `getattr`? (Note the case difference to `GetAttr`!)\n",
@ -1326,7 +1326,7 @@
"1. Why do we check for `model.training` in `one_batch`?\n",
"1. What is `store_attr`?\n",
"1. What is the purpose of `TrackResults.before_epoch`?\n",
"1. What does `model.cuda()` do? How does it work?\n",
"1. What does `model.cuda` do? How does it work?\n",
"1. Why do we need to check `model.training` in `LRFinder` and `OneCycle`?\n",
"1. Use cosine annealing in `OneCycle`."
]
@ -1342,15 +1342,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write `resnet18` from scratch (refer to <<chapter_resnet>> as needed), and train it with the Learner in this chapter.\n",
"1. Implement a batchnorm layer from scratch and use it in your resnet18.\n",
"1. Write a mixup callback for use in this chapter.\n",
"1. Add momentum to `SGD`.\n",
"1. Write `resnet18` from scratch (refer to <<chapter_resnet>> as needed), and train it with the `Learner` in this chapter.\n",
"1. Implement a batchnorm layer from scratch and use it in your `resnet18`.\n",
"1. Write a Mixup callback for use in this chapter.\n",
"1. Add momentum to SGD.\n",
"1. Pick a few features that you're interested in from fastai (or any other library) and implement them in this chapter.\n",
"1. Pick a research paper that's not yet implemented in fastai or PyTorch and implement it in this chapter.\n",
" - Port it over to fastai.\n",
" - Submit a PR to fastai, or create your own extension module and release it. \n",
" - Hint: you may find it helpful to use [nbdev](https://nbdev.fast.ai/) to create and deploy your package."
" - Submit a pull request to fastai, or create your own extension module and release it. \n",
" - Hint: you may find it helpful to use [`nbdev`](https://nbdev.fast.ai/) to create and deploy your package."
]
},
{

View File

@ -36,7 +36,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting up Your Homepage"
"### Setting Up Your Home Page"
]
},
{
@ -57,7 +57,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jupyter for Blogging"
"## Jupyter for Blogging"
]
},
{