Update
commit a482265f72 (parent db9c7e81b1)
08_collab.ipynb (225 lines changed)
@ -28,20 +28,20 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One very common problem to solve is when you have a number of users, and a number of products, you then want to recommend which products are most likely to be useful for which users. There are many variations of this, for example, recommending movies (such as on Netflix), figuring out what to highlight for a user on a homepage, deciding what stories to show in a social media feed, and so forth. There is a general solution to this problem, called *collaborative filtering*, which works like this: have a look at what products the current user has used or liked, find other users that have used or liked similar products, and then recommend the products that those other users have used or liked.\n",
|
||||
"One very common problem to solve is when you have a number of users and a number of products, and you want to recommend which products are most likely to be useful for which users. There are many variations of this: for example, recommending movies (such as on Netflix), figuring out what to highlight for a user on a home page, deciding what stories to show in a social media feed, and so forth. There is a general solution to this problem, called *collaborative filtering*, which works like this: look at what products the current user has used or liked, find other users that have used or liked similar products, and then recommend other products that those users have used or liked.\n",
|
||||
"\n",
|
||||
"For example, on Netflix you may have watched lots of movies that are science-fiction, full of action, and were made in the 1970s. Netflix may not know these particular properties of the films you have watched, but it would be able to see that other people that have watched the same movies that you watched also tended to watch other movies that are science-fiction, full of action, and were made in the 1970s. In other words, to use this approach we don't necessarily need to know anything about the movies, except who like to watch them.\n",
|
||||
"For example, on Netflix you may have watched lots of movies that are science fiction, full of action, and were made in the 1970s. Netflix may not know these particular properties of the films you have watched, but it will be able to see that other people that have watched the same movies that you watched also tended to watch other movies that are science fiction, full of action, and were made in the 1970s. In other words, to use this approach we don't necessarily need to know anything about the movies, except who like to watch them.\n",
|
||||
"\n",
|
||||
"There is actually a more general class of problems that this approach can solve; not necessarily just things involving users and products. Indeed, for collaborative filtering we more commonly refer to *items*, rather than *products*. Items could be links that you click on, diagnoses that are selected for patients, and so forth.\n",
|
||||
"There is actually a more general class of problems that this approach can solve, not necessarily involving users and products. Indeed, for collaborative filtering we more commonly refer to *items*, rather than *products*. Items could be links that people click on, diagnoses that are selected for patients, and so forth.\n",
|
||||
"\n",
|
||||
"The key foundational idea is that of *latent factors*. In the above Netflix example, we started with the assumption that you like old action sci-fi movies. But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to their movies table saying which movies are of these types. But there must be some underlying concept of sci-fi, action, and movie age. And these concepts must be relevant for at least some people's movie watching decisions."
|
||||
"The key foundational idea is that of *latent factors*. In the Netflix example, we started with the assumption that you like old, action-packed sci-fi movies. But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to its movies table saying which movies are of these types. Still, there must be some underlying concept of sci-fi, action, and movie age, and these concepts must be relevant for at least some people's movie watching decisions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"First, let's get some data suitable for a collaborative filtering model."
|
||||
"For this chapter we are going to work on this movie recommendation problem. We'll start by getting some data suitable for a collaborative filtering model."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -55,7 +55,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For this chapter we are going to work on this movie review problem. We do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset that we can use, called MovieLens. This dataset contains tens of millions of movie rankings (that is a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you're interested, it would be a great learning project to try and replicate this approach on the full 25 million recommendation dataset you can get from their website."
|
||||
"We do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset that we can use, called [MovieLens](https://grouplens.org/datasets/movielens/). This dataset contains tens of millions of movie rankings (a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you're interested, it would be a great learning project to try and replicate this approach on the full 25-million recommendation dataset, which you can get from their website."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -80,7 +80,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"According to the `README`, the main table is in the file `u.data`. It is tab-separated and the columns are respectively user, movie, rating and timestamp. Since those names are not encoded, we need to indicate them when reading the file with pandas. Here is a way to open this table and take a look:"
|
||||
"According to the *README*, the main table is in the file *u.data*. It is tab-separated and the columns are, respectively user, movie, rating, and timestamp. Since those names are not encoded, we need to indicate them when reading the file with Pandas. Here is a way to open this table and take a look:"
|
||||
]
|
||||
},
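As a hedged sketch of what that read looks like (assuming `path` points at the extracted MovieLens *ml-100k* folder, as set up earlier in the notebook), the call is roughly:

```python
import pandas as pd

# u.data has no header row, so we pass the column names explicitly
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
```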
|
||||
{
|
||||
@ -179,7 +179,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. <<movie_xtab>> shows the same data cross tabulated into a human friendly table."
|
||||
"Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. <<movie_xtab>> shows the same data cross-tabulated into a human-friendly table."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -193,9 +193,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We have selected just a few of the most popular movies, and users who watch the most movies, for this crosstab example. The empty cells in this table are the things that we would like our model to learn to fill in. Those other places where a user has not reviewed the movie yet, presumably because they have not watched. So for each user, we would like to figure out which of those movies they might be most likely to enjoy.\n",
|
||||
"We have selected just a few of the most popular movies, and users who watch the most movies, for this crosstab example. The empty cells in this table are the things that we would like our model to learn to fill in. Those are the places where a user has not reviewed the movie yet, presumably because they have not watched it. For each user, we would like to figure out which of those movies they might be most likely to enjoy.\n",
|
||||
"\n",
|
||||
"If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in this table would be to multiply this information together for each movie and use a combination. For instance, assuming these factors range between -1 and positive one, and positive means high match and negative means low match, and the categories are science-fiction, action, and old movies, then we could represent the movie The Last Skywalker as:"
|
||||
"If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in this table would be to multiply this information together for each movie and use a combination. For instance, assuming these factors range between -1 and +1, with positive numbers indicating stronger matches and negative numbers weaker ones, and the categories are science-fiction, action, and old movies, then we could represent the movie *The Last Skywalker* as:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -227,7 +227,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"…and we can now calculate the match between this combination:"
|
||||
"and we can now calculate the match between this combination:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -261,14 +261,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: dot product: the mathematical operation of multiplying the elements of two vectors together, and then summing up the result."
|
||||
"> jargon: dot product: The mathematical operation of multiplying the elements of two vectors together, and then summing up the result."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"On the other hand, we might represent the movie Casablanca as:"
|
||||
"On the other hand, we might represent the movie *Casablanca* as:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -284,7 +284,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"…and the match between this combination is:"
|
||||
"The match between this combination is:"
|
||||
]
|
||||
},
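To make the arithmetic concrete, here is a minimal sketch of both matches with made-up factor values (the exact numbers in the notebook's own cells may differ):

```python
import numpy as np

# factors: [science-fiction, action, old]; each ranges from -1 to +1
last_skywalker = np.array([0.98, 0.9, -0.9])    # very sci-fi, very action, not old
casablanca     = np.array([-0.99, -0.3, 0.8])   # not sci-fi, little action, old
user1          = np.array([0.9, 0.8, -0.6])     # likes modern action sci-fi

print((user1 * last_skywalker).sum())   # about 2.14 -- a strong match
print((user1 * casablanca).sum())       # about -1.61 -- a poor match
```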
|
||||
{
|
||||
@ -325,9 +325,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
|
||||
"There is surprisingly little difference between specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
|
||||
"\n",
|
||||
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialised values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
|
||||
"Step 1 of this approach is to randomly initialize some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors and each movie will have a set of these factors, we can show these randomly initialized values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -341,18 +341,18 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Step two of this approach is to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. If for instance the first latent user factor represents how much they like action movies, and the first latent movie factor represents if the movie has a lot of action or not, when the product of those will be particularly high if either the user likes action movie and the movie has a lot of action in it or if the user doesn't like action movie and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't, or the user doesn't like action movies and it is one), the product will be very low.\n",
|
||||
"Step 2 of this approach is to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. If, for instance, the first latent user factor represents how much the user likes action movies and the first latent movie factor represents if the movie has a lot of action or not, the product of those will be particularly high if either the user likes action movies and the movie has a lot of action in it or the user doesn't like action movies and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't an action film, or the user doesn't like action movies and it is one), the product will be very low.\n",
|
||||
"\n",
|
||||
"Step three is to calculate our loss. We can use any loss function that we wish; that's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n",
|
||||
"Step 3 is to calculate our loss. We can use any loss function that we wish; let's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n",
|
||||
"\n",
|
||||
"That's all we need. With this in place, we can optimise our parameters (that is, the latent factors) using stochastic gradient descent, such as to minimise the loss. At each step, the stochastic gradient descent optimiser will calculate the match between each movie and each user using the dot product, and will compare it to the actual rating that each user gave to each movie, and it will then calculate the derivative of this value, and will step the weights by multiplying this by the learning rate. After doing this lots of times, the loss will get better and better, and the recommendations will also get better and better."
|
||||
"That's all we need. With this in place, we can optimize our parameters (that is, the latent factors) using stochastic gradient descent, such as to minimize the loss. At each step, the stochastic gradient descent optimizer will calculate the match between each movie and each user using the dot product, and will compare it to the actual rating that each user gave to each movie. It will then calculate the derivative of this value and will step the weights by multiplying this by the learning rate. After doing this lots of times, the loss will get better and better, and the recommendations will also get better and better."
|
||||
]
|
||||
},
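Putting the three steps together as a rough PyTorch sketch (not the notebook's code; `n_users`, `n_movies`, and a `ratings_tensor` whose rows are `(user, movie, rating)` are assumed to already exist):

```python
import torch

n_factors = 5
# step 1: one randomly initialized latent-factor vector per user and per movie
user_factors  = torch.randn(n_users,  n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

lr = 1e-2
for _ in range(100):
    users   = ratings_tensor[:, 0].long()
    movies  = ratings_tensor[:, 1].long()
    targets = ratings_tensor[:, 2].float()
    # step 2: the prediction is the dot product of the matching factor vectors
    preds = (user_factors[users] * movie_factors[movies]).sum(dim=1)
    # step 3: mean squared error between predictions and the actual ratings
    loss = ((preds - targets) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in (user_factors, movie_factors):
            p -= lr * p.grad
            p.grad.zero_()
```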
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To use the usual `Learner` fit function, we will need to get our data into `DataLoaders`, so let's focus on that now."
|
||||
"To use the usual `Learner.fit` function we will need to get our data into a `DataLoaders`, so let's focus on that now."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -366,7 +366,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When showing the data we would rather see movie titles than their ids. The table `u.item` contains the correspondence id to title:"
|
||||
"When showing the data, we would rather see movie titles than their IDs. The table `u.item` contains the correspondence of IDs to titles:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -453,7 +453,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next we merge it to our ratings to get the titles."
|
||||
"We can merge this with our `ratings` table to get the user ratings by title:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -557,7 +557,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can then build a `DataLoaders` object from this table. By default, it takes the first column for user, the second column for the item (here our movies) and the third column for the ratings. We need to change the value of `item_name` in our case, to use the titles instead of the ids:"
|
||||
"We can then build a `DataLoaders` object from this table. By default, it takes the first column for the user, the second column for the item (here our movies), and the third column for the ratings. We need to change the value of `item_name` in our case to use the titles instead of the IDs:"
|
||||
]
|
||||
},
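A sketch of that call, assuming `ratings` is the merged table built above and the fastai collaborative filtering application has been imported:

```python
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
```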
|
||||
{
|
||||
@ -658,7 +658,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In order to represent collaborative filtering in PyTorch we can't just use the crosstab representation directly, especially if we want it to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:"
|
||||
"To represent collaborative filtering in PyTorch we can't just use the crosstab representation directly, especially if we want it to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -700,9 +700,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To calculate the result for a particular movie and use a combination we have two look up the index of the movie in our movie latent factors matrix, and the index of the user in our user latent factors matrix, and then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation which our deep learning models know how to do. They know how to do matrix products, and activation functions.\n",
|
||||
"To calculate the result for a particular movie and user combination, we have to look up the index of the movie in our movie latent factor matrix and the index of the user in our user latent factor matrix; then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation our deep learning models know how to do. They know how to do matrix products, and activation functions.\n",
|
||||
"\n",
|
||||
"It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. Here is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:"
|
||||
"Fortunately, it turns out that we can represent *look up in an index* as a matrix product. The trick is to replace our indices with one-hot-encoded vectors. Here is an example of what happens if we multiply a vector by a one-hot-encoded vector representing the index 3:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -714,26 +714,6 @@
|
||||
"one_hot_3 = one_hot(3, n_users).float()"
|
||||
]
|
||||
},
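The point of `one_hot_3` is that a matrix product with it picks out row 3 of the factor matrix; a sketch of the comparison the notebook makes next:

```python
# the matrix product with the one-hot vector...
a = user_factors.t() @ one_hot_3
# ...returns the same 5-element vector as plain integer indexing
b = user_factors[3]
```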
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"torch.Size([944, 5])"
|
||||
]
|
||||
},
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"user_factors.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@ -785,29 +765,29 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one hot encoded vector. This is called an *embedding*."
|
||||
"If we do that for a few indices at once, we will have a matrix of one-hot-encoded vectors, and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one-hot-encoded vector, or to search through it to find the occurrence of the number one—we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer that does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one-hot-encoded vector. This is called an *embedding*."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> jargon: embedding layer: multiplying by a one hot encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. It is quite a fancy word for a very simple concept. The thing that you multiply the one hot encoded matrix by (or, using the computational shortcut, index into directly) is called the _embedding matrix_."
|
||||
"> jargon: Embedding: Multiplying by a one-hot-encoded matrix, using the computational shortcut that it can be implemented by simply indexing directly. This is quite a fancy word for a very simple concept. The thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly) is called the _embedding matrix_."
|
||||
]
|
||||
},
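A minimal sketch of that equivalence in plain PyTorch (illustrative code, not a cell from the notebook):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # 10 rows of 4 factors
idx = torch.tensor([3])

a = emb(idx)                      # the indexing shortcut the layer uses
one_hot = torch.zeros(1, 10)
one_hot[0, 3] = 1.0
b = one_hot @ emb.weight          # the equivalent one-hot matrix product

print(torch.allclose(a, b))       # True
```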
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n",
|
||||
"In computer vision, we have a very easy way to get all the information of a pixel through its RGB values: each pixel in a colored image is represented by three numbers. Those three numbers give us the redness, the greenness and the blueness, which is enough to get our model to work afterward.\n",
|
||||
"\n",
|
||||
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one user might particularly like. \n",
|
||||
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There are probably relations with genres: if a given user likes romance, they are likely to give higher scores to romance movies. Other factors might be wether the movie is more action-oriented versus heavy on dialogue, or the presence of a specific actor that a user might particularly like. \n",
|
||||
"\n",
|
||||
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n",
|
||||
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, our model can figure out itself the features that seem important or not.\n",
|
||||
"\n",
|
||||
"This is what embeddings are. We will attribute to each of our users and each of our movie a random vector of a certain length (here `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rule of SGD (or another optimizer).\n",
|
||||
"This is what embeddings are. We will attribute to each of our users and each of our movies a random vector of a certain length (here, `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rules of SGD (or another optimizer).\n",
|
||||
"\n",
|
||||
"At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from independent cinema, action movies from romance...\n",
|
||||
"At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data about the relations between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from independent cinema, action movies from romance, and so on.\n",
|
||||
"\n",
|
||||
"We are now in a position that we can create our whole model from scratch."
|
||||
]
|
||||
@ -823,9 +803,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Before we can write a model in PyTorch, we first need to learn the basics of object-oriented programming and Python. If you haven't done any object oriented programming before, we will give you a quick introduction here, but we would recommend looking up a tutorial and doing some practice before moving on.\n",
|
||||
"Before we can write a model in PyTorch, we first need to learn the basics of object-oriented programming and Python. If you haven't done any object-oriented programming before, we will give you a quick introduction here, but we would recommend looking up a tutorial and getting some practice before moving on.\n",
|
||||
"\n",
|
||||
"The key idea in object-oriented programming is the *class*. We have been using classes throughout this book, such as DataLoader, string, and Learner. Python makes it easy for us to create new classes. Here is an example of a simple class:"
|
||||
"The key idea in object-oriented programming is the *class*. We have been using classes throughout this book, such as `DataLoader`, `string`, and `Learner`. Python also makes it easy for us to create new classes. Here is an example of a simple class:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -843,7 +823,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method as parameters. Note that the first parameter to any method defined inside a class is `self`, so you can use this to set and get any attributes that you will need."
|
||||
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behavior associated with this method name. In the case of `__init__`, this is the method Python will call when your new object is created. So, this is where you can set up any state that needs to be initialized upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method as parameters. Note that the first parameter to any method defined inside a class is `self`, so you can use this to set and get any attributes that you will need:"
|
||||
]
|
||||
},
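As a quick illustration (not necessarily the notebook's exact cell), a tiny class with an `__init__` and one ordinary method, plus how it is used:

```python
class Example:
    def __init__(self, a):
        self.a = a                      # state stored at creation time

    def say(self, x):
        return f'Hello {self.a}, {x}.'

ex = Example('Sylvain')
ex.say('nice to meet you')              # 'Hello Sylvain, nice to meet you.'
```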
|
||||
{
|
||||
@ -871,9 +851,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Also note that creating a new PyTorch module requires inheriting from Module. *Inheritance* is an important object-oriented concept which we will not discuss in detail here — in short, it means that we can add additional behaviour to an existing class. PyTorch already provides a Module class, which provides some basic foundations that we want to build on. So, we add the name of this *super class* after the name of the class that we are defining, as you see above.\n",
|
||||
"Also note that creating a new PyTorch module requires inheriting from `Module`. *Inheritance* is an important object-oriented concept that we will not discuss in detail here—in short, it means that we can add additional behavior to an existing class. PyTorch already provides a `Module` class, which provides some basic foundations that we want to build on. So, we add the name of this *superclass* after the name of the class that we are defining, as shown in the following example.\n",
|
||||
"\n",
|
||||
"The final thing that you need to know to create a new PyTorch module, is that when your module is called, PyTorch will call a method in your class called `forward`, and will pass along to that any parameters that are included in the call. Here is our dot product model:"
|
||||
"The final thing that you need to know to create a new PyTorch module is that when your module is called, PyTorch will call a method in your class called `forward`, and will pass along to that any parameters that are included in the call. Here is the class defining our dot product model:"
|
||||
]
|
||||
},
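A sketch of such a module, close to what the notebook defines (using fastai's `Module` and `Embedding`, so no explicit call to the superclass `__init__` is needed):

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        # x[:, 0] holds user ids, x[:, 1] holds movie ids
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        return (users * movies).sum(dim=1)
```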
|
||||
{
|
||||
@ -899,7 +879,7 @@
|
||||
"source": [
|
||||
"If you haven't seen object-oriented programming before, then don't worry, you won't need to use it much in this book. We are just mentioning this approach here, because most online tutorials and documentation will use the object-oriented syntax.\n",
|
||||
"\n",
|
||||
"Note that the input of the model is a tensor of shape `batch_size x 2`, where the first columns (`x[:, 0]`) contains the user ids and the second column (`x[:, 1]`) contains the movie ids. As explained before, we use the *embedding* layers to represent our matrices of user and movie latent factors."
|
||||
"Note that the input of the model is a tensor of shape `batch_size x 2`, where the first column (`x[:, 0]`) contains the user IDs and the second column (`x[:, 1]`) contains the movie IDs. As explained before, we use the *embedding* layers to represent our matrices of user and movie latent factors:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1014,7 +994,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The first thing we can do to make this model a little bit better is to force those predictions between 0 and 5. For this, we just need to use `sigmoid_range`, like in the previous chapter. One thing we discovered empirically is that it's better to have the range go a little bit over 5, so we use `(0, 5.5)`."
|
||||
"The first thing we can do to make this model a little bit better is to force those predictions to be between 0 and 5. For this, we just need to use `sigmoid_range`, like in <<chapter_mulitcat>>. One thing we discovered empirically is that it's better to have the range go a little bit over 5, so we use `(0, 5.5)`:"
|
||||
]
|
||||
},
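A sketch of the adjusted `forward`, assuming the same `DotProduct` structure as before and fastai's `sigmoid_range`:

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        # squash the raw dot product into the (0, 5.5) range with a scaled sigmoid
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
```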
|
||||
{
|
||||
@ -1104,9 +1084,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n",
|
||||
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say about a movie is, for instance, that it is very sci-fi, very action-oriented, and very not old, then you don't really have any way to say whether most people like it. \n",
|
||||
"\n",
|
||||
"That's because at this point we only have weights; we do not have biases. If we have a single number for each user which we add to our scores, and ditto for each movie, then this will handle this missing piece very nicely. So first of all, let's adjust our model architecture:"
|
||||
"That's because at this point we only have weights; we do not have biases. If we have a single number for each user that we can add to our scores, and ditto for each movie, that will handle this missing piece very nicely. So first of all, let's adjust our model architecture:"
|
||||
]
|
||||
},
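A sketch of the model with biases added, along the lines of the notebook's version (the `Embedding(n, 1)` layers hold one learned number per user and per movie):

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.user_bias     = Embedding(n_users,  1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        # the bias terms shift each score up or down independently of the factors
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
        return sigmoid_range(res, *self.y_range)
```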
|
||||
{
|
||||
@ -1207,7 +1187,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Instead of being better, it ends up being worse (at least at the end of training). Why is that? If we look at both trainings carefully, we can see the validation loss stopped improving in the middle and started to get worse. As we've seen, this is a clear indication of overfitting. In this case, there is no way to use data augmentation, so we will have to use another regularisation technique. One approach that can be helpful is *weight decay*."
|
||||
"Instead of being better, it ends up being worse (at least at the end of training). Why is that? If we look at both trainings carefully, we can see the validation loss stopped improving in the middle and started to get worse. As we've seen, this is a clear indication of overfitting. In this case, there is no way to use data augmentation, so we will have to use another regularization technique. One approach that can be helpful is *weight decay*."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1221,9 +1201,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n",
|
||||
"Weight decay, or *L2 regularization*, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n",
|
||||
"\n",
|
||||
"Why would it prevent overfitting? The idea is that the larger the coefficients are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is."
|
||||
"Why would it prevent overfitting? The idea is that the larger the coefficients are, the sharper canyons we will have in the loss function. If we take the basic example of a parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is (<<parabolas>>)."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1248,6 +1228,7 @@
|
||||
],
|
||||
"source": [
|
||||
"#hide_input\n",
|
||||
"#id parabolas\n",
|
||||
"x = np.linspace(-2,2,100)\n",
|
||||
"a_s = [1,2,5,10,50] \n",
|
||||
"ys = [a * x**2 for a in a_s]\n",
|
||||
@ -1261,21 +1242,21 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"So by letting our model learn high parameters, it might fit all the data points in the training set with an over-complex function that has very sharp changes, which will lead to overfitting.\n",
|
||||
"So, letting our model learn high parameters might cause it to fit all the data points in the training set with an overcomplex function that has very sharp changes, which will lead to overfitting.\n",
|
||||
"\n",
|
||||
"Limiting our weights from growing to much is going to hinder the training of the model, but it will yield to a state where it generalizes better. Going back to the theory a little bit, weight decay (or just `wd`) is a parameter that controls that sum of squares we add to our loss (assuming `parameters` is a tensor of all parameters):\n",
|
||||
"Limiting our weights from growing too much is going to hinder the training of the model, but it will yield a state where it generalizes better. Going back to the theory briefly, weight decay (or just `wd`) is a parameter that controls that sum of squares we add to our loss (assuming `parameters` is a tensor of all parameters):\n",
|
||||
"\n",
|
||||
"``` python\n",
|
||||
"loss_with_wd = loss + wd * (parameters**2).sum()\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"In practice though, it would be very inefficient (and maybe numerically unstable) to compute that big sum and add it to the loss. If you remember a little bit of high schoool math, you might recall that the derivative of `p**2` with respect to `p` is `2*p`, so adding that big sum to our loss is exactly the same as doing:\n",
|
||||
"In practice, though, it would be very inefficient (and maybe numerically unstable) to compute that big sum and add it to the loss. If you remember a little bit of high schoool math, you might recall that the derivative of `p**2` with respect to `p` is `2*p`, so adding that big sum to our loss is exactly the same as doing:\n",
|
||||
"\n",
|
||||
"``` python\n",
|
||||
"parameters.grad += wd * 2 * parameters\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"In practice, since `wd` is a parameter that we choose, we can just make it twice as big, so we don't even need the `*2` in the above equation. To use weight decay in fastai, just pass `wd` in your call to fit:"
|
||||
"In practice, since `wd` is a parameter that we choose, we can just make it twice as big, so we don't even need the `*2` in this equation. To use weight decay in fastai, just pass `wd` in your call to `fit` or `fit_one_cycle`:"
|
||||
]
|
||||
},
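For instance (a sketch, assuming the `DotProductBias` model and the `dls` built earlier):

```python
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```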
|
||||
{
|
||||
@ -1361,7 +1342,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"So far, we've used `Embedding` without thinking about how it really works. Let's recreate DotProductBias *without* using this class. We'll need a randomly initialized weight matrix for each of the embeddings. We have to be careful, however. Recall from <<chapter_mnist_basics>> that optimizers require that they can get all the parameters of a module from a module's `parameters()` method. However, this does not happen fully automatically. If we just add a tensor as an attribute to a `Module`, it will not be included in `parameters`:"
|
||||
"So far, we've used `Embedding` without thinking about how it really works. Let's re-create `DotProductBias` *without* using this class. We'll need a randomly initialized weight matrix for each of the embeddings. We have to be careful, however. Recall from <<chapter_mnist_basics>> that optimizers require that they can get all the parameters of a module from the module's `parameters` method. However, this does not happen fully automatically. If we just add a tensor as an attribute to a `Module`, it will not be included in `parameters`:"
|
||||
]
|
||||
},
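A sketch of that behavior, along the lines of the notebook's next cell: a plain tensor attribute is not registered, so `parameters` returns nothing for it:

```python
class T(Module):
    def __init__(self):
        self.a = torch.ones(3)   # a plain tensor attribute, not a Parameter

list(T().parameters())           # [] -- the tensor was not picked up
```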
|
||||
{
|
||||
@ -1391,7 +1372,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To tell `Module` that we want to treat a tensor as parameters, we have to wrap it in the `nn.Parameter` class. This class doesn't actually add any functionality (other than automatically calling `requires_grad_()` for us). It's only used as a \"marker\" to show what to include in `parameters()`:"
|
||||
"To tell `Module` that we want to treat a tensor as a parameter, we have to wrap it in the `nn.Parameter` class. This class doesn't actually add any functionality (other than automatically calling `requires_grad_` for us). It's only used as a \"marker\" to show what to include in `parameters`:"
|
||||
]
|
||||
},
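A sketch of the fix, plus a small helper along the lines of the one the notebook uses to create normally initialized parameters (the initialization values are illustrative):

```python
class T(Module):
    def __init__(self):
        self.a = nn.Parameter(torch.ones(3))   # now registered as a parameter

list(T().parameters())   # shows one parameter of size 3

def create_params(size):
    # a normally initialized tensor, wrapped so Module treats it as trainable
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
```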
|
||||
{
|
||||
@ -1522,7 +1503,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's train it again to check it's around the same results we saw in the previous section:"
|
||||
"Then let's train it again to check we get around the same results we saw in the previous section:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1594,7 +1575,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now, let's have a look at what our model has learned."
|
||||
"Now, let's take a look at what our model has learned."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1608,7 +1589,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Our model is already useful, in that it can provide us with recommendations for movies for our users — but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:"
|
||||
"Our model is already useful, in that it can provide us with movie recommendations for our users—but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1641,7 +1622,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Have a think about what this means. What this is saying is, that for these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth) they still generally don't like it. We could have simply sorted movies directly by the average rating, but looking at their learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people type do not like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:"
|
||||
"Think about what this means. What it's saying is that for each of these movies, even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth), they still generally don't like it. We could have simply sorted the movies directly by their average rating, but looking at the learned bias tells us something much more interesting. It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people tend not to like watching it even if it is of a kind that they would otherwise enjoy! By the same token, here are the movies with the highest bias:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1673,9 +1654,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"So, for instance, even if you don't normally enjoy detective movies, you might enjoy LA Confidential!\n",
|
||||
"So, for instance, even if you don't normally enjoy detective movies, you might enjoy *LA Confidential*!\n",
|
||||
"\n",
|
||||
"It is not quite so easy to directly interpret the embedding matrices. There is just too many factors for a human to look at. But there is a technique which can pull out the most important underlying *directions* in such a matrix, called *principal component analysis* (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course, Computational Linear Algebra for Coders. <<img_pca_movie>> shows what our movies look like based on two of the strongest PCA components."
|
||||
"It is not quite so easy to directly interpret the embedding matrices. There are just too many factors for a human to look at. But there is a technique that can pull out the most important underlying *directions* in such a matrix, called *principal component analysis* (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course [Computational Linear Algebra for Coders](https://github.com/fastai/numerical-linear-algebra). <<img_pca_movie>> shows what our movies look like based on two of the strongest PCA components."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1701,8 +1682,8 @@
|
||||
"source": [
|
||||
"#hide_input\n",
|
||||
"#id img_pca_movie\n",
|
||||
"#caption Representation of movies on two strongest PCA components\n",
|
||||
"#alt Representation of movies on two strongest PCA components\n",
|
||||
"#caption Representation of movies based on two strongest PCA components\n",
|
||||
"#alt Representation of movies based on two strongest PCA components\n",
|
||||
"g = ratings.groupby('title')['rating'].count()\n",
|
||||
"top_movies = g.sort_values(ascending=False).index.values[:1000]\n",
|
||||
"top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])\n",
|
||||
@ -1731,14 +1712,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> j: no matter how many models I train, I never stop getting moved and surprised by how these randomly initialised bunches of numbers, trained with such simple mechanics, managed to discover things about my data all by themselves. It almost seems like cheating, that I can create code which does useful things, without ever actually telling it how to do those things!"
|
||||
"> j: No matter how many models I train, I never stop getting moved and surprised by how these randomly initialized bunches of numbers, trained with such simple mechanics, manage to discover things about my data all by themselves. It almost seems like cheating, that I can create code that does useful things without ever actually telling it how to do those things!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We defined our model from scratch to teach you what is inside, but you can directly use the fastai library to build it."
|
||||
"We defined our model from scratch to teach you what is inside, but you can directly use the fastai library to build it. We'll look at how to do that next."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1752,7 +1733,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"fastai can create and train a collaborative filtering model using the exact structure shown above by using `collab_learner`:"
|
||||
"We can create and train a collaborative filtering model using the exact structure shown earlier by using fastai's `collab_learner`:"
|
||||
]
|
||||
},
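A sketch of that call (the arguments mirror the from-scratch model: 50 factors and a `y_range` that goes slightly above 5):

```python
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```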
|
||||
{
|
||||
@ -1831,7 +1812,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The names of the layers can be seen by printing the model"
|
||||
"The names of the layers can be seen by printing the model:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1863,7 +1844,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can use these to replicate any of the analyses we did in the previous section, for instance:"
|
||||
"We can use these to replicate any of the analyses we did in the previous section--for instance:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1896,7 +1877,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"An other interesting thing we can do with these learned embeddings is to look at _distance_."
|
||||
"Another interesting thing we can do with these learned embeddings is to look at _distance_."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1910,7 +1891,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
|
||||
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that *x* and *y* are the distances between the coordinates on each axis). For a 50-dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
|
||||
"\n",
|
||||
"If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
|
||||
]
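A sketch of that lookup, using cosine similarity between the learned movie embeddings; the attribute name `i_weight` assumes the `EmbeddingDotBias` model that `collab_learner` builds, and the title string has to match an entry of `dls.classes['title']` exactly:

```python
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]   # index 0 is the movie itself
dls.classes['title'][idx]
```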
|
||||
@ -1943,7 +1924,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now that we have succesfully trained a model, let's see how to deal when we have no data for a new user, to be able to make recommendations to them."
|
||||
"Now that we have succesfully trained a model, let's see how to deal with the situation where we have no data for a user. How can we make recommendations to new users?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1957,29 +1938,29 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The biggest challenge with using collaborative filtering models in practice is the *bootstrapping problem*. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What product do you recommend to your very first user?\n",
|
||||
"The biggest challenge with using collaborative filtering models in practice is the *bootstrapping problem*. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What products do you recommend to your very first user?\n",
|
||||
"\n",
|
||||
"But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of the form *use your common sense*. You can start your new users such that they have the mean of all of the embedding vectors of your other users — although this has the problem that that particular combination of latent factors may be not at all common (for instance the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n",
|
||||
"But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of *use your common sense*. You could assign new users the mean of all of the embedding vectors of your other users, but this has the problem that that particular combination of latent factors may be not at all common (for instance, the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n",
|
||||
"\n",
|
||||
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations."
|
||||
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them that could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup metadata. We will see in the next section how to create these kinds of tabular models. (You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One thing to be careful of is that a small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. This is a very common problem, for instance, in movie recommendation systems. People that watch anime tend to watch a whole lot of it, and don't watch very much else, and spend a lot of time putting their ratings into websites. As a result, a lot of *best ever movies* lists tend to be heavily overrepresented with anime. In this particular case, it can be fairly obvious that you have a problem of representation bias, but if the bias is occurring in the latent factors then it may not be obvious at all.\n",
|
||||
"One thing to be careful of is that a small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. This is a very common problem, for instance, in movie recommendation systems. People that watch anime tend to watch a whole lot of it, and don't watch very much else, and spend a lot of time putting their ratings on websites. As a result, anime tends to be heavily overrepresented in a lot of *best ever movies* lists. In this particular case, it can be fairly obvious that you have a problem of representation bias, but if the bias is occurring in the latent factors then it may not be obvious at all.\n",
|
||||
"\n",
|
||||
"Such a problem can change the entire make up of your user base, and the behaviour of your system. This is particularly true because of positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, then they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias. This is a natural tendency to be amplified exponentially. You may have seen examples of company executives expressing surprise at how their online platforms rapidly deteriorate in such a way that they express values that are at odds with the values of the founders. In the presence of these kinds of feedback loops, it is easy to see how such a divergence can happen both quickly, and in a way that is hidden until it is too late.\n",
|
||||
"Such a problem can change the entire makeup of your user base, and the behavior of your system. This is particularly true because of positive feedback loops. If a small number of your users tend to set the direction of your recommendation system, then they are naturally going to end up attracting more people like them to your system. And that will, of course, amplify the original representation bias. This type of bias has a natural tendency to be amplified exponentially. You may have seen examples of company executives expressing surprise at how their online platforms rapidly deteriorated in such a way that they expressed values at odds with the values of the founders. In the presence of these kinds of feedback loops, it is easy to see how such a divergence can happen both quickly and in a way that is hidden until it is too late.\n",
|
||||
"\n",
|
||||
"In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify upfront how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It's all about ensuring that there are humans in the loop, that there is careful monitoring, and gradual and thoughtful rollout."
|
||||
"In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify up front how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It's all about ensuring that there are humans in the loop; that there is careful monitoring, and a gradual and thoughtful rollout."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorisation* (PMF). Another approach, which generally works similarly well given the same data, is deep learning."
|
||||
"Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorization* (PMF). Another approach, which generally works similarly well given the same data, is deep learning."
|
||||
]
|
||||
},
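As a quick recap of what that model computes for a single (user, movie) pair, here is a tiny worked sketch with made-up factor values and biases; the `sigmoid_range(0, 5.5)` squashing is written out by hand.

```python
import torch

user_factors  = torch.tensor([0.9, -0.2, 0.4])  # how much this user likes each latent factor
movie_factors = torch.tensor([0.8,  0.1, 0.6])  # how strongly the movie expresses each factor
user_bias, movie_bias = 0.1, 0.3

raw  = (user_factors * movie_factors).sum() + user_bias + movie_bias
pred = torch.sigmoid(raw) * (5.5 - 0) + 0       # sigmoid_range(0, 5.5), written out
print(pred)                                     # predicted rating, roughly 4.4 here
```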
|
||||
{
|
||||
@ -1993,9 +1974,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To turn our architecture into a deep learning model the first step is to take the results of the embedding look up, and concatenating those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n",
|
||||
"To turn our architecture into a deep learning model, the first step is to take the results of the embedding lookup and concatenate those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n",
|
||||
"\n",
|
||||
"Since we'll be concatenating the embedding matrices, rather than taking their dot product, that means that the two embedding matrices can have different sizes (i.e. different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:"
|
||||
"Since we'll be concatenating the embedding matrices, rather than taking their dot product, the two embedding matrices can have different sizes (i.e., different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:"
|
||||
]
|
||||
},
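Here is a minimal sketch of that idea with made-up embedding sizes (assumptions for illustration, not what `get_emb_sz` would return): look up each embedding, concatenate the activations, and pass the result through ordinary linear layers and nonlinearities.

```python
import torch
from torch import nn

user_emb  = nn.Embedding(100, 74)   # 100 users, 74 latent factors...
movie_emb = nn.Embedding(50, 102)   # ...50 movies, 102 factors -- the sizes can differ
layers = nn.Sequential(nn.Linear(74 + 102, 100), nn.ReLU(), nn.Linear(100, 1))

users  = torch.tensor([0, 1, 2])
movies = torch.tensor([5, 6, 7])
x = torch.cat([user_emb(users), movie_emb(movies)], dim=1)  # shape (3, 176)
preds = layers(x)                                           # shape (3, 1)
```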
|
||||
{
|
||||
@ -2052,7 +2033,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"...and use it to create a model:"
|
||||
"And use it to create a model:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2068,7 +2049,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`CollabNN` creates our `Embedding` layers in the same way as previous classes in this chapter, except that we now use the `embs` sizes. Then `self.layers` is identical to the mini neural net we created in <<chapter_mnist_basics>> for MNIST. Then, in `forward`, we apply the embeddings, concatenate the results, and pass it through the mini neural net. Finally, we apply `sigmoid_range` as we have in previous models.\n",
|
||||
"`CollabNN` creates our `Embedding` layers in the same way as previous classes in this chapter, except that we now use the `embs` sizes. `self.layers` is identical to the mini-neural net we created in <<chapter_mnist_basics>> for MNIST. Then, in `forward`, we apply the embeddings, concatenate the results, and pass this through the mini-neural net. Finally, we apply `sigmoid_range` as we have in previous models.\n",
|
||||
"\n",
|
||||
"Let's see if it trains:"
|
||||
]
|
||||
@ -2141,7 +2122,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Fastai provides this model in fastai.collab, if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), plus lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:"
|
||||
"Fastai provides this model in `fastai.collab` if you pass `use_nn=True` in your call to `collab_learner` (including calling `get_emb_sz` for you), and it lets you easily create more layers. For instance, here we're creating two hidden layers, of size 100 and 50, respectively:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2231,25 +2212,25 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Wow that's not a lot of code! This class *inherits* from `TabularModel`, which is where it gets all its functionality from. In `__init__` is calls the same method in `TabularModel`, passing `n_cont=0` and `out_sz=1`; other than that, it only passes along whatever arguments it received."
|
||||
"Wow, that's not a lot of code! This class *inherits* from `TabularModel`, which is where it gets all its functionality from. In `__init__` it calls the same method in `TabularModel`, passing `n_cont=0` and `out_sz=1`; other than that, it only passes along whatever arguments it received."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Sidebar: Kwargs and Delegates"
|
||||
"### Sidebar: kwargs and Delegates"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`EmbeddingNN` includes `**kwargs` as a parameter to `__init__`. In python `**kwargs` in a parameter like means \"put any additional keyword arguments into a dict called `kwarg`. And `**kwargs` in an argument list means \"insert all key/value pairs in the `kwargs` dict as named arguments here\". This approach is used in many popular libraries, such as `matplotlib`, in which the main `plot` function simply has the signature `plot(*args, **kwargs)`. The [plot documentation](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) says \"*The `kwargs` are Line2D properties*\" and then lists those properties.\n",
|
||||
"`EmbeddingNN` includes `**kwargs` as a parameter to `__init__`. In Python `**kwargs` in a parameter list means \"put any additional keyword arguments into a dict called `kwargs`. And `**kwargs` in an argument list means \"insert all key/value pairs in the `kwargs` dict as named arguments here\". This approach is used in many popular libraries, such as `matplotlib`, in which the main `plot` function simply has the signature `plot(*args, **kwargs)`. The [`plot` documentation](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) says \"The `kwargs` are `Line2D` properties\" and then lists those properties.\n",
|
||||
"\n",
|
||||
"We're using `**kwargs` in `EmbeddingNN` to avoid having to write all the arguments to `TabularModel` a second time, and keep them in sync. However, this makes our API quite difficult to work with, because now Jupyter Notebook doesn't know what parameters are available, so things like tab-completion of parameter names and popup lists of signatures won't work.\n",
|
||||
"We're using `**kwargs` in `EmbeddingNN` to avoid having to write all the arguments to `TabularModel` a second time, and keep them in sync. However, this makes our API quite difficult to work with, because now Jupyter Notebook doesn't know what parameters are available. Consequently things like tab completion of parameter names and pop-up lists of signatures won't work.\n",
|
||||
"\n",
|
||||
"Fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature"
|
||||
"Fastai resolves this by providing a special `@delegates` decorator, which automatically changes the signature of the class or function (`EmbeddingNN` in this case) to insert all of its keyword arguments into the signature."
|
||||
]
|
||||
},
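Here is a small illustration of the pattern, based on our understanding of fastcore's `delegates`; the plotting functions below are made up for the example.

```python
from fastcore.meta import delegates

def base_plot(x, color='blue', linewidth=1, alpha=1.0):
    print(f"plotting {x} with color={color}, linewidth={linewidth}, alpha={alpha}")

@delegates(base_plot)                  # copy base_plot's keyword arguments into the signature
def fancy_plot(x, title='', **kwargs):
    print(f"title: {title}")
    base_plot(x, **kwargs)             # forward everything else unchanged

fancy_plot([1, 2, 3], title='demo', color='red')
# inspect.signature(fancy_plot) now lists color, linewidth, and alpha,
# so tab completion and signature pop-ups work again.
```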
|
||||
{
|
||||
@ -2263,7 +2244,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Although the results of `EmbeddingNN` are a bit worse than the dot product approach (which shows the power of carefully using an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, time, and other information that may be relevant to the recommendation. That's exactly what `TabularModel` does. In fact, we've now seen that `EmbeddingNN` is just a `TabularModel`, with `n_cont=0` and `out_sz=1`. So we better spend some time learning about `TabularModel`, and how to use it to get great results!"
|
||||
"Although the results of `EmbeddingNN` are a bit worse than the dot product approach (which shows the power of carefully constructing an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, date and time information, or any other information that may be relevant to the recommendation. That's exactly what `TabularModel` does. In fact, we've now seen that `EmbeddingNN` is just a `TabularModel`, with `n_cont=0` and `out_sz=1`. So, we'd better spend some time learning about `TabularModel`, and how to use it to get great results! We'll do that in the next chapter."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2277,7 +2258,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For our first non computer vision application, we looked at recommendation systems and saw how gradient descent can learn intrinsic factors or bias about items from a history of ratings. Those can then give us information about the data. \n",
|
||||
"For our first non-computer vision application, we looked at recommendation systems and saw how gradient descent can learn intrinsic factors or biases about items from a history of ratings. Those can then give us information about the data. \n",
|
||||
"\n",
|
||||
"We also built our first model in PyTorch. We will do a lot more of this in the next section of the book, but first, let's finish our dive into the other general applications of deep learning, continuing with tabular data."
|
||||
]
|
||||
@ -2297,33 +2278,33 @@
|
||||
"1. How does it solve it?\n",
|
||||
"1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?\n",
|
||||
"1. What does a crosstab representation of collaborative filtering data look like?\n",
|
||||
"1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)\n",
|
||||
"1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).\n",
|
||||
"1. What is a latent factor? Why is it \"latent\"?\n",
|
||||
"1. What is a dot product? Calculate a dot product manually using pure python with lists.\n",
|
||||
"1. What is a dot product? Calculate a dot product manually using pure Python with lists.\n",
|
||||
"1. What does `pandas.DataFrame.merge` do?\n",
|
||||
"1. What is an embedding matrix?\n",
|
||||
"1. What is the relationship between an embedding and a matrix of one-hot encoded vectors?\n",
|
||||
"1. Why do we need `Embedding` if we could use one-hot encoded vectors for the same thing?\n",
|
||||
"1. What does an embedding contain before we start training (assuming we're not using a prertained model)?\n",
|
||||
"1. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?\n",
|
||||
"1. Why do we need `Embedding` if we could use one-hot-encoded vectors for the same thing?\n",
|
||||
"1. What does an embedding contain before we start training (assuming we're not using a pretained model)?\n",
|
||||
"1. Create a class (without peeking, if possible!) and use it.\n",
|
||||
"1. What does `x[:,0]` return?\n",
|
||||
"1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it\n",
|
||||
"1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it.\n",
|
||||
"1. What is a good loss function to use for MovieLens? Why? \n",
|
||||
"1. What would happen if we used `CrossEntropy` loss with MovieLens? How would we need to change the model?\n",
|
||||
"1. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?\n",
|
||||
"1. What is the use of bias in a dot product model?\n",
|
||||
"1. What is another name for weight decay?\n",
|
||||
"1. Write the equation for weight decay (without peeking!)\n",
|
||||
"1. Write the equation for weight decay (without peeking!).\n",
|
||||
"1. Write the equation for the gradient of weight decay. Why does it help reduce weights?\n",
|
||||
"1. Why does reducing weights lead to better generalization?\n",
|
||||
"1. What does `argsort` do in PyTorch?\n",
|
||||
"1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?\n",
|
||||
"1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?\n",
|
||||
"1. How do you print the names and details of the layers in a model?\n",
|
||||
"1. What is the \"bootstrapping problem\" in collaborative filtering?\n",
|
||||
"1. How could you deal with the bootstrapping problem for new users? For new movies?\n",
|
||||
"1. How can feedback loops impact collaborative filtering systems?\n",
|
||||
"1. When using a neural network in collaborative filtering, why can we have different number of factors for movie and user?\n",
|
||||
"1. Why is there a `nn.Sequential` in the `CollabNN` model?\n",
|
||||
"1. What kind of model should be use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model?"
|
||||
"1. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?\n",
|
||||
"1. Why is there an `nn.Sequential` in the `CollabNN` model?\n",
|
||||
"1. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -2332,10 +2313,10 @@
|
||||
"source": [
|
||||
"### Further Research\n",
|
||||
"\n",
|
||||
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change, to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
|
||||
"1. Find three other areas where collaborative filtering is being used, and find out what pros and cons of this approach in those areas.\n",
|
||||
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book website and forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas)\n",
|
||||
"1. Create a model for MovieLens with works with CrossEntropy loss, and compare it to the model in this chapter."
|
||||
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
|
||||
"1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.\n",
|
||||
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas).\n",
|
||||
"1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
586
09_tabular.ipynb
586
09_tabular.ipynb
File diff suppressed because it is too large
Load Diff
@ -523,26 +523,6 @@
|
||||
"one_hot_3 = one_hot(3, n_users).float()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"torch.Size([944, 5])"
|
||||
]
|
||||
},
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"user_factors.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@ -1670,7 +1650,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Sidebar: Kwargs and Delegates"
|
||||
"### Sidebar: kwargs and Delegates"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1702,33 +1682,33 @@
|
||||
"1. How does it solve it?\n",
|
||||
"1. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?\n",
|
||||
"1. What does a crosstab representation of collaborative filtering data look like?\n",
|
||||
"1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)\n",
|
||||
"1. Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!).\n",
|
||||
"1. What is a latent factor? Why is it \"latent\"?\n",
|
||||
"1. What is a dot product? Calculate a dot product manually using pure python with lists.\n",
|
||||
"1. What is a dot product? Calculate a dot product manually using pure Python with lists.\n",
|
||||
"1. What does `pandas.DataFrame.merge` do?\n",
|
||||
"1. What is an embedding matrix?\n",
|
||||
"1. What is the relationship between an embedding and a matrix of one-hot encoded vectors?\n",
|
||||
"1. Why do we need `Embedding` if we could use one-hot encoded vectors for the same thing?\n",
|
||||
"1. What does an embedding contain before we start training (assuming we're not using a prertained model)?\n",
|
||||
"1. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?\n",
|
||||
"1. Why do we need `Embedding` if we could use one-hot-encoded vectors for the same thing?\n",
|
||||
"1. What does an embedding contain before we start training (assuming we're not using a pretained model)?\n",
|
||||
"1. Create a class (without peeking, if possible!) and use it.\n",
|
||||
"1. What does `x[:,0]` return?\n",
|
||||
"1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it\n",
|
||||
"1. Rewrite the `DotProduct` class (without peeking, if possible!) and train a model with it.\n",
|
||||
"1. What is a good loss function to use for MovieLens? Why? \n",
|
||||
"1. What would happen if we used `CrossEntropy` loss with MovieLens? How would we need to change the model?\n",
|
||||
"1. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?\n",
|
||||
"1. What is the use of bias in a dot product model?\n",
|
||||
"1. What is another name for weight decay?\n",
|
||||
"1. Write the equation for weight decay (without peeking!)\n",
|
||||
"1. Write the equation for weight decay (without peeking!).\n",
|
||||
"1. Write the equation for the gradient of weight decay. Why does it help reduce weights?\n",
|
||||
"1. Why does reducing weights lead to better generalization?\n",
|
||||
"1. What does `argsort` do in PyTorch?\n",
|
||||
"1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why / why not?\n",
|
||||
"1. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?\n",
|
||||
"1. How do you print the names and details of the layers in a model?\n",
|
||||
"1. What is the \"bootstrapping problem\" in collaborative filtering?\n",
|
||||
"1. How could you deal with the bootstrapping problem for new users? For new movies?\n",
|
||||
"1. How can feedback loops impact collaborative filtering systems?\n",
|
||||
"1. When using a neural network in collaborative filtering, why can we have different number of factors for movie and user?\n",
|
||||
"1. Why is there a `nn.Sequential` in the `CollabNN` model?\n",
|
||||
"1. What kind of model should be use if we want to add metadata about users and items, or information such as date and time, to a collaborative filter model?"
|
||||
"1. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?\n",
|
||||
"1. Why is there an `nn.Sequential` in the `CollabNN` model?\n",
|
||||
"1. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1737,10 +1717,10 @@
|
||||
"source": [
|
||||
"### Further Research\n",
|
||||
"\n",
|
||||
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change, to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
|
||||
"1. Find three other areas where collaborative filtering is being used, and find out what pros and cons of this approach in those areas.\n",
|
||||
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book website and forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas)\n",
|
||||
"1. Create a model for MovieLens with works with CrossEntropy loss, and compare it to the model in this chapter."
|
||||
"1. Take a look at all the differences between the `Embedding` version of `DotProductBias` and the `create_params` version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (NB: even the type of brackets used in `forward` has changed!)\n",
|
||||
"1. Find three other areas where collaborative filtering is being used, and find out what the pros and cons of this approach are in those areas.\n",
|
||||
"1. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forum for ideas. Note that there are more columns in the full dataset--see if you can use those too (the next chapter might give you ideas).\n",
|
||||
"1. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -954,6 +954,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#hide\n",
|
||||
"to = (path/'to.pkl').load()"
|
||||
]
|
||||
},
|
||||
@ -7779,7 +7780,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Finding out of Domain Data"
|
||||
"### Finding Out-of-Domain Data"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -8311,7 +8312,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Ensembling"
|
||||
"## Ensembling"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -8378,12 +8379,12 @@
|
||||
"source": [
|
||||
"1. What is a continuous variable?\n",
|
||||
"1. What is a categorical variable?\n",
|
||||
"1. Provide 2 of the words that are used for the possible values of a categorical variable.\n",
|
||||
"1. Provide two of the words that are used for the possible values of a categorical variable.\n",
|
||||
"1. What is a \"dense layer\"?\n",
|
||||
"1. How do entity embeddings reduce memory usage and speed up neural networks?\n",
|
||||
"1. What kind of datasets are entity embeddings especially useful for?\n",
|
||||
"1. What kinds of datasets are entity embeddings especially useful for?\n",
|
||||
"1. What are the two main families of machine learning algorithms?\n",
|
||||
"1. Why do some categorical columns need a special ordering in their classes? How do you do this in pandas?\n",
|
||||
"1. Why do some categorical columns need a special ordering in their classes? How do you do this in Pandas?\n",
|
||||
"1. Summarize what a decision tree algorithm does.\n",
|
||||
"1. Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?\n",
|
||||
"1. Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?\n",
|
||||
@ -8394,19 +8395,20 @@
|
||||
"1. What is bagging?\n",
|
||||
"1. What is the difference between `max_samples` and `max_features` when creating a random forest?\n",
|
||||
"1. If you increase `n_estimators` to a very high value, can that lead to overfitting? Why or why not?\n",
|
||||
"1. What is *out of bag error*?\n",
|
||||
"1. In the section \"Creating a Random Forest\", just after <<max_features>>, why did `preds.mean(0)` give the same result as our random forest?\n",
|
||||
"1. What is \"out-of-bag-error\"?\n",
|
||||
"1. Make a list of reasons why a model's validation set error might be worse than the OOB error. How could you test your hypotheses?\n",
|
||||
"1. How can you answer each of these things with a random forest? How do they work?:\n",
|
||||
" - How confident are we in our projections using a particular row of data?\n",
|
||||
"1. Explain why random forests are well suited to answering each of the following question:\n",
|
||||
" - How confident are we in our predictions using a particular row of data?\n",
|
||||
" - For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\n",
|
||||
" - Which columns are the strongest predictors?\n",
|
||||
" - How do predictions vary, as we vary these columns?\n",
|
||||
" - How do predictions vary as we vary these columns?\n",
|
||||
"1. What's the purpose of removing unimportant variables?\n",
|
||||
"1. What's a good type of plot for showing tree interpreter results?\n",
|
||||
"1. What is the *extrapolation problem*?\n",
|
||||
"1. How can you tell if your test or validation set is distributed in a different way to your training set?\n",
|
||||
"1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9000 distinct values?\n",
|
||||
"1. What is boosting?\n",
|
||||
"1. What is the \"extrapolation problem\"?\n",
|
||||
"1. How can you tell if your test or validation set is distributed in a different way than your training set?\n",
|
||||
"1. Why do we make `saleElapsed` a continuous variable, even although it has less than 9,000 distinct values?\n",
|
||||
"1. What is \"boosting\"?\n",
|
||||
"1. How could we use embeddings with a random forest? Would we expect this to help?\n",
|
||||
"1. Why might we not always use a neural net for tabular modeling?"
|
||||
]
|
||||
@ -8422,8 +8424,8 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare yourself to the private leaderboard.\n",
|
||||
"1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on this dataset.\n",
|
||||
"1. Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare your results to the private leaderboard.\n",
|
||||
"1. Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the datase you used in the first exercise.\n",
|
||||
"1. Use the embeddings from the neural net in this chapter in a random forest, and see if you can improve on the random forest results we saw.\n",
|
||||
"1. Explain what each line of the source of `TabularModel` does (with the exception of the `BatchNorm1d` and `Dropout` layers)."
|
||||
]
|
||||
|