Follow-up

Commit 36fbb9384f (parent a909a8cb5b)
@ -139,7 +139,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Just like in computer vision, computers are very good at categorising both short and long documents based on categories such as spam, sentiment (e.g. is the review positive or negative), author, source website, and so forth. We are not aware of any rigourous work done in this area to compare to human performance, but anecdotally it seems to us that deep learning performance is similar to human performance here. Deep learning is also very good at generating context-appropriate text, such as generating replies to social media posts, and imitating a particular author's style. It is also good at making this content compelling to humans, and has been shown to be even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information, along with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content which appears to a layman to be compelling, but actually is entirely incorrect.\n",
|
"Just like in computer vision, computers are very good at categorising both short and long documents based on categories such as spam, sentiment (e.g. is the review positive or negative), author, source website, and so forth. We are not aware of any rigorous work done in this area to compare to human performance, but anecdotally it seems to us that deep learning performance is similar to human performance here. Deep learning is also very good at generating context-appropriate text, such as generating replies to social media posts, and imitating a particular author's style. It is also good at making this content compelling to humans, and has been shown to be even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information, along with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content which appears to a layman to be compelling, but actually is entirely incorrect.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Another concern is that context-appropriate, highly compelling responses on social media can be used at massive scale — thousands of times greater than any troll farm previously seen — to spread disinformation, create unrest, and encourage conflict. As a rule of thumb, text generation will always be technologically a bit ahead of the ability of models to recognize automatically generated text. For instance, it is possible to use a model that can recognize artificially generated content to actually improve the generator that creates that content, until the classification model is no longer able to complete its task.\n",
|
"Another concern is that context-appropriate, highly compelling responses on social media can be used at massive scale — thousands of times greater than any troll farm previously seen — to spread disinformation, create unrest, and encourage conflict. As a rule of thumb, text generation will always be technologically a bit ahead of the ability of models to recognize automatically generated text. For instance, it is possible to use a model that can recognize artificially generated content to actually improve the generator that creates that content, until the classification model is no longer able to complete its task.\n",
|
||||||
"\n",
|
"\n",
|
||||||
@ -274,7 +274,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"> important: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use one particular provider, _Bing Image Search_, using the service they have as this book as written. We'll be providing more options and more up to date information on the http://book.fast.ai[book website], so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
|
"> important: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use one particular provider, _Bing Image Search_, using the service they have as this book was written. We'll be providing more options and more up to date information on the http://book.fast.ai[book website], so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -804,7 +804,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"All of these approaches seem somewhat wasteful, or problematic. If we squished or stretch the images then they end up unrealistic shapes, leading to a model that learns that things look different to how they actually are, which we would expect to result in lower accuracy. If we crop the images then we remove some of the features that allow us to recognize them. For instance, if we were trying to recognise the breed of dog or cat, we may end up cropping out a key part of the body or the face necessary to distinguish between similar breeds. If we pad the images then we have a whole lot of empty space, which is just wasted computation for our model, and results in a lower effective resolution for the part of the image we actually use.\n",
|
"All of these approaches seem somewhat wasteful, or problematic. If we squished or stretched the images then they end up as unrealistic shapes, leading to a model that learns that things look different to how they actually are, which we would expect to result in lower accuracy. If we crop the images then we remove some of the features that allow us to recognize them. For instance, if we were trying to recognise the breed of dog or cat, we may end up cropping out a key part of the body or the face necessary to distinguish between similar breeds. If we pad the images then we have a whole lot of empty space, which is just wasted computation for our model, and results in a lower effective resolution for the part of the image we actually use.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Instead, what we normally do in practice is to randomly select part of the image, and crop to just that part. On each epoch (which is one complete pass through all of our images in the dataset) we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world; different photos of the same thing may be framed in slightly different ways.\n",
|
"Instead, what we normally do in practice is to randomly select part of the image, and crop to just that part. On each epoch (which is one complete pass through all of our images in the dataset) we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world; different photos of the same thing may be framed in slightly different ways.\n",
|
||||||
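As a hedged illustration of this random-cropping strategy, here is a minimal fastai sketch; the `bears` DataBlock name and the `path` variable are assumptions borrowed from the book's bear-classifier example, not part of this commit.

```python
from fastai.vision.all import *

# Sketch: assume `path` points at a folder of labelled images (one sub-folder per class).
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    # RandomResizedCrop picks a different random part of each image on every epoch,
    # rather than squishing, padding, or always taking the centre.
    item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = bears.dataloaders(path)
```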
"\n",
|
"\n",
|
||||||
@ -855,7 +855,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms to a batch, we use the `batch_tfms` parameter. (Note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
|
"Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms on a batch, we use the `batch_tfms` parameter. (Note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
|
||||||
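A hedged sketch of applying these augmentations at the batch level, reusing the assumed `bears` DataBlock and `path` from the sketch above; `unique=True` shows the same image several times so the augmentations are easy to compare.

```python
from fastai.vision.all import *

# Sketch: resize every item to a common size, then augment whole batches on the GPU.
bears = bears.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)  # one image, many augmented versions
```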
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
@ -196,7 +196,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"Of course, the project managers and engineers and technicians involved were just living their ordinary lives. Caring for their families, going to the church on Sunday, doing their jobs as best as they could. Following orders. The marketers were just doing what they could to meet their business development goals. Edwin Black, author of \"IBM and the Holocaust\", said: \"To the blind technocrat, the means were more important than the ends. The destruction of the Jewish people became even less important because the invigorating nature of IBM's technical achievement was only heightened by the fantastical profits to be made at a time when bread lines stretched across the world.\"\n",
|
"Of course, the project managers and engineers and technicians involved were just living their ordinary lives. Caring for their families, going to the church on Sunday, doing their jobs as best as they could. Following orders. The marketers were just doing what they could to meet their business development goals. Edwin Black, author of \"IBM and the Holocaust\", said: \"To the blind technocrat, the means were more important than the ends. The destruction of the Jewish people became even less important because the invigorating nature of IBM's technical achievement was only heightened by the fantastical profits to be made at a time when bread lines stretched across the world.\"\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Step back for a moment and consider: how would you feel if you discovered that you had been part of a system that ending up hurting society? Would you even know? Would you be open to finding out? How can you help make sure this doesn't happen? We have described the most extreme situation here in Nazi Germany, but there are many negative societal consequences happening due to AI and machine learning right now, some of which we'll describe in this chapter.\n",
|
"Step back for a moment and consider: how would you feel if you discovered that you had been part of a system that ended up hurting society? Would you even know? Would you be open to finding out? How can you help make sure this doesn't happen? We have described the most extreme situation here in Nazi Germany, but there are many negative societal consequences happening due to AI and machine learning right now, some of which we'll describe in this chapter.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"It's not just a moral burden either. Sometimes, technologists pay very directly for their actions. For instance, the first person who was jailed as a result of the Volkswagen scandal, where the car company cheated on their diesel emissions tests, was not the manager that oversaw the project, or an executive at the helm of the company. It was one of the engineers, James Liang, who just did what he was told.\n",
|
"It's not just a moral burden either. Sometimes, technologists pay very directly for their actions. For instance, the first person who was jailed as a result of the Volkswagen scandal, where the car company cheated on their diesel emissions tests, was not the manager that oversaw the project, or an executive at the helm of the company. It was one of the engineers, James Liang, who just did what he was told.\n",
|
||||||
"\n",
|
"\n",
|
||||||
@ -204,7 +204,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"Okay, so hopefully we have convinced you that you ought to care. But what should you do? As data scientists, we're naturally inclined to focus on making our model better at optimizing some metric. But optimizing that metric may not actually lead to better outcomes. And even if optimizing that metric *does* help create better outcomes, it almost certainly won't be the only thing that matters. Consider the pipeline of steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. This entire pipeline needs to be considered *as a whole* if we're to have a hope of getting the kinds of outcomes we want.\n",
|
"Okay, so hopefully we have convinced you that you ought to care. But what should you do? As data scientists, we're naturally inclined to focus on making our model better at optimizing some metric. But optimizing that metric may not actually lead to better outcomes. And even if optimizing that metric *does* help create better outcomes, it almost certainly won't be the only thing that matters. Consider the pipeline of steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. This entire pipeline needs to be considered *as a whole* if we're to have a hope of getting the kinds of outcomes we want.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to inform everyone involved in this chain about the capabilities, constraints, and details of your work than you are. Although there's no \"silver bullet\" that can ensure your work is used the right way, by getting involved in the process, and asking the right questions, you can at the very least ensured that the right issues are being considered.\n",
|
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to inform everyone involved in this chain about the capabilities, constraints, and details of your work than you are. Although there's no \"silver bullet\" that can ensure your work is used the right way, by getting involved in the process, and asking the right questions, you can at the very least ensure that the right issues are being considered.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Sometimes, the right response to being asked to do a piece of work is to just say \"no\". Often, however, the response we hear is \"if I don’t do it, someone else will\". But consider this: if you’ve been picked for the job, you’re the best person they’ve found; so if you don’t do it, the best person isn’t working on that project. If the first 5 they ask all say no too, then even better!"
|
"Sometimes, the right response to being asked to do a piece of work is to just say \"no\". Often, however, the response we hear is \"if I don’t do it, someone else will\". But consider this: if you’ve been picked for the job, you’re the best person they’ve found; so if you don’t do it, the best person isn’t working on that project. If the first 5 they ask all say no too, then even better!"
|
||||||
]
|
]
|
||||||
@ -700,7 +700,7 @@
|
|||||||
"- support good policy\n",
|
"- support good policy\n",
|
||||||
"- increase diversity\n",
|
"- increase diversity\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Let's walk through each step next, staring with analyzing a project you are working on."
|
"Let's walk through each step next, starting with analyzing a project you are working on."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -726,7 +726,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"These questions may be able to help you identify outstanding issues, and possible alternatives that are easier to understand and control. In addition to asking the right questions, it's also important to consider practices and processes to implement.\n",
|
"These questions may be able to help you identify outstanding issues, and possible alternatives that are easier to understand and control. In addition to asking the right questions, it's also important to consider practices and processes to implement.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"One thing to consider at this stage is what data you are collecting and storing. Data often ends up being used for different purposes than why it was originally collected. For instance, IBM began selling to Nazi Germany well before the Holocaust, including helping with Germany’s 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
|
"One thing to consider at this stage is what data you are collecting and storing. Data often ends up being used for different purposes than why it was originally collected for. For instance, IBM began selling to Nazi Germany well before the Holocaust, including helping with Germany’s 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -1034,10 +1034,25 @@
|
|||||||
}
|
}
|
||||||
],
|
],
|
||||||
"metadata": {
|
"metadata": {
|
||||||
|
"jupytext": {
|
||||||
|
"split_at_heading": true
|
||||||
|
},
|
||||||
"kernelspec": {
|
"kernelspec": {
|
||||||
"display_name": "Python 3",
|
"display_name": "Python 3",
|
||||||
"language": "python",
|
"language": "python",
|
||||||
"name": "python3"
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.7.4"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
|
@ -41,7 +41,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"First, let's get some data suitable for a collaboratie filtering model."
|
"First, let's get some data suitable for a collaborative filtering model."
|
||||||
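For context, a hedged sketch of one way to obtain such data, using the MovieLens 100k sample that fastai can download; the column names follow the usual MovieLens layout and are not dictated by this commit.

```python
from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

# Download the MovieLens 100k ratings and load the user/movie/rating triples.
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
```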
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -327,7 +327,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
|
"There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialise values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
|
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialised values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
|
||||||
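A small sketch with toy sizes, just to make the randomly initialised factors and the dot product concrete; all numbers here are made up.

```python
import torch

n_users, n_movies, n_factors = 10, 8, 5   # toy sizes; 5 factors as in the text

# Randomly initialised latent factors: one vector per user and one per movie.
user_factors  = torch.randn(n_users,  n_factors)
movie_factors = torch.randn(n_movies, n_factors)

# The value in each cell of the crosstab is the dot product of the two vectors.
u, m = 3, 2
pred = (user_factors[u] * movie_factors[m]).sum()
print(pred)
```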
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -366,7 +366,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"When showing the data we would rather see movie titles than their ids. The table `u.item` contains the coorespondance id to title:"
|
"When showing the data we would rather see movie titles than their ids. The table `u.item` contains the correspondence id to title:"
|
||||||
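A hedged sketch of reading that table with pandas, assuming `path` is the MovieLens 100k folder downloaded earlier; the delimiter, encoding, and columns follow the standard MovieLens 100k layout.

```python
import pandas as pd

movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0, 1), names=('movie', 'title'), header=None)
movies.head()
```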
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -681,7 +681,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"To calculate the result for a particular movie and use a combination we have two look up the index of the movie in our movie latent factors matrix, and the index of the user in our user latent factors matrix, and then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation which our deep learning models know how to do. They know how to do matrix products, and activation functions.\n",
|
"To calculate the result for a particular movie and use a combination we have two look up the index of the movie in our movie latent factors matrix, and the index of the user in our user latent factors matrix, and then we can do our dot product between the two latent factor vectors. But *look up in an index* is not an operation which our deep learning models know how to do. They know how to do matrix products, and activation functions.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. He is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:"
|
"It turns out that we can represent *look up in an index* as a matrix product! The trick is to replace our indices with one hot encoded vectors. Here is an example of what happens if we multiply a vector by a one hot encoded vector representing the index three:"
|
||||||
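A short, self-contained sketch of that equivalence, with made-up sizes:

```python
import torch
import torch.nn.functional as F

n_users, n_factors = 10, 5
user_factors = torch.randn(n_users, n_factors)

# A one-hot vector for index 3...
one_hot_3 = F.one_hot(torch.tensor(3), num_classes=n_users).float()

# ...multiplied by the factor matrix picks out exactly row 3.
via_matmul = one_hot_3 @ user_factors
via_index  = user_factors[3]
assert torch.allclose(via_matmul, via_index)
```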
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -736,7 +736,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had of done a matrix multiplication with a one hot encoded vector. This is called an *embedding*."
|
"If we do that for a few indices at once, we will have a matrix of one-hot encoded vectors and that operation will be a matrix multiplication! This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary. We know that there is no real underlying reason to store the one hot encoded vector, or to search through it to find the occurrence of the number one — we should just be able to index into an array directly with an integer. Therefore, most deep learning libraries, including PyTorch, include a special layer which does just this; it indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one hot encoded vector. This is called an *embedding*."
|
||||||
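A minimal sketch of such a layer using PyTorch's `nn.Embedding`; the sizes are arbitrary.

```python
import torch
from torch import nn

# An embedding layer is an indexable table of learnable vectors.
emb = nn.Embedding(num_embeddings=10, embedding_dim=5)
idx = torch.tensor([3, 7])
vectors = emb(idx)          # shape (2, 5): rows 3 and 7 of the weight matrix
print(vectors.shape)
# Gradients flow back to exactly the rows that were looked up, as if a one-hot
# matrix multiplication had been performed.
```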
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -752,7 +752,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n",
|
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one use might particularly like. \n",
|
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one user might particularly like. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n",
|
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n",
|
||||||
"\n",
|
"\n",
|
||||||
@ -794,7 +794,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method is parameters. Note that the first parameter to any methods defined inside a class is `self`, so you can use this to set and get any attributes that you will need."
|
"The most important piece of this is the special method called `__init__` (pronounced *dunder init*). In Python, any method surrounded in double underscores like this is considered special. It indicates that there is some extra behaviour associated with this method name. In the case of `__init__`, this is the method which Python will call when your new object is created. So, this is where you can set up any state which needs to be done upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the `__init__` method as parameters. Note that the first parameter to any method defined inside a class is `self`, so you can use this to set and get any attributes that you will need."
|
||||||
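A hypothetical toy class, just to make the `__init__` and `self` mechanics concrete; the class and attribute names are made up.

```python
class ExampleModel:
    def __init__(self, n_factors):
        # __init__ runs automatically when ExampleModel(...) is evaluated;
        # attributes set on `self` here persist on the new instance.
        self.n_factors = n_factors

    def describe(self):
        return f"model with {self.n_factors} latent factors"

m = ExampleModel(5)    # the argument 5 is passed straight to __init__ as n_factors
print(m.describe())    # -> model with 5 latent factors
```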
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -1055,7 +1055,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations and others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n",
|
"This is a reasonable start, but we can do better. One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation we do not have any way to encode either of these things. If all you can say, for instance, about the movie is that it is very sci-fi, very action oriented, and very not old, then you don't really have any way to say most people like it. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"That's because at this point we only have weights; we do not have biases. If we have a single number for each user which we add to our scores, and ditto for each movie, then this will handle this missing piece very nicely. So first of all, let's adjust our model architecture:"
|
"That's because at this point we only have weights; we do not have biases. If we have a single number for each user which we add to our scores, and ditto for each movie, then this will handle this missing piece very nicely. So first of all, let's adjust our model architecture:"
|
||||||
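A hedged plain-PyTorch sketch of the bias idea described above; the class name, toy sizes, and the sigmoid squashing into a rating range are illustrative assumptions, not the commit's own code.

```python
import torch
from torch import nn

class DotProductWithBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors  = nn.Embedding(n_users,  n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)
        self.user_bias  = nn.Embedding(n_users,  1)    # one number per user
        self.movie_bias = nn.Embedding(n_movies, 1)    # one number per movie
        self.y_range = y_range

    def forward(self, x):                              # x: (batch, 2) of (user, movie) ids
        users, movies = x[:, 0], x[:, 1]
        res = (self.user_factors(users) * self.movie_factors(movies)).sum(dim=1, keepdim=True)
        res += self.user_bias(users) + self.movie_bias(movies)
        lo, hi = self.y_range
        return torch.sigmoid(res) * (hi - lo) + lo     # squash into the rating range

model = DotProductWithBias(n_users=100, n_movies=200)  # toy sizes
preds = model(torch.tensor([[3, 7], [12, 42]]))
```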
]
|
]
|
||||||
@ -1174,7 +1174,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n",
|
"Weight decay, or L2 regularization, consists in adding to your loss function the sum of all the weights squared. Why do that? Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Why would it prevent overfitting? The idea is that the larger the coefficient are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is."
|
"Why would it prevent overfitting? The idea is that the larger the coefficients are, the more sharp canyons we will have in the loss function. If we take the basic example of parabola, `y = a * (x**2)`, the larger `a` is, the more *narrow* the parabola is."
|
||||||
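A tiny self-contained illustration of that equivalence, with an arbitrary toy loss and an assumed weight-decay coefficient:

```python
import torch

wd = 0.1                                   # weight-decay coefficient (assumed value)
p = torch.randn(3, requires_grad=True)     # a parameter tensor
base_loss = (p ** 3).sum()                 # stand-in for any loss computed from p

# Adding the sum of squared weights to the loss...
loss = base_loss + wd * (p ** 2).sum()
loss.backward()

# ...contributes 2 * wd * p to the gradient, which is what optimisers do directly
# when you pass them a weight_decay argument.
with torch.no_grad():
    manual = 3 * p ** 2 + 2 * wd * p       # analytic gradient of `loss`
print(torch.allclose(p.grad, manual))      # True
```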
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -1863,7 +1863,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
|
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"If there were two movies that were nearly identical, then there embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
|
"If there were two movies that were nearly identical, then their embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
|
||||||
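A minimal sketch of that distance computation on a made-up embedding matrix; the index standing in for *Silence of the Lambs* is hypothetical.

```python
import torch

n_movies, n_factors = 100, 50
movie_factors = torch.randn(n_movies, n_factors)   # pretend these were learned

target = movie_factors[17]                         # hypothetical index for the query movie
dists = ((movie_factors - target) ** 2).sum(dim=1).sqrt()  # Euclidean distance to every movie
dists[17] = float('inf')                           # exclude the movie itself
print(dists.argmin())                              # index of the most similar movie
```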
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -1894,7 +1894,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Now that we have succesfully trained a model, let's see how to deal whwn we have no data for a new user, to be able to make recommandations to them."
|
"Now that we have succesfully trained a model, let's see how to deal when we have no data for a new user, to be able to make recommendations to them."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -1912,7 +1912,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of the form *use your common sense*. You can start your new users such that they have the mean of all of the embedding vectors of your other users — although this has the problem that that particular combination of latent factors may be not at all common (for instance the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n",
|
"But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of the form *use your common sense*. You can start your new users such that they have the mean of all of the embedding vectors of your other users — although this has the problem that that particular combination of latent factors may be not at all common (for instance the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent *average taste*.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix that they tend to ask you a few questions about what genres of movie or music that you like; this is how they come up with your initial collaborative filtering recommendations."
|
"Better still is to use a tabular model based on user meta data to construct your initial embedding vector. When a user signs up, think about what questions you could ask them which could help you to understand their tastes. Then you can create a model where the dependent variable is a user's embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup meta data. We will learn in the next section how to create these kinds of tabular models. You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations."
|
||||||
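As a rough illustration of the "average taste" starting point, a sketch with made-up factor matrices:

```python
import torch

user_factors  = torch.randn(100, 50)   # pretend these embeddings were learned
movie_factors = torch.randn(200, 50)

new_user = user_factors.mean(dim=0)    # bootstrap a brand-new user with the average taste
scores = movie_factors @ new_user      # provisional score for every movie
print(scores.topk(5).indices)          # initial recommendations for the new user
```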
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
@ -44,7 +44,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 words written out in English."
|
"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 numbers written out in English."
|
||||||
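For reference, a hedged sketch of loading this dataset with fastai, assuming the `URLs.HUMAN_NUMBERS` constant as used elsewhere in the library:

```python
from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines[:3]   # the numbers written out in English, one per line
```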
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -717,7 +717,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"Looking at the code for our RNN, one thing that seems problematic is that we are initialising our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order those samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. \n",
|
"Looking at the code for our RNN, one thing that seems problematic is that we are initialising our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order those samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"Another thing we can look at is havin more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
|
"Another thing we can look at is having more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
|
||||||
"\n",
|
"\n",
|
||||||
"We'll see how we can implement those changes, starting with adding some state."
|
"We'll see how we can implement those changes, starting with adding some state."
|
||||||
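A hedged sketch of those two changes, keeping the hidden state across calls and predicting after every token, written in plain PyTorch with made-up names and sizes:

```python
import torch
from torch import nn

class StatefulRNNSketch(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)   # input to hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)      # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)      # hidden to output
        self.h = torch.zeros(n_hidden)                # state persists across calls

    def forward(self, x):                             # x: (seq_len,) of token ids
        outs = []
        for t in range(x.shape[0]):
            self.h = torch.relu(self.h_h(self.h + self.i_h(x[t])))
            outs.append(self.h_o(self.h))             # predict after *every* token
        self.h = self.h.detach()                      # keep the value, drop the history
        return torch.stack(outs)

m = StatefulRNNSketch(vocab_sz=30, n_hidden=64)
print(m(torch.randint(0, 30, (16,))).shape)           # torch.Size([16, 30])
```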
]
|
]
|
||||||
@ -1572,7 +1572,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"$$\\tanh(x) = \\frac{e^{x} + e^{-x}}{e^{x}-e^{-x}} = 2 \\sigma(2x) - 1$$\n",
|
"$$\\tanh(x) = \\frac{e^{x} + e^{-x}}{e^{x}-e^{-x}} = 2 \\sigma(2x) - 1$$\n",
|
||||||
"\n",
|
"\n",
|
||||||
"where $\\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the left, ready for our next input. The new hidden state is also use as output, which is why the arrow splits to go up.\n",
|
"where $\\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the right, ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram, but before this, notice how very little the cell state (on the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
|
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram, but before this, notice how very little the cell state (on the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
|
||||||
"\n",
|
"\n",
|
||||||
@ -1580,7 +1580,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"The first gate (looking from the left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will have scalars between 0 and 1. We multiply this result by the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the ability to the LSTM to forget things about its longterm state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
|
"The first gate (looking from the left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will have scalars between 0 and 1. We multiply this result by the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the ability to the LSTM to forget things about its longterm state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it jsut decides which element of the cell state to update (valeus close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
|
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it just decides which element of the cell state to update (values close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"The last gate is the *output gate*. It will decides which information take in the cell state to generate the output. The cell state goes through a tanh before this and the output gate combined with the sigmoid decides which values to take inside it.\n",
|
"The last gate is the *output gate*. It will decides which information take in the cell state to generate the output. The cell state goes through a tanh before this and the output gate combined with the sigmoid decides which values to take inside it.\n",
|
||||||
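To tie the four gates together, here is a from-scratch sketch of an LSTM cell following the description above; it is not a copy of PyTorch's built-in `nn.LSTM`, and the class name is made up.

```python
import torch
from torch import nn

class LSTMCellSketch(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, x, state):
        h, c = state
        hx = torch.cat([h, x], dim=1)
        f = torch.sigmoid(self.forget_gate(hx))   # what to erase from the cell state
        i = torch.sigmoid(self.input_gate(hx))    # which cells to update
        g = torch.tanh(self.cell_gate(hx))        # candidate values in (-1, 1)
        o = torch.sigmoid(self.output_gate(hx))   # what to expose as the output
        c = f * c + i * g                         # forget, then add the new information
        h = o * torch.tanh(c)                     # new hidden state doubles as the output
        return h, (h, c)

cell = LSTMCellSketch(ni=10, nh=20)
h = c = torch.zeros(1, 20)
out, (h, c) = cell(torch.randn(1, 10), (h, c))
```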
"\n",
|
"\n",
|
||||||
@ -1894,7 +1894,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTMs (or GRUs) cell make training easier than vanilla RNNs, but there are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases, it requires another model to generate random augmentation (by translating in another language and back to the language used for instance). Overall, data augmentation for text data is currently not a well explored space.\n",
|
"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTMs (or GRUs) cells make training easier than vanilla RNNs, but there are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases, it requires another model to generate random augmentation (by translating in another language and back to the language used for instance). Overall, data augmentation for text data is currently not a well explored space.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182). This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state of the art results that previously required much more complicated models. They called an LSTM using these techniques an *AWD LSTM*. We'll look at each of these techniques in turn."
|
"However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182). This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state of the art results that previously required much more complicated models. They called an LSTM using these techniques an *AWD LSTM*. We'll look at each of these techniques in turn."
|
||||||
]
|
]
|
||||||
@ -2090,7 +2090,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We can the train the model, and add additional regularization by increasing the weight decay to `0.1`:"
|
"We can then train the model, and add additional regularization by increasing the weight decay to `0.1`:"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -2257,7 +2257,7 @@
|
|||||||
"- weight dropout (applied to the weights of the LSTM at each training step)\n",
|
"- weight dropout (applied to the weights of the LSTM at each training step)\n",
|
||||||
"- hidden dropout (applied to the hidden state between two layers)\n",
|
"- hidden dropout (applied to the hidden state between two layers)\n",
|
||||||
"\n",
|
"\n",
|
||||||
"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, so we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
|
"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
|
||||||
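A hedged sketch of how the `drop_mult` knob is used in practice, assuming a language-model `DataLoaders` named `dls_lm` was built earlier; the particular values are illustrative, not prescribed by this commit.

```python
from fastai.text.all import *

# drop_mult scales all of the AWD-LSTM's dropout probabilities at once.
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-2, wd=0.1)   # wd=0.1 adds the extra weight decay mentioned above
```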
"\n",
|
"\n",
|
||||||
"Another architecture that is very powerful, especially in \"sequence to sequence\" problems (that is, problems where the dependent variable is itself a variable length sequence, such as language translation), is the Transformers architecture. You can find it in an online bonus chapter on the book website."
|
"Another architecture that is very powerful, especially in \"sequence to sequence\" problems (that is, problems where the dependent variable is itself a variable length sequence, such as language translation), is the Transformers architecture. You can find it in an online bonus chapter on the book website."
|
||||||
]
|
]
|
||||||
@ -2274,7 +2274,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
|
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
|
||||||
"1. Why do we concatenating the documents in our dataset before creating a language model?\n",
|
"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
|
||||||
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
|
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
|
||||||
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
|
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
|
||||||
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
|
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
|
||||||
@ -2347,6 +2347,18 @@
|
|||||||
"display_name": "Python 3",
|
"display_name": "Python 3",
|
||||||
"language": "python",
|
"language": "python",
|
||||||
"name": "python3"
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.7.4"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
|
@ -304,10 +304,25 @@
|
|||||||
}
|
}
|
||||||
],
|
],
|
||||||
"metadata": {
|
"metadata": {
|
||||||
|
"jupytext": {
|
||||||
|
"split_at_heading": true
|
||||||
|
},
|
||||||
"kernelspec": {
|
"kernelspec": {
|
||||||
"display_name": "Python 3",
|
"display_name": "Python 3",
|
||||||
"language": "python",
|
"language": "python",
|
||||||
"name": "python3"
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.7.4"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
|
@ -1553,7 +1553,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
|
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
|
||||||
"1. Why do we concatenating the documents in our dataset before creating a language model?\n",
|
"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
|
||||||
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
|
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
|
||||||
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
|
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
|
||||||
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
|
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
|
||||||
@ -1626,6 +1626,18 @@
|
|||||||
"display_name": "Python 3",
|
"display_name": "Python 3",
|
||||||
"language": "python",
|
"language": "python",
|
||||||
"name": "python3"
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.7.4"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
|