This commit is contained in:
Sylvain Gugger 2020-03-17 12:15:55 -07:00
parent b2f1c12d4c
commit a32f57183a
26 changed files with 287 additions and 459 deletions

View File

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -67,7 +67,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Deep learning has power, flexibility, and simplicity. That's why we believe it should be applied across many disciplines. These include the social and physical sciences, the arts, medicine, finance, scientific research, and much more. To give a personal example, despite having no background in medicine, Jeremy started Enlitic, a company that uses deep learning algorithms to diagnose illness and disease. Within months of starting the company, it was announced that their algorithm could identify malignent tumors [more accurately than radiologists](https://www.nytimes.com/2016/02/29/technology/the-promise-of-artificial-intelligence-unfolds-in-small-steps.html).\n",
"Deep learning has power, flexibility, and simplicity. That's why we believe it should be applied across many disciplines. These include the social and physical sciences, the arts, medicine, finance, scientific research, and much more. To give a personal example, despite having no background in medicine, Jeremy started Enlitic, a company that uses deep learning algorithms to diagnose illness and disease. Within months of starting the company, it was announced that their algorithm could identify malignant tumors [more accurately than radiologists](https://www.nytimes.com/2016/02/29/technology/the-promise-of-artificial-intelligence-unfolds-in-small-steps.html).\n",
"\n",
"Here's a list of some of the thousands of tasks where deep learning, or methods heavily using deep learning, is now the best in the world:\n",
"\n",
@ -208,7 +208,7 @@
"source": [
"Since we are going to be spending a lot of time together, let's get to know each other a bit… We are Sylvain and Jeremy, your guides on this journey. We hope that you will find us well suited for this position.\n",
"\n",
"Jeremy has been using and teaching machine learning for around 30 years. He started using neural networks 25 years ago. During this time he has led many companies and projects which have machine learning at their core, including founding the first company to focus on deep learning and medicine, Enlitic, and taking on the role of President and Chief Scientist of the world's largest machine learning community, Kaggle. He is the co-founder, along with Dr Rachel Thomas, of fast.ai, the organisation that built the course this book is based on.\n",
"Jeremy has been using and teaching machine learning for around 30 years. He started using neural networks 25 years ago. During this time, he has led many companies and projects which have machine learning at their core, including founding the first company to focus on deep learning and medicine, Enlitic, and taking on the role of President and Chief Scientist of the world's largest machine learning community, Kaggle. He is the co-founder, along with Dr Rachel Thomas, of fast.ai, the organisation that built the course this book is based on.\n",
"\n",
"From time to time you will hear directly from us, in sidebars like this one from Jeremy:"
]
@ -517,7 +517,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -636,7 +636,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -645,7 +645,7 @@
"2"
]
},
"execution_count": 3,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -663,7 +663,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -673,7 +673,7 @@
"<PIL.Image.Image image mode=RGB size=151x192 at 0x7F67678F7E90>"
]
},
"execution_count": 4,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -701,7 +701,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -741,7 +741,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {
"hide_input": true
},
@ -754,7 +754,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -805,7 +805,7 @@
"source": [
"Your classifier is a deep learning model. As was already mentioned, deep learning models use neural networks, which originally date from the 1950s and have become powerful very recently thanks to recent advancements.\n",
"\n",
"Another key piece of context is that deep learning is just a modern area in the more general discipline of *machine learning*. To understand the essence of what you did when you trained your own classication model, you don't need to understand deep learning. It is enough to see how your model and your training process are examples of the concepts that apply to machine learning in general.\n",
"Another key piece of context is that deep learning is just a modern area in the more general discipline of *machine learning*. To understand the essence of what you did when you trained your own classification model, you don't need to understand deep learning. It is enough to see how your model and your training process are examples of the concepts that apply to machine learning in general.\n",
"\n",
"So in this section, we will describe what machine learning is. We will introduce the key concepts, and see how they can be traced back to the original essay that introduced the concept.\n",
"\n",
@ -1286,7 +1286,7 @@
"\n",
"What about the next piece, an *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*? Determining \"actual performance\" is easy enough: we can simply define our model's performance as its accuracy at predicting the correct answers.\n",
"\n",
"Putting this all together, and assuming that SGD is our mechanism for updating the weight assginments, we can see how our image classifier is a machine learning model, much like Samuel envisioned."
"Putting this all together, and assuming that SGD is our mechanism for updating the weight assignments, we can see how our image classifier is a machine learning model, much like Samuel envisioned."
]
},
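To make Samuel's framing concrete, here is a minimal sketch of one SGD weight update in PyTorch. This is an illustration of the general mechanism only, not the book's code; the shapes and learning rate are made up:

```python
import torch

# A "weight assignment", marked so PyTorch can compute gradients for it.
weights = torch.randn(3, requires_grad=True)
inputs, targets = torch.randn(10, 3), torch.randn(10)

preds = inputs @ weights                # results of the current assignment
loss = ((preds - targets) ** 2).mean()  # automatic measure of performance
loss.backward()                         # how should each weight change?

lr = 0.1
with torch.no_grad():
    weights -= lr * weights.grad        # the mechanism for updating the weights
    weights.grad.zero_()
```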
{
@ -1568,7 +1568,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Overfitting is the single most important and challenging issue** when training for all machine learning practitioners, and all algorithms. As we will see, it is very easy to create a model that does a great job at making predictions on the exact data which it has been trained on, but it is much harder to make predictions on data that it has never seen before. And of course this is the data that will actually matter in practice. For instance, if you create a hand-written digit classifier (as we will very soon!) and use it to recognise numbers written on cheques, then you are never going to see any of the numbers that the model was trained on -- every cheque will have slightly different variations of writing to deal with. We will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occurring (i.e. you have actually observed the validation accuracy getting worse during training). We often see practitioners using over-fitting avoidance techniques even when they have enough data that they didn't need to do so, ending up with a model that could be less accurate than what they could have achieved."
"**Overfitting is the single most important and challenging issue** when training for all machine learning practitioners, and all algorithms. As we will see, it is very easy to create a model that does a great job at making predictions on the exact data which it has been trained on, but it is much harder to make predictions on data that it has never seen before. And of course, this is the data that will actually matter in practice. For instance, if you create a hand-written digit classifier (as we will very soon!) and use it to recognise numbers written on cheques, then you are never going to see any of the numbers that the model was trained on -- every cheque will have slightly different variations of writing to deal with. We will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occurring (i.e. you have actually observed the validation accuracy getting worse during training). We often see practitioners using over-fitting avoidance techniques even when they have enough data that they didn't need to do so, ending up with a model that could be less accurate than what they could have achieved."
]
},
{
@ -1594,7 +1594,7 @@
"\n",
"What is a metric? A *metric* is a function that measures quality of the model's predictions using the validation set, and will be printed at the end of each *epoch*. In this case, we're using `error_rate`, which is a function provided by fastai which does just what it says: tells you what percentage of images in the validation set are being classified incorrectly. Another common metric for classification is `accuracy` (which is just `1.0 - error_rate`). fastai provides many more, which will be discussed throughout this book.\n",
"\n",
"The concept of a metric may remind you of loss, but there is an important distinction. The entire purpose of loss was to define a \"measure of performance\" that the training system could use to update weights automatically. In other words, a good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption. So a good metric is one that is easy for you to understand, and that hews as closely as possible to what you want the model to do. At times, you might decide that the loss function is a suitable metirc, but that is not necessarily the case."
"The concept of a metric may remind you of loss, but there is an important distinction. The entire purpose of loss was to define a \"measure of performance\" that the training system could use to update weights automatically. In other words, a good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption. So a good metric is one that is easy for you to understand, and that hews as closely as possible to what you want the model to do. At times, you might decide that the loss function is a suitable metric, but that is not necessarily the case."
]
},
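As a rough illustration of how these two metrics relate, here is a minimal sketch (not fastai's actual implementation, which handles more cases; the tensors are made-up examples):

```python
import torch

def error_rate(preds, targets):
    "Fraction of predictions that don't match the targets."
    return (preds.argmax(dim=1) != targets).float().mean()

def accuracy(preds, targets):
    return 1.0 - error_rate(preds, targets)

preds = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
targets = torch.tensor([0, 1, 1])
print(error_rate(preds, targets))  # tensor(0.3333): one of three is wrong
```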
{
@ -1833,7 +1833,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We just covered a lot of information so let's recap breifly. <<dljargon>> provides a handy list.\n",
"We just covered a lot of information so let's recap briefly. <<dljargon>> provides a handy list.\n",
"\n",
"```asciidoc\n",
"[[dljargon]]\n",
@ -2062,7 +2062,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
@ -2502,7 +2504,7 @@
"source": [
"This model is predicting movie ratings on a scale of 0.5 to 5.0 to within around 0.6 average error. Since we're predicting a continuous number, rather than a category, we have to tell fastai what range our target has, using the `y_range` parameter.\n",
"\n",
"Although we're not actually using a pretrained model (for the same reason that we didn't for the tabular model), this example shows that fastai let's us use `fine_tune` even in this case (we'll learn how and why this works later in <<chapter_pet_breeds>>). Sometimes it's best to experiment with `fine_tune` versus `fit_one_cycle` to see which works best for your dataset.\n",
"Although we're not actually using a pretrained model (for the same reason that we didn't for the tabular model), this example shows that fastai lets us use `fine_tune` even in this case (we'll learn how and why this works later in <<chapter_pet_breeds>>). Sometimes it's best to experiment with `fine_tune` versus `fit_one_cycle` to see which works best for your dataset.\n",
"\n",
"We can use the same `show_results` call we saw earlier to view a few examples of user and movie IDs, actual ratings, and predictions:"
]
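For reference, a minimal sketch of this collaborative-filtering setup in fastai might look like the following, assuming the MovieLens sample data used in this chapter:

```python
from fastai.collab import *

path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
learn = collab_learner(dls, y_range=(0.5, 5.5))  # tell fastai the target range
learn.fine_tune(10)
learn.show_results()  # user and movie IDs, actual ratings, and predictions
```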

View File

@ -69,14 +69,14 @@
"\n",
"If you take this approach, then you will be on your third iteration of learning and improving whilst the perfectionists are still in the planning stages!\n",
"\n",
"We also suggest that you iterate from end to end in your project; that is, don't spend months fine tuning your model, or polishing the perfect GUI, or labelling the perfect dataset… Instead, complete every step as well as you can in a reasonable amount of time, all the way to the end. For instance, if your final goal is an application that runs on a mobile phone, then that should be what you have after each iteration. But perhaps in the early iterations you take some shortcuts, for instance by doing all of the processing on a remote server, and using a simple responsive web application. By completing the project and to end, you will see where the most tricky bits are, and which bits make the biggest difference to the final result."
"We also suggest that you iterate from end to end in your project; that is, don't spend months fine tuning your model, or polishing the perfect GUI, or labelling the perfect dataset… Instead, complete every step as well as you can in a reasonable amount of time, all the way to the end. For instance, if your final goal is an application that runs on a mobile phone, then that should be what you have after each iteration. But perhaps in the early iterations you take some shortcuts, for instance by doing all of the processing on a remote server, and using a simple responsive web application. By completing the project end to end, you will see where the most tricky bits are, and which bits make the biggest difference to the final result."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you work through this book, we suggest that you both complete lots of small experiments, by running and adjusting the notebooks we provide, at the same time that you gradually develop your own projects. That way, you will be getting experience with all of the tools and techniques that were explaining, as we discuss them.\n",
"As you work through this book, we suggest that you both complete lots of small experiments, by running and adjusting the notebooks we provide, at the same time that you gradually develop your own projects. That way, you will be getting experience with all of the tools and techniques that we're explaining, as we discuss them.\n",
"\n",
"> s: To make the most of this book, take the time to experiment between each chapter, be it on your own project or exploring the notebooks we provide. Then try re-writing those notebooks from scratch on a new dataset. It's only by practicing (and failing) a lot that you will get an intuition on how to train a model. \n",
"\n",
@ -91,9 +91,9 @@
"source": [
"Since it is easiest to get started on a project where you already have data available, that means it's probably easiest to get started on a project related to something you are already doing, because you already have data about things that you are doing. For instance, if you work in the music business, you may have access to many recordings. If you work as a radiologist, you probably have access to lots of medical images. If you are interested in wildlife preservation, you may have access to lots of images of wildlife.\n",
"\n",
"Sometimes, you have to get a bit creative. Maybe you can find some previous machine learning project, such as a Kaggle competition, that is related to your field of interest. Sometimes, you have to compromize. Maybe you can't find the exact data you need for the precise project you have in mind; but you might be able to find something from a similar domain, or measured in a different way, tackling a slightly different problem. Working on these kinds of similar projects will still give you a good understanding of the overall process, and may help you identify other shortcuts, data sources, and so forth.\n",
"Sometimes, you have to get a bit creative. Maybe you can find some previous machine learning project, such as a Kaggle competition, that is related to your field of interest. Sometimes, you have to compromise. Maybe you can't find the exact data you need for the precise project you have in mind; but you might be able to find something from a similar domain, or measured in a different way, tackling a slightly different problem. Working on these kinds of similar projects will still give you a good understanding of the overall process, and may help you identify other shortcuts, data sources, and so forth.\n",
"\n",
"Especially when you are just starting out with deep learning it's not a good idea to branch out into very different areas to places that deep learning has not been applied to before. That's because if your model does not work at first, you will not know whether it is because you have made a mistake, or if the very problem you are trying to solve is simply not solvable with deep learning. And you won't know where to look to get help. Therefore, it is best at first to start with something where you can find an example online of somebody who has had good results with something that is at least somewhat similar to what you are trying to achieve, or where you can convert your data into a format similar what someone else has used before (such as creating an image from your data). Let's have a look at the state of deep learning, jsut so you know what kinds of things deep learning is good at right now."
"Especially when you are just starting out with deep learning it's not a good idea to branch out into very different areas to places that deep learning has not been applied to before. That's because if your model does not work at first, you will not know whether it is because you have made a mistake, or if the very problem you are trying to solve is simply not solvable with deep learning. And you won't know where to look to get help. Therefore, it is best at first to start with something where you can find an example online of somebody who has had good results with something that is at least somewhat similar to what you are trying to achieve, or where you can convert your data into a format similar what someone else has used before (such as creating an image from your data). Let's have a look at the state of deep learning, just so you know what kinds of things deep learning is good at right now."
]
},
{
@ -107,7 +107,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by considering whether deep learning can be any good at the problem you are looking to work on. In general, here is a summary of the state of deep learning is at the start of 2020. However, things move very fast, and by the time you read this some of these constraints may no longer exist. We will try to keep the book website up-to-date; in addition, a Google search for \"what can AI do now\" there is likely to provide some up-to-date information."
"Let's start by considering whether deep learning can be any good at the problem you are looking to work on. In general, here is a summary of the state of deep learning at the start of 2020. However, things move very fast, and by the time you read this some of these constraints may no longer exist. We will try to keep the book website up-to-date; in addition, a Google search for \"what can AI do now\" there is likely to provide some up-to-date information."
]
},
{
@ -159,7 +159,7 @@
"source": [
"The ability of deep learning to combine text and images into a single model is, generally, far better than most people intuitively expect. For example, a deep learning model can be trained on input images, and output captions written in English, and can learn to generate surprisingly appropriate captions automatically for new images! But again, we have the same warning that we discussed in the previous section: there is no guarantee that these captions will actually be correct.\n",
"\n",
"Because of this serious issue we generally recommend that deep learning be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely. This can potentially make humans orders of magnitude more productive than they would be with entirely manual methods, and actually result in more accurate processes than using a human alone. For instance, an automatic system can be used to identify potential strokes directly from CT scans, send a high priority alert to have potential/scans looked at quickly. There is only a three-hour window to treat strokes, so this fast feedback loop could save lives. At the same time, however, all scans could continue to be sent to radiologists in the usual way, so there would be no reduction in human input. Other deep learning models could automatically measure items seen on the scan, and insert those measurements into reports, warning the radiologist about findings that they may have missed, and tell the radiologist about other cases which might be relevant."
"Because of this serious issue we generally recommend that deep learning be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely. This can potentially make humans orders of magnitude more productive than they would be with entirely manual methods, and actually result in more accurate processes than using a human alone. For instance, an automatic system can be used to identify potential strokes directly from CT scans, and send a high priority alert to have those scans looked at quickly. There is only a three-hour window to treat strokes, so this fast feedback loop could save lives. At the same time, however, all scans could continue to be sent to radiologists in the usual way, so there would be no reduction in human input. Other deep learning models could automatically measure items seen on the scan, and insert those measurements into reports, warning the radiologist about findings that they may have missed, and tell the radiologist about other cases which might be relevant."
]
},
{
@ -173,7 +173,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For analysing timeseries and tabular data, deep learning has recently been making great strides. However, deep learning is generally used as part of an ensemble of multiple types of model. If you already have a system that is using random forests or gradient boosting machines (popular tabular modelling tools that we will learn about soon) then switching to, or adding, deep learning may not result in any dramatic improvement. Deep learning does greatly increase the variety of columns that you can include, for example columns containing natural language (e.g. book titles, reviews, etc), and *high cardinality categorical* columns (i.e. something that contains a large number of discrete choices, such as zip code or product id). On the downside, deep learning models generally take longer to train than random forests or gradient boosting machines, although this is changing thanks to libraries such as [RAPIDS](https://rapids.ai/), which provides GPU acceleration for the whole modeling pipeline. We cover the pros and cons of all these methods in detail in <<chapter_tabular>> in this book."
"For analysing timeseries and tabular data, deep learning has recently been making great strides. However, deep learning is generally used as part of an ensemble of multiple types of model. If you already have a system that is using random forests or gradient boosting machines (popular tabular modelling tools that we will learn about soon) then switching to, or adding, deep learning may not result in any dramatic improvement. Deep learning does greatly increase the variety of columns that you can include, for example columns containing natural language (e.g. book titles, reviews, etc.), and *high cardinality categorical* columns (i.e. something that contains a large number of discrete choices, such as zip code or product id). On the downside, deep learning models generally take longer to train than random forests or gradient boosting machines, although this is changing thanks to libraries such as [RAPIDS](https://rapids.ai/), which provides GPU acceleration for the whole modeling pipeline. We cover the pros and cons of all these methods in detail in <<chapter_tabular>> in this book."
]
},
{
@ -246,7 +246,7 @@
"\n",
"Finally, you could build two **models** for purchase probabilities, conditional on seeing or not seeing a recommendation. The difference between these two probabilities is a utility function for a given recommendation to a customer. It will be low in cases where the algorithm recommends a familiar book that the customer has already rejected (both components are small) or a book that he or she would have bought even without the recommendation (both components are large and cancel each other out).\n",
"\n",
"As you can see, in practice often the practical implementation of your model will require a lot more than just training a model! You'll often need to run experiments to collect more data, and consider how to incorporate your models into the overall system you're developing. Speaking of data, let's now focus on how to find find data for your project."
"As you can see, in practice often the practical implementation of your model will require a lot more than just training a model! You'll often need to run experiments to collect more data, and consider how to incorporate your models into the overall system you're developing. Speaking of data, let's now focus on how to find data for your project."
]
},
{
@ -260,7 +260,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For many types of projects, you may be able to find all the data you need online. The project we'll be completing in this chapter is a *bear detector*. It will discriminate between three types of bear: grizzly, black, and teddy bear. There are many images on the Internet of each type of bear we can use. We just need a way to find them and download them. We've provided a tool you can use for this purpose, so you can follow along with this chapter, creating your own image recognition application for whatever kinds of object you're interested in. In the fast.ai course, thousands of students have presented their work on the course forums, displaying everything from Trinidad hummingbird varieties, to Panama bus types, and even an application that helped one student let his fiancee recognize his sixteen cousins during Christmas vacation!"
"For many types of projects, you may be able to find all the data you need online. The project we'll be completing in this chapter is a *bear detector*. It will discriminate between three types of bear: grizzly, black, and teddy bear. There are many images on the Internet of each type of bear we can use. We just need a way to find them and download them. We've provided a tool you can use for this purpose, so you can follow along with this chapter, creating your own image recognition application for whatever kinds of object you're interested in. In the fast.ai course, thousands of students have presented their work on the course forums, displaying everything from Trinidad hummingbird varieties, to Panama bus types, and even an application that helped one student let his fiancée recognize his sixteen cousins during Christmas vacation!"
]
},
{
@ -550,7 +550,7 @@
"File: ~/git/fastai/fastai/vision/utils.py\n",
"Type: function\n",
"```\n",
"It tells us what argument the function accepts (`fns`) then shows us the source code and the file it comes from. Looking at that source code, we can see it applies the function `verify_image` in parallel and only keep the ones for which the result of that function is `False`, which is consistent with the doc string: it finds the images in `fns` that can't be opened.\n",
"It tells us what argument the function accepts (`fns`) then shows us the source code and the file it comes from. Looking at that source code, we can see it applies the function `verify_image` in parallel and only keeps the ones for which the result of that function is `False`, which is consistent with the doc string: it finds the images in `fns` that can't be opened.\n",
"\n",
"Here are the commands that are very useful in Jupyter notebooks:\n",
"\n",
@ -558,7 +558,7 @@
"- when inside the parenthesis of a function, pressing \"shift\" and \"tab\" simultaneously will display a window with the signature of the function and a short documentation. Pressing it twice will expand the documentation and pressing it three times will open a full window with the same information at the bottom of your screen.\n",
"- in a cell, typing `?func_name` and executing will open a window with the signature of the function and a short documentation.\n",
"- in a cell, typing `??func_name` and executing will open a window with the signature of the function, a short documentation and the source code.\n",
"- if you are using the fasti library, we added a `doc` function for you, executing `doc(func_name)` in a cell will open a window with the signature of the function, a short documentation and links to the source code on GitHub and the full documentation of the funciton in the [documentation of the library](https://docs.fast.ai).\n",
"- if you are using the fastai library, we added a `doc` function for you, executing `doc(func_name)` in a cell will open a window with the signature of the function, a short documentation and links to the source code on GitHub and the full documentation of the function in the [documentation of the library](https://docs.fast.ai).\n",
"- unrelated to the documentation but still very useful to get help, at any point, if you get an error, type `%debug` in the next cell and execute to open the [python debugger](https://docs.python.org/3/library/pdb.html) that will let you inspect the content of every variable."
]
},
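For example, in a notebook cell you might explore `verify_images` with these helpers. A sketch, assuming the fastai library is installed (the `?` and `??` suffixes are IPython syntax, so they only work in a notebook or IPython session):

```python
from fastai.vision.all import *

verify_images?      # signature and short documentation
verify_images??     # same, plus the source code
doc(verify_images)  # fastai helper: docs plus links to source and docs.fast.ai
```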
@ -587,7 +587,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So with this as your training data, you would end up not with a healthy skin detector, but a *young white woman touching her face* detector! Be sure to think carefully about the types of data that you might expect to see in practice in your application, and check carefully to ensure that all these types are reflected in your model's source data.footnote:[Thanks to Deb Raji, who came up with the *healthy skin* example. See her paper *Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products* for more fascinating insights into model bias.]"
"So with this as your training data, you would end up not with a healthy skin detector, but a *young white woman touching her face* detector! Be sure to think carefully about the types of data that you might expect to see in practice in your application, and check carefully to ensure that all these types are reflected in your model's source data. footnote:[Thanks to Deb Raji, who came up with the *healthy skin* example. See her paper *Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products* for more fascinating insights into model bias.]"
]
},
{
@ -800,7 +800,7 @@
"\n",
"Instead, what we normally do in practice is to randomly select part of the image, and crop to just that part. On each epoch (which is one complete pass through all of our images in the dataset) we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world; different photos of the same thing may be framed in slightly different ways.\n",
"\n",
"Here is a another copy of the previous examples, but this time we are replacing `Resize` with `RandomResizedCrop`, which is the transform that provides the behaviour described above.The most important parameter to pass in is the `min_scale` parameter, which determines how much of the image to select at minimum each time."
"Here is another copy of the previous examples, but this time we are replacing `Resize` with `RandomResizedCrop`, which is the transform that provides the behaviour described above. The most important parameter to pass in is the `min_scale` parameter, which determines how much of the image to select at minimum each time."
]
},
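A minimal sketch of what that replacement looks like, assuming the `bears` DataBlock and `path` defined earlier in the chapter:

```python
bears = bears.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = bears.dataloaders(path)
# unique=True repeats one image, so we see different random crops of it
dls.train.show_batch(max_n=4, nrows=1, unique=True)
```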
{
@ -855,7 +855,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms to a batch, we use the `batch_tfms` parameter. (Note that's we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
"Data augmentation refers to creating random variations of our input data, such that they appear different, but are not expected to change the meaning of the data. Examples of common data augmentation for images are rotation, flipping, perspective warping, brightness changes, contrast changes, and much more. For natural photo images such as the ones we are using here, there is a standard set of augmentations which we have found work pretty well, and are provided with the `aug_transforms` function. Because the images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms to a batch, we use the `batch_tfms` parameter. (Note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason)."
]
},
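Concretely, a sketch of that setup, again assuming the `bears` DataBlock from earlier (`mult=2` doubles the default amount of augmentation, as described above):

```python
bears = bears.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)
```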
{
@ -901,7 +901,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Time to use the same lined of codes as in <<chapter_intro>> to train our bear classifier.\n",
"Time to use the same lines of code as in <<chapter_intro>> to train our bear classifier.\n",
"\n",
"We don't have a lot of data for our problem (150 pictures of each sort of bear at most), so to train our model, we'll use `RandomResizedCrop` and default `aug_transforms` for our model, on an image size of 224px, which is fairly standard for image classification."
]
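A sketch of that training setup, assuming the `bears` DataBlock from earlier (the architecture and number of epochs here are illustrative choices):

```python
bears = bears.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms())
dls = bears.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)
```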
@ -1065,7 +1065,7 @@
"source": [
"Each row here represents all the black, grizzly, and teddy bears in our dataset, respectively. Each column represents the images which the model predicted as black, grizzly, and teddy bears, respectively. Therefore, the diagonal of the matrix shows the images which were classified correctly, and the other, off diagonal, cells represent those which were classified incorrectly. This is called a *confusion matrix* and is one of the many ways that fastai allows you to view the results of your model. It is (of course!) calculated using the validation set. With the color coding, the goal is to have white everywhere, except the diagonal where we want dark blue. Our bear classifier isn't making many mistakes!\n",
"\n",
"It's helpful to see where exactly our errors are occuring, to see whether it's due to a dataset problem (e.g. images that aren't bears at all, or are labelled incorrectly, etc), or a model problem (e.g. perhaps it isn't handling images taken with unusual lighting, or from a different angle, etc). To do this, we can sort out images by their *loss*.\n",
"It's helpful to see where exactly our errors are occurring, to see whether it's due to a dataset problem (e.g. images that aren't bears at all, or are labelled incorrectly, etc.), or a model problem (e.g. perhaps it isn't handling images taken with unusual lighting, or from a different angle, etc.). To do this, we can sort out images by their *loss*.\n",
"\n",
"The *loss* is a number that is higher if the model is incorrect (and especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer. In a couple chapters we'll learn in depth how loss is calculated and used in training process. For now, `plot_top_losses` shows us the images with the highest loss in our dataset. As the title of the output says, each image is labeled with four things: prediction, actual (target label), loss, and probability. The *probability* here is the confidence level, from zero to one, that the model has assigned to its prediction."
]
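Both views come from the same fastai interpretation object; a minimal sketch, assuming a trained `learn` as above:

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()      # correct predictions on the diagonal
interp.plot_top_losses(5, nrows=1)  # images the model is most wrong or unsure about
```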
@ -1172,7 +1172,7 @@
"for idx,cat in cleaner.change(): shutil.move(cleaner.fns[idx], path/cat)\n",
"```\n",
"\n",
"> s: Cleaning the data or getting it ready for your model are two of the biggest challenges for data scientists, one they say take 90% of their time. The fastai library aims at providing tools to make it as easy as possible.\n",
"> s: Cleaning the data or getting it ready for your model are two of the biggest challenges for data scientists; they say it takes 90% of their time. The fastai library aims at providing tools to make it as easy as possible.\n",
"\n",
"We'll be seeing more examples of model-driven data cleaning throughout this book. Once we've cleaned up our data, we can retrain our model. Try it yourself, and see if your accuracy improves!"
]
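Putting the cleaner's pieces together, a sketch of the full workflow, assuming a trained `learn` and the `path` used above:

```python
import shutil
from fastai.vision.widgets import ImageClassifierCleaner

cleaner = ImageClassifierCleaner(learn)
cleaner  # review and flag images in the widget, then run the loops below
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx, cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)
```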
@ -1354,7 +1354,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We know how to make predictions from our saved model, so we have everything we need to start building our app. We can do it directly in a Jupyter Notenook."
"We know how to make predictions from our saved model, so we have everything we need to start building our app. We can do it directly in a Jupyter Notebook."
]
},
{
@ -1377,7 +1377,7 @@
"\n",
"*IPython widgets* are GUI components that bring together JavaScript and Python functionality in a web browser, and can be created and used within a Jupyter notebook. For instance, the image cleaner that we saw earlier in this chapter is entirely written with IPython widgets. However, we don't want to require users of our application to have to run Jupyter themselves.\n",
"\n",
"That is why *Voilà* exists. It is a system for making applications consisting of IPython widgets available to end-users, without them having to use Jupyter at all. Voila is taking advantage of the fact that a notebook _already is_ a kind of web application, just a rather complex one that depends on another web application Jupyter itself. Essentially, it helps us automatically convert the complex web application which we've already implicitly made (the notebook) into a simpler, easier-to-deploy web application, which functions like a normal web application rather than like a notebook.\n",
"That is why *Voilà* exists. It is a system for making applications consisting of IPython widgets available to end-users, without them having to use Jupyter at all. Voilà is taking advantage of the fact that a notebook _already is_ a kind of web application, just a rather complex one that depends on another web application, Jupyter itself. Essentially, it helps us automatically convert the complex web application which we've already implicitly made (the notebook) into a simpler, easier-to-deploy web application, which functions like a normal web application rather than like a notebook.\n",
"\n",
"But we still have the advantage of developing in a notebook. So with ipywidgets, we can build up our GUI step by step. We will use this approach to create a simple image classifier. First, we need a file upload widget:"
]
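A minimal sketch of that first widget (`ipywidgets` ships with standard Jupyter installs):

```python
import ipywidgets as widgets

btn_upload = widgets.FileUpload()
btn_upload  # renders an upload button in the notebook
```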
@ -1644,7 +1644,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have written all the code necessary for our app. The next step is to convert it in something we can deploy."
"We have written all the code necessary for our app. The next step is to convert it into something we can deploy."
]
},
{
@ -1660,16 +1660,16 @@
"source": [
"Now that we have everything working in this Jupyter notebook, we can create our application. To do this, create a notebook which contains only the code needed to create and show the widgets that you need, and markdown for any text that you want to appear. Have a look at the *bear_classifier* notebook in the book repo to see the simple notebook application we created.\n",
"\n",
"Next, install Voila if you have not already, by copying these lines into a Notebook cell, and executing it (if you're comfortable using the command line, you can also execute these two lines in your terminal, without the `!` prefix):\n",
"Next, install Voilà if you have not already, by copying these lines into a Notebook cell, and executing it (if you're comfortable using the command line, you can also execute these two lines in your terminal, without the `!` prefix):\n",
"\n",
" !pip install voila\n",
" !jupyter serverextension enable voila --sys-prefix\n",
"\n",
"Cells which begin with a `!` do not contain Python code, but instead contain code which is passed to your shell, such as bash, power shell in windows, or so forth. If you are comfortable using the command line (which we'll be learning about later in this book), you can of course simply type these two lines (without the `!` prefix) directly into your terminal. In this case, the first line installs the voila library and application, and the second connects it to your existing Jupyter notebook.\n",
"\n",
"Voila runs Jupyter notebooks, just like the Jupyter notebook server you are using now does, except that it does something very important: it removes all of the cell inputs, and only shows output (including ipywidgets), along with your markdown cells. So what's left is a web application! To view your notebook as a voila web application replace the word \"notebooks\" in your browser's URL with: \"voila/render\". You will see the same content as your notebook, but without any of the code cells.\n",
"Voilà runs Jupyter notebooks, just like the Jupyter notebook server you are using now does, except that it does something very important: it removes all of the cell inputs, and only shows output (including ipywidgets), along with your markdown cells. So what's left is a web application! To view your notebook as a voila web application replace the word \"notebooks\" in your browser's URL with: \"voila/render\". You will see the same content as your notebook, but without any of the code cells.\n",
"\n",
"Of course, you don't need to use Voila or ipywidgets. Your model is just a function you can call: `pred,pred_idx,probs = learn.predict(img)` . So you can use it with any framework, hosted on any platform. And you can take something you've prototyped in ipywidgets and Voila and later convert it into a regular web application. We're showing you this approach in the book because we think it's a great way for data scientists and other folks that aren't web development experts to create applications from their models.\n",
"Of course, you don't need to use Voilà or ipywidgets. Your model is just a function you can call: `pred,pred_idx,probs = learn.predict(img)` . So you can use it with any framework, hosted on any platform. And you can take something you've prototyped in ipywidgets and Voilà and later convert it into a regular web application. We're showing you this approach in the book because we think it's a great way for data scientists and other folks that aren't web development experts to create applications from their models.\n",
"\n",
"We have our app, now let's deploy it!"
]
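To illustrate that last point, here's a sketch of calling the exported model as a plain function, independent of any GUI. It assumes a model exported to 'export.pkl' and an image file 'grizzly.jpg'; both names are placeholders:

```python
from fastai.vision.all import load_learner, PILImage

learn_inf = load_learner('export.pkl')
pred, pred_idx, probs = learn_inf.predict(PILImage.create('grizzly.jpg'))
print(f"{pred} ({probs[pred_idx]:.4f})")  # e.g. predicted class and its confidence
```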
@ -1692,7 +1692,7 @@
"- The complexities of dealing with GPU inference are significant. In particular, the GPU's memory will need careful manual management, and you'll need some careful queueing system to ensure you only do one batch at a time\n",
"- There's a lot more market competition in CPU servers than GPU, as a result of which there's much cheaper options available for CPU servers.\n",
"\n",
"Because of the complexity of GPU serving, many systems have sprung up to try to automate this. However, managing and running these systems is themselves complex, and generally requires compiling your model into a different form that's specialized for that system. It doesn't make sense to deal with this complexity until/unless your app gets popular enough that it makes clear financial sense for you to do so."
"Because of the complexity of GPU serving, many systems have sprung up to try to automate this. However, managing and running these systems is also complex, and generally requires compiling your model into a different form that's specialized for that system. It doesn't make sense to deal with this complexity until/unless your app gets popular enough that it makes clear financial sense for you to do so."
]
},
{
@ -1720,7 +1720,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The first time you do this Binder will take around 5 minutes to build your site. In other words, is it finding a virtual machine which can run your app, allocating storage, collecting the files needed for Jupyter, for your notebook, and for presenting your notebook as a web application. It's doing all of this behind the scenes.\n",
"The first time you do this Binder will take around 5 minutes to build your site. In other words, it is finding a virtual machine which can run your app, allocating storage, collecting the files needed for Jupyter, for your notebook, and for presenting your notebook as a web application. It's doing all of this behind the scenes.\n",
"\n",
"Finally, once it has started the app running, it will navigate your browser to your new web app. You can share the URL you copied to allow others to access your app as well.\n",
"\n",
@ -1731,11 +1731,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You may well want to deploy your application onto mobile devices, or edge devices such as a Raspberry Pi. There are a lot of libraries and frameworks to allow you to integrate a model directly into a mobile application. However these approaches tend to require a lot of extra steps and boilerplate, and do not always support all the PyTorch and fastai layers that your model might use. In addition, the work you do will depend on what kind of mobile devices you are targeting for deployment. So you might need to do some work to run on iOS devices, different work to run on newer Android devices, different work for older Android devices, etc.. Instead, we recommend wherever possible that you deploy the model itself to a server, and have your mobile or edge application connect to it as a web service.\n",
"You may well want to deploy your application onto mobile devices, or edge devices such as a Raspberry Pi. There are a lot of libraries and frameworks to allow you to integrate a model directly into a mobile application. However these approaches tend to require a lot of extra steps and boilerplate, and do not always support all the PyTorch and fastai layers that your model might use. In addition, the work you do will depend on what kind of mobile devices you are targeting for deployment. So you might need to do some work to run on iOS devices, different work to run on newer Android devices, different work for older Android devices, etc. Instead, we recommend wherever possible that you deploy the model itself to a server, and have your mobile or edge application connect to it as a web service.\n",
"\n",
"There is quite a few upsides to this approach. The initial installation is easier, because you only have to deploy a small GUI application, which connects to the server to do all the heavy lifting. More importantly perhaps, upgrades of that core logic can happen on your server, rather than needing to be distributed to all of your users. Your server can have a lot more memory and processing capacity than most edge devices, and it is far easier to scale those resources if your model becomes more demanding. The hardware that you will have on a server is going to be more standard and more easily supported by fastai and PyTorch, so you don't have to compile your model into a different form.\n",
"\n",
"There are downsides too, of course. Your application will require a network connection, and there will be some latency each time the model is called. It takes a while for a neural network model to run anyway, so this additional network latency may not make a big difference to your users in practice. In fact, since you can use better hardware on the server, the overall latency may even be less! If your application uses sensitive data then your users may be concerned about an approach which sends that data to a remote server, so sometimes privacy considerations will mean that you need to run the model on the edge device. Sometimes this can be avoided by having an *on premise* server, such as inside a company's firewall. Managing the complexity and scaling the server can create additional overhead, whereas if your model runs on the edge devices then each user is bringing their own compute resources, which leads to easier scaling with an increasing number of users (also known as _horizontal scaling_)."
"There are downsides too, of course. Your application will require a network connection, and there will be some latency each time the model is called. It takes a while for a neural network model to run anyway, so this additional network latency may not make a big difference to your users in practice. In fact, since you can use better hardware on the server, the overall latency may even be less! If your application uses sensitive data then your users may be concerned about an approach which sends that data to a remote server, so sometimes privacy considerations will mean that you need to run the model on the edge device. Sometimes this can be avoided by having an *on premise* server, such as inside a company's firewall. Managing the complexity and scaling the server can create additional overhead, whereas if your model runs on the edge devices then each user is bringing their own compute resources, which leads to easier scaling with an increasing number of users (also known as *horizontal scaling*)."
]
},
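As a rough illustration of the server-side approach, here is a hypothetical sketch using Flask. Flask is our choice for the example, not something this chapter prescribes; the endpoint name, port, and 'export.pkl' path are all made up:

```python
from flask import Flask, request, jsonify
from fastai.vision.all import load_learner, PILImage

app = Flask(__name__)
learn_inf = load_learner('export.pkl')  # assumes the model exported earlier

@app.route('/predict', methods=['POST'])
def predict():
    img = PILImage.create(request.files['file'].read())  # accepts raw image bytes
    pred, pred_idx, probs = learn_inf.predict(img)
    return jsonify({'prediction': str(pred),
                    'probability': float(probs[pred_idx])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)  # the mobile/edge app calls this endpoint
```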
{
@ -1746,7 +1746,7 @@
"\n",
"Overall, we'd recommend using a simple CPU-based server approach where possible, for as long as you can get away with it. If you're lucky enough to have a very successful application, then you'll be able to justify the investment in more complex deployment approaches at that time.\n",
"\n",
"Congratulations, you have succesfully built a deep learning model and deployed it! Now is a good time to take a pause and think about what could go wrong."
"Congratulations, you have successfully built a deep learning model and deployed it! Now is a good time to take a pause and think about what could go wrong."
]
},
{
@ -1762,7 +1762,7 @@
"source": [
"In practice, a deep learning model will be just one piece of a much bigger system. As we discussed at the start of this chapter, a *data product* requires thinking about the entire end to end process within which our model lives. In this book, we can't hope to cover all the complexity of managing deployed data products, such as managing multiple versions of models, A/B testing, canarying, refreshing the data (should we just grow and grow our datasets all the time, or should we regularly remove some of the old data), handling data labelling, monitoring all this, detecting model rot, and so forth. However, there is an excellent book that covers many deployment issues, which is [Building Machine Learning Powered Applications](https://www.amazon.com/Building-Machine-Learning-Powered-Applications/dp/149204511X), by Emmanuel Ameisen. In this section, we will give an overview of some of the most important issues to consider.\n",
"\n",
"One of the biggest issues with this is that understanding and testing the behavior of a deep learning model is much more difficult than most code that you would write. With normal software development you can analyse the exact steps that the software is taking, and carefully study with of these steps match the desired behaviour that you are trying to create. But with a neural network the behavior emerges from the models attempt to match the training data, rather than being exactly defined.\n",
"One of the biggest issues with this is that understanding and testing the behavior of a deep learning model is much more difficult than most code that you would write. With normal software development you can analyse the exact steps that the software is taking, and carefully study which of these steps match the desired behaviour that you are trying to create. But with a neural network the behavior emerges from the models attempt to match the training data, rather than being exactly defined.\n",
"\n",
"This can result in disaster! For instance, let's say you really were rolling out a bear detection system which will be attached to video cameras around the campsite, and will warn campers of incoming bears. If we used a model trained with the dataset we downloaded, there are going to be all kinds of problems in practice, such as:\n",
"\n",
@ -1808,7 +1808,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: I started a company 20 years ago called _Optimal Decisions_ which used machine learning and optimisation to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described above to manage the potential downsides of something that might go wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end to end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms in production, but every rollout was successful."
"> J: I started a company 20 years ago called _Optimal Decisions_ which used machine learning and optimisation to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described above to manage the potential downsides of something that might go wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end to end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms in production, but every rollout was successful."
]
},
{
@ -1822,7 +1822,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the biggest challenges in rolling out a model is that your model may change the behaviour of the system it is a part of. For instance, consider a \"predictive policing\" algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. In the Royal Statiscal Society paper [To predict and serve](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x), Kristian Lum and William Isaac write: \"predictive policing is aptly named: it is predicting future policing, not future crime\".\n",
"One of the biggest challenges in rolling out a model is that your model may change the behaviour of the system it is a part of. For instance, consider a \"predictive policing\" algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. In the Royal Statistical Society paper [To predict and serve](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x), Kristian Lum and William Isaac write: \"predictive policing is aptly named: it is predicting future policing, not future crime\".\n",
"\n",
"Part of the issue in this case is that in the presence of *bias* (which we'll discuss in depth in the next chapter), feedback loops can result in negative implications of that bias getting worse and worse. For instance, there are concerns that this is already happening in the US, where there is significant bias in arrest rates on racial grounds. [According to the ACLU](https://www.aclu.org/issues/smart-justice/sentencing-reform/war-marijuana-black-and-white), \"despite roughly equal usage rates, Blacks are 3.73 times more likely than whites to be arrested for marijuana\". The impact of this bias, along with the roll-out of predictive policing algorithms in many parts of the US, led Bärí Williams to [write in the NY Times](https://www.nytimes.com/2017/12/02/opinion/sunday/intelligent-policing-and-my-innocent-children.html): \"The same technology thats the source of so much excitement in my career is being used in law enforcement in ways that could mean that in the coming years, my son, who is 7 now, is more likely to be profiled or arrested — or worse — for no reason other than his race and where we live.\"\n",
"\n",
@ -1874,7 +1874,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Provide an example of where the bear classification model might work poorly, due to structural or style differences to the training data\n",
"1. Provide an example of where the bear classification model might work poorly, due to structural or style differences to the training data.\n",
"1. Where do text models currently have a major deficiency?\n",
"1. What are possible negative societal implications of text generation models?\n",
"1. In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?\n",
@ -1929,25 +1929,10 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -50,7 +50,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: At university, philosophy of ethics was my main thing (it would have been the topic of my thesis, if I'd finished it, instead of dropping out to join the real-world). Based on the years I spent studying ethics, I can tell you this: no one really agrees on what right and wrong are, whether they exist, how to spot them, which people are good, and which bad, or pretty much anything else. So don't expect too much from the theory! We're going to focus on examples and thought starters here, not theory."
"> J: At university, philosophy of ethics was my main thing (it would have been the topic of my thesis, if I'd finished it, instead of dropping out to join the real-world). Based on the years I spent studying ethics, I can tell you this: no one really agrees on what right and wrong are, whether they exist, how to spot them, which people are good, and which bad, or pretty much anything else. So don't expect too much from the theory! We're going to focus on examples and thought starters here, not theory."
]
},
{
@ -62,7 +62,7 @@
"- Well-founded standards of right and wrong that prescribe what humans ought to do, and\n",
"- The study and development of one's ethical standards.\n",
"\n",
"There is no list of right answers for ethics. There is no list of dos and don'ts. Ethics is complicated, and context-dependent. It involves the perspectives of many stakeholders. Ethics is a muscle that you have to develop and practice. In this chapter, our goal is to provide some signposts to help you on that journey.\n",
"There is no list of right answers for ethics. There is no list of do's and dont's. Ethics is complicated, and context-dependent. It involves the perspectives of many stakeholders. Ethics is a muscle that you have to develop and practice. In this chapter, our goal is to provide some signposts to help you on that journey.\n",
"\n",
"Spotting ethical issues is best to do as part of a collaborative team. This is the only way you can really incorporate different perspectives. Different people's backgrounds will help them to see things which may not be obvious to you. Working with a team is helpful for many \"muscle building\" activities, including this one.\n",
"\n",
@ -84,7 +84,7 @@
"\n",
"1. **Recourse processes**: Arkansas's buggy healthcare algorithms left patients stranded\n",
"2. **Feedback loops**: YouTube's recommendation system helped unleash a conspiracy theory boom\n",
"3. **Bias**: When a traditionally African-American name is searched for on Google, it displays ads for criminal background checks.\n",
"3. **Bias**: When a traditionally African-American name is searched for on Google, it displays ads for criminal background checks\n",
"\n",
"In fact, for every concept that we introduce in this chapter, we are going to provide at least one specific example. For each one, have a think about what you could have done in this situation, and think about what kinds of obstructions there might have been to you getting that done. How would you deal with them? What would you look out for?"
]
@ -242,7 +242,7 @@
"\n",
"The modern workplace is a very specialised place. Everybody tends to have very well-defined jobs to perform. Especially in large companies, it can be very hard to know what all the pieces of the puzzle are. Sometimes companies even intentionally obscure the overall project goals that are being worked on, if they know that their employees are not going to like the answers. This is sometimes done by compartmentalising pieces as much as possible\n",
"\n",
"In other words, we're not saying that any of this is easy. It's hard. It's really hard. We all have to do our best. And we have often seen that the people who do get involved in the higher-level context of these projects, and attempt to develop cross disciplinary capabilities and teams, become some of the most important and well rewarded parts of their organisations. It's the kind of work that tends to be highly appreciated by senior executives, even if it is considered, sometimes, rather uncomfortable by middle management."
"In other words, we're not saying that any of this is easy. It's hard. It's really hard. We all have to do our best. And we have often seen that the people who do get involved in the higher-level context of these projects, and attempt to develop cross-disciplinary capabilities and teams, become some of the most important and well rewarded members of their organisations. It's the kind of work that tends to be highly appreciated by senior executives, even if it is considered, sometimes, rather uncomfortable by middle management."
]
},
{
@ -284,9 +284,9 @@
"source": [
"In a complex system, it is easy for no one person to feel responsible for outcomes. While this is understandable, it does not lead to good results. In the earlier example of the Arkansas healthcare system in which a bug led to people with cerebral palsy losing access to needed care, the creator of the algorithm blamed government officials, and government officials could blame those who implemented the software. NYU professor Danah Boyd described this phenomenon: \"bureaucracy has often been used to evade responsibility, and today's algorithmic systems are extending bureaucracy.\"\n",
"\n",
"An additional reason why recourse is so necessary, is because data often contains errors. Mechanisms for audits and error-correction are crucial. A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). In this case, there was no process in place for correcting mistakes or removing people once theyve been added. Another example is the US credit report system; in a large-scale study of credit reports by the FTC (Federal Trade Commission) in 2012, it was found that 26% of consumers had at least one mistake in their files, and 5% had errors that could be devastating. Yet, the process of getting such errors corrected is incredibly slow and opaque. When public-radio reporter Bobby Allyn discovered that he was erroneously listed as having a firearms conviction, it took him \"more than a dozen phone calls, the handiwork of a county court clerk and six weeks to solve the problem. And that was only after I contacted the companys communications department as a journalist.\" (as covered in the article [How the careless errors of credit reporting agencies are ruining peoples lives](https://www.washingtonpost.com/posteverything/wp/2016/09/08/how-the-careless-errors-of-credit-reporting-agencies-are-ruining-peoples-lives/))\n",
"An additional reason why recourse is so necessary is because data often contains errors. Mechanisms for audits and error-correction are crucial. A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). In this case, there was no process in place for correcting mistakes or removing people once theyd been added. Another example is the US credit report system: in a large-scale study of credit reports by the FTC (Federal Trade Commission) in 2012, it was found that 26% of consumers had at least one mistake in their files, and 5% had errors that could be devastating. Yet, the process of getting such errors corrected is incredibly slow and opaque. When public-radio reporter Bobby Allyn discovered that he was erroneously listed as having a firearms conviction, it took him \"more than a dozen phone calls, the handiwork of a county court clerk and six weeks to solve the problem. And that was only after I contacted the companys communications department as a journalist.\" (as covered in the article [How the careless errors of credit reporting agencies are ruining peoples lives](https://www.washingtonpost.com/posteverything/wp/2016/09/08/how-the-careless-errors-of-credit-reporting-agencies-are-ruining-peoples-lives/))\n",
"\n",
"As machine learning practitioners, we do not always think of it as our responsibility to understand how our algorithms and up being implemented in practice. But we need to."
"As machine learning practitioners, we do not always think of it as our responsibility to understand how our algorithms end up being implemented in practice. But we need to."
]
},
{
@ -347,7 +347,7 @@
"\n",
"> : \"One important signal to classify the main topic of a video is the channel it comes from. For example, a video uploaded to a cooking channel is very likely to be a cooking video. But how do we know what topic a channel is about? Well… in part by looking at the topics of the videos it contains! Do you see the loop? For example, many videos have a description which indicates what camera was used to shoot the video. As a result, some of these videos might get classified as videos about “photography”. If a channel has such as misclassified video, it might be classified as a “photography” channel, making it even more likely for future videos on this channel to be wrongly classified as “photography”. This could even lead to runaway virus-like classifications! One way to break this feedback loop is to classify videos with and without the channel signal. Then when classifying the channels, you can only use the classes obtained without the channel signal. This way, the feedback loop is broken.\"\n",
"\n",
"There are positive examples of people and organizations attempting to combat these problems. Evan Estola, lead machine learning engineer at Meetup, [discussed the example](https://www.youtube.com/watch?v=MqoRzNhrTnQ) of men expressing more interest than women in tech meetups. Meetups algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop, but explicitly not using gender for that part of their model. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact. \"You need to decide which feature not to use in your algorithm… the most optimal algorithm is perhaps not the best one to launch into production\", he said.\n",
"There are positive examples of people and organizations attempting to combat these problems. Evan Estola, lead machine learning engineer at Meetup, [discussed the example](https://www.youtube.com/watch?v=MqoRzNhrTnQ) of men expressing more interest than women in tech meetups. Meetups algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop, by explicitly not using gender for that part of their model. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact. \"You need to decide which feature not to use in your algorithm… the most optimal algorithm is perhaps not the best one to launch into production\", he said.\n",
"\n",
"While Meetup chose to avoid such an outcome, Facebook provides an example of allowing a runaway feedback loop to run wild. Facebook radicalizes users interested in one conspiracy theory by introducing them to more. As [Renee DiResta, a researcher on proliferation of disinformation, writes](https://www.fastcompany.com/3059742/social-network-algorithms-are-distorting-reality-by-boosting-conspiracy-theories):"
]
@ -363,7 +363,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It is extremely important to keep in mind this kind of behavior can happen, and to either anticipate a feedback loop or take positive action to break it when you can the first signs of it in your own projects. Another thing to keep in mind is *bias*, which, as we discussed in the previous chapter, can interact with feedback loops in very troublesome ways."
"It is extremely important to keep in mind this kind of behavior can happen, and to either anticipate a feedback loop or take positive action to break it when you can see the first signs of it in your own projects. Another thing to keep in mind is *bias*, which, as we discussed in the previous chapter, can interact with feedback loops in very troublesome ways."
]
},
{
@ -414,7 +414,7 @@
" - When doctors were shown identical files, they were much less likely to recommend cardiac catheterization (a helpful procedure) to Black patients\n",
" - When bargaining for a used car, Black people were offered initial prices $700 higher and received far smaller concessions\n",
" - Responding to apartment-rental ads on Craigslist with a Black name elicited fewer responses than with a white name\n",
" - An all-white jury was 16 percentage points more likely to convict a Black defendant than a white one, but when a jury had 1 Black member, it convicted both at same rate.\n",
" - An all-white jury was 16 percentage points more likely to convict a Black defendant than a white one, but when a jury had 1 Black member, it convicted both at the same rate\n",
"\n",
"The COMPAS algorithm, widely used for sentencing and bail decisions in the US, is an example of an important algorithm which, when tested by ProPublica, showed clear racial bias in practice:"
]
@ -635,7 +635,7 @@
" - People are more likely to assume algorithms are objective or error-free (even if theyre given the option of a human override)\n",
" - Algorithms are more likely to be implemented with no appeals process in place\n",
" - Algorithms are often used at scale\n",
" - Algorithmic systems are cheap.\n",
" - Algorithmic systems are cheap\n",
"\n",
"Even in the absence of bias, algorithms (and deep learning especially, since it is such an effective and scalable algorithm) can lead to negative societal problems, such as when used for *disinformation*."
]
@ -714,7 +714,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's easy to miss important issues when considered ethical implications of your work. One thing that helps enormously is simply asking the right questions. Rachel Thomas recommends considering the following questions throughout the development of a data project:\n",
"It's easy to miss important issues when considering ethical implications of your work. One thing that helps enormously is simply asking the right questions. Rachel Thomas recommends considering the following questions throughout the development of a data project:\n",
"\n",
" - Should we even be doing this?\n",
" - What bias is in the data?\n",
@ -775,7 +775,7 @@
" - Will the effects in aggregate likely create more good than harm, and what types of good and harm?\n",
" - Are we thinking about all relevant types of harm/benefit (psychological, political, environmental, moral, cognitive, emotional, institutional, cultural)?\n",
" - How might future generations be affected by this project?\n",
" - Do the risks of harm from this project fall disproportionately on the least powerful in society? Will the benefits go disproportionately the well-off?\n",
" - Do the risks of harm from this project fall disproportionately on the least powerful in society? Will the benefits go disproportionately to the well-off?\n",
" - Have we adequately considered dual-use?\n",
"\n",
"The alternative lens to this is the *deontological* perspective, which focuses on basic *right* and *wrong*:\n",
@ -814,7 +814,7 @@
"\n",
"Receiving mentorship has been statistically shown to help men advance, but not women. The reason behind this is that when women receive mentorship, its advice on how they should change and gain more self-knowledge. When men receive mentorship, its public endorsement of their authority. Guess which is more useful in getting promoted?\n",
"\n",
"As long as qualified women keep dropping out of tech, teaching more girls to code will not solve the diversity issues plaguing the field. Diversity initiatives often end up focusing primarily on white women, even although women of colour face many additional barriers. In interviews with 60 women of color who work in STEM research, 100% had experienced discrimination."
"As long as qualified women keep dropping out of tech, teaching more girls to code will not solve the diversity issues plaguing the field. Diversity initiatives often end up focusing primarily on white women, even though women of colour face many additional barriers. In interviews with 60 women of color who work in STEM research, 100% had experienced discrimination."
]
},
{
@ -843,7 +843,7 @@
"\n",
"FAccT is another lens that you may find useful in considering ethical issues. One useful resource for this is the free online book [Fairness and machine learning; Limitations and Opportunities](https://fairmlbook.org/), which \"gives a perspective on machine learning that treats fairness as a central concern rather than an afterthought.\" It also warns, however, that it \"is intentionally narrow in scope... A narrow framing of machine learning ethics might be tempting to technologists and businesses as a way to focus on technical interventions while sidestepping deeper questions about power and accountability. We caution against this temptation.\" Rather than provide an overview of the FAccT approach to ethics (which is better done in books such as the one linked above), our focus here will be on the limitations of this kind of narrow framing.\n",
"\n",
"One great way to consider whether an ethical lens is complete, is to try to come up with an example where the lens and our own ethical intuitions give diverging results. Os Keyes explored this in a graphic way in their paper [A Mulching Proposal\n",
"One great way to consider whether an ethical lens is complete, is to try to come up with an example where the lens and our own ethical intuitions give diverging results. Os Keyes et al. explored this in a graphic way in their paper [A Mulching Proposal\n",
"Analysing and Improving an Algorithmic System for Turning the Elderly into High-Nutrient Slurry](https://arxiv.org/abs/1908.06166). The paper's abstract says:"
]
},
@ -860,7 +860,7 @@
"source": [
"In this paper, the rather controversial proposal (\"Turning the Elderly into High-Nutrient Slurry\") and the results (\"drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system\") are at odds... to say the least!\n",
"\n",
"In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc, which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution.\n",
"In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc., which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution.\n",
"\n",
"So far, we've focused on things that you and your organization can do. But sometimes individual or organizational action is not enough. Sometimes, governments also need to consider policy implications."
]
@ -876,7 +876,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We often talk to people who are eager for technical or design fixes to be full solution to the kinds of problems that we've been discussing; for instance, a technical approach to debias data, or design guidelines for making technology less addictive. While such measures can be useful, they will not be sufficient to address the underlying problems that have led to our current state. For example, as long as it is incredibly profitable to create addictive technology, companies will continue to do so, regardless of whether this has the side effect of promoting conspiracy theories and polluting our information ecosystem. While individual designers may try to tweak product designs, we will not see substantial changes until the underlying profit incentives changes."
"We often talk to people who are eager for technical or design fixes to be a full solution to the kinds of problems that we've been discussing; for instance, a technical approach to debias data, or design guidelines for making technology less addictive. While such measures can be useful, they will not be sufficient to address the underlying problems that have led to our current state. For example, as long as it is incredibly profitable to create addictive technology, companies will continue to do so, regardless of whether this has the side effect of promoting conspiracy theories and polluting our information ecosystem. While individual designers may try to tweak product designs, we will not see substantial changes until the underlying profit incentives changes."
]
},
{
@ -908,9 +908,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Clean air and clean drinking water are public goods which are nearly impossible to protect through individual market decisions, but rather require coordinated regulatory action. Similarly, many of the harms resulting from unintended consequences of misuses of technology involve public goods, such as a polluted information environment or deteriorated ambient privacy. Too often privacy is framed as an individual right, yet there are societal impacts to widespread surveillance (which would still be the case even if it was possible for a few individuals to opt out)\n",
"Clean air and clean drinking water are public goods which are nearly impossible to protect through individual market decisions, but rather require coordinated regulatory action. Similarly, many of the harms resulting from unintended consequences of misuses of technology involve public goods, such as a polluted information environment or deteriorated ambient privacy. Too often privacy is framed as an individual right, yet there are societal impacts to widespread surveillance (which would still be the case even if it was possible for a few individuals to opt out).\n",
"\n",
"Many of the issues we are seeing in tech are actually human rights issues, such as when a biased algorithm recommends that Black defendants to have longer prison sentences, when particular job ads are only shown to young people, or when police use facial recognition to identify protesters. The appropriate venue to address human rights issues is typically through the law.\n",
"Many of the issues we are seeing in tech are actually human rights issues, such as when a biased algorithm recommends that Black defendants have longer prison sentences, when particular job ads are only shown to young people, or when police use facial recognition to identify protesters. The appropriate venue to address human rights issues is typically through the law.\n",
"\n",
"We need both regulatory and legal changes, *and* the ethical behavior of individuals. Individual behavior change cant address misaligned profit incentives, externalities (where corporations reap large profits while off-loading their costs & harms to the broader society), or systemic failures. However, the law will never cover all edge cases, and it is important that individual software developers and data scientists are equipped to make ethical decisions in practice."
]
@ -940,7 +940,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Coming from a background of working with binary logic, the lack of clear answers in ethics can be frustrating at first. Yet, the implications of how our work impacts the world, including unintended consequences and the work becoming weaponization by bad actors, are some of the most important questions we can (and should!) consider. Even though there aren't any easy answers, there are definite pitfalls to avoid and practices to move towards more ethical behavior.\n",
"Coming from a background of working with binary logic, the lack of clear answers in ethics can be frustrating at first. Yet, the implications of how our work impacts the world, including unintended consequences and the work becoming weaponized by bad actors, are some of the most important questions we can (and should!) consider. Even though there aren't any easy answers, there are definite pitfalls to avoid and practices to move towards more ethical behavior.\n",
"\n",
"Many people (including us!) are looking for more satisfying, solid answers of how to address harmful impacts of technology. However, given the complex, far-reaching, and interdisciplinary nature of the problems we are facing, there are no simple solutions. Julia Angwin, former senior reporter at ProPublica who focuses on issues of algorithmic bias and surveillance (and one of the 2016 investigators of the COMPAS recidivism algorithm that helped spark the field of Fairness Accountability and Transparency) said in [a 2019 interview](https://www.fastcompany.com/90337954/who-cares-about-liberty-julia-angwin-and-trevor-paglen-on-privacy-surveillance-and-the-mess-were-in), “I strongly believe that in order to solve a problem, you have to diagnose it, and that were still in the diagnosis phase of this. If you think about the turn of the century and industrialization, we had, I dont know, 30 years of child labor, unlimited work hours, terrible working conditions, and it took a lot of journalist muckraking and advocacy to diagnose the problem and have some understanding of what it was, and then the activism to get laws changed. I feel like were in a second industrialization of data information... I see my role as trying to make as clear as possible what the downsides are, and diagnosing them really accurately so that they can be solvable. Thats hard work, and lots more people need to be doing it.” It's reassuring that Angwin thinks we are largely still in the diagnosis phase: if your understanding of these problems feels incomplete, that is normal and natural. Nobody has a “cure” yet, although it is vital that we continue working to better understand and address the problems we are facing.\n",
"\n",
@ -1034,9 +1034,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",

View File

@ -31,9 +31,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Having seen what it looks like to actually train a variety of models in chapter 2, lets now look under the hood and see exactly what is going on. Well start with computer vision, and will use that to introduce fundamental tools and concepts of deep learning.\n",
"Having seen what it looks like to actually train a variety of models in Chapter 2, lets now look under the hood and see exactly what is going on. Well start by using computer vision to introduce fundamental tools and concepts for deep learning.\n",
"\n",
"To be exact, we'll discuss the role of arrays and tensors, and of brodcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of loss function for our basic classification task, and the role of mini-batches. We'll also finally describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them working together.\n",
"To be exact, we'll discuss the role of arrays and tensors, and of broadcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of a loss function for our basic classification task, and the role of mini-batches. We'll also describe the math that a basic neural network is actually doing. Finally, we'll put all these pieces together to see them at work.\n",
"\n",
"In future chapters well do deep dives into other applications as well, and see how these concepts and tools generalize. But this chapter is about laying foundation stones. To be frank, that also makes this one of the harder chapters, because of how these concepts all depend on each other. Like an arch, all the stones need to be in place for the structure to stay up. Also like an arch, once that happens, it's a powerful structure that can support other things. But it requires some patience to assemble.\n",
"\n",
@ -51,7 +51,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to understand what happens in a computer vision model, we first have to understand how computers handle images. We'll use one of the most famous datasets in computer vision, [MNIST](https://en.wikipedia.org/wiki/MNIST_database), for our experiments. MNIST contains hand-written digits, collected by the National Institute of Standards and Technology, and collated into a machine learning dataset by Yann Lecun and his colleagues. Lecun used MNIST in 1998 to demonstrate [Lenet 5](http://yann.lecun.com/exdb/lenet/), the first computer system to demonstrate practically useful recognition of hand-written digit sequences. This was one of the most important breakthroughs in the history of AI."
"In order to understand what happens in a computer vision model, we first have to understand how computers handle images. We'll use one of the most famous datasets in computer vision, [MNIST](https://en.wikipedia.org/wiki/MNIST_database), for our experiments. MNIST contains hand-written digits, collected by the National Institute of Standards and Technology, and collated into a machine learning dataset by Yann Lecun and his colleagues. Lecun used MNIST in 1998 to demonstrate [Lenet 5](https://yann.lecun.com/exdb/lenet/), the first computer system to demonstrate practically useful recognition of hand-written digit sequences. This was one of the most important breakthroughs in the history of AI."
]
},
{
@ -65,13 +65,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The story of deep learning is one of tenacity and grit from a handful of dedicated researchers. After early hopes (and hype!) neural networks went out of favor in the 1990's and 2000's, and just a handful of researchers kept trying to make them work well. Three of them, Yann Lecun, Geoff Hinton, and Yoshua Bengio were awarded the highest honor in computer science, the Turing Award (generally considered the \"Nobel Prize of computer science\") after triumphing despite the deep skepticism and disinterest of the wider machine learning and statistics community.\n",
"The story of deep learning is one of tenacity and grit from a handful of dedicated researchers. After early hopes (and hype!) neural networks went out of favor in the 1990's and 2000's, and just a handful of researchers kept trying to make them work well. Three of them, Yann Lecun, Yoshua Bengio and Geoffrey Hinton were awarded the highest honor in computer science, the Turing Award (generally considered the \"Nobel Prize of computer science\") after triumphing despite the deep skepticism and disinterest of the wider machine learning and statistics community.\n",
"\n",
"<img src=\"images/turing_300.jpg\" id=\"dl_fathers\" caption=\"Left to right, Yann Lecun, Geoffrey Hinton and Yoshua Bengio\" alt=\"Picture of Yann Lecun, Geoffrey Hinton and Yoshua Bengio\">\n",
"<img src=\"images/turing_300.jpg\" id=\"dl_fathers\" caption=\"Left to right, Yann Lecun, Yoshua Bengio and Geoffrey Hinton\" alt=\"Picture of Yann Lecun, Yoshua Bengio and Geoffrey Hinton\">\n",
"\n",
"Geoff Hinton has told of how even academic papers showing dramatically better results than anything previously published would be rejected from top journals and conferences, just because they used a neural network. Yann Lecun's work on convolutional neural networks, which we will study in the next section, showed that these models could read hand-written text--something that had never been achieved before. However his breakthrough was ignored by most researchers, even as it was used commercially to read 10% of the checks in the US!\n",
"\n",
"In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working on the *LSTM* architecture with his student Sepp Hochreiter (widely used for speech recognition and other text modeling tasks, and used in the IMDb example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is the most important foundation of modern AI.\n",
"In addition to these three Turing Award winners, there are many other researchers who have battled to get us to where we are today. For instance, Jurgen Schmidhuber (who many believe should have shared in the Turing Award) pioneered many important ideas, including working on the *LSTM* architecture with his student Sepp Hochreiter (widely used for speech recognition and other text modeling tasks, and used in the IMDB example in <<chapter_intro>>). Perhaps most important of all, Paul Werbos in 1974 invented back-propagation for neural networks, the technique shown in this chapter and used universally for training neural networks ([Werbos 1994](https://books.google.com/books/about/The_Roots_of_Backpropagation.html?id=WdR3OOM2gBwC)). His development was almost entirely ignored for decades, but today it is the most important foundation of modern AI.\n",
"\n",
"There is a lesson here for all of us! On your deep learning journey you will face many obstacles, both technical, and (even more difficult) people around you who don't believe you'll be successful. There's one *guaranteed* way to fail, and that's to stop trying. We've seen that the only consistent trait amongst every fast.ai student that's gone on to be a world-class practitioner is that they are all very tenacious."
]
@ -87,7 +87,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For this initial tutorial we are just going to try to create a model that can recognise \"3\"s and \"7\"s. So let's download a sample of MNIST which contains images of just these digits:"
"For this initial tutorial we are just going to try to create a model that can classify any image as a \"3\" or a \"7\". So let's download a sample of MNIST which contains images of just these digits:"
]
},
{
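For reference, the download step just described might look like this in fastai (a sketch; `untar_data` and `URLs.MNIST_SAMPLE` are the standard fastai helpers for this sample dataset):

```python
from fastai.vision.all import *

# Download and extract the MNIST_SAMPLE subset (3s and 7s only),
# returning the local path it was extracted to
path = untar_data(URLs.MNIST_SAMPLE)
```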
@ -113,7 +113,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see what's in this directory by using `ls()`, a method added by fastai. This method returns an object of a special fastai class called `L`, which has all the same functionality of Python's builtin `list`, plus a lot more. One of its handy features is that, when printed, it displays the count of items, before listing the items themselves (if there's more than 10 items, it just shows the first few)."
"We can see what's in this directory by using `ls()`, a method added by fastai. This method returns an object of a special fastai class called `L`, which has all the same functionality of Python's built-in `list`, plus a lot more. One of its handy features is that, when printed, it displays the count of items, before listing the items themselves (if there's more than 10 items, it just shows the first few)."
]
},
{
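As a quick sketch of what that looks like (the listing shown is illustrative, not exact output):

```python
path.ls()
# e.g. (#3) [Path('train'),Path('valid'),Path('labels.csv')]
# the leading (#3) is the item count that L prints before the items
```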
@ -228,7 +228,7 @@
"source": [
"Here we are using the `Image` class from the *Python Imaging Library* (PIL), which is the most widely used Python package for opening, manipulating, and viewing images. Jupyter knows about PIL images, so it displays the image for us automatically.\n",
"\n",
"In a computer, everything is represented as a number. To view the numbers that make up this image, we have to convert it to a *NumPy array* or a *PyTorch tensor*. For instance, here's a few numbers from the top-left of the image, converted to a numpy array:"
"In a computer, everything is represented as a number. To view the numbers that make up this image, we have to convert it to a *NumPy array* or a *PyTorch tensor*. For instance, here's a few numbers from the top-left of the image, converted to a NumPy array:"
]
},
{
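That conversion might look like the following sketch (assuming `im3` is the PIL image opened above):

```python
from numpy import array

# a small slice near the top-left corner, shown as raw integer pixel values
array(im3)[4:10, 4:10]
```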
@ -1429,7 +1429,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll also check that one of the images looks okay. Since we now have tensors (which Jupyter by default will print as values), rather than PIL images (which Jupyter by default will display as an image), we need to use fastai's show_image function to display it:"
"We'll also check that one of the images looks okay. Since we now have tensors (which Jupyter by default will print as values), rather than PIL images (which Jupyter by default will display as an image), we need to use fastai's `show_image` function to display it:"
]
},
{
@ -1462,7 +1462,7 @@
"\n",
"Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we'll be needing this later, we'll also cast our stacked tensor to `float` now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.\n",
"\n",
"Generally when images are floats, the pixels are expected to be be zero and one, so we will also divide by 255 here."
"Generally when images are floats, the pixels are expected to be between zero and one, so we will also divide by 255 here."
]
},
{
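A sketch of the stacking and casting just described (assuming `three_tensors` is the list of individual `[28,28]` image tensors built earlier):

```python
import torch

# stack the list of [28,28] tensors into a single [N,28,28] tensor,
# cast to float, and scale the pixel values into the range 0-1
stacked_threes = torch.stack(three_tensors).float() / 255
stacked_threes.shape
```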
@ -1586,7 +1586,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"According to this dataset, this is the ideal number three! Let's do the same thing for the sevens, but let's put all the steps together at once to save some time:"
"According to this dataset, this is the ideal number three! (You may not like it, but this is what peak number 3 performance looks like.) You can see how it's very dark where all the images agree it should be dark, but it becomes wispy and blurry where the images disagree. \n",
"\n",
"Let's do the same thing for the sevens, but let's put all the steps together at once to save some time:"
]
},
{
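Doing "the same thing for the sevens" might look like this sketch (assuming a `seven_tensors` list built the same way as for the threes):

```python
# stack, scale, then average over the image axis to get the "ideal" 7
stacked_sevens = torch.stack(seven_tensors).float() / 255
mean7 = stacked_sevens.mean(0)
show_image(mean7)  # fastai's show_image, since mean7 is a plain tensor
```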
@ -1616,11 +1618,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now pick a \"3\", and measure its *distance* from each of these \"ideal digits\".\n",
"Let's now pick an arbitrary \"3\", and measure its *distance* from each of these \"ideal digits\".\n",
"\n",
"> stop: How would you calculate how similar a particular image is from each of our ideal digits? Remember to step away from this book and jot down some ideas, before you move on! Research shows that recall and understanding improves dramatically when you are *engaged* with the learning process by solving problems, experimenting, and trying new ideas yourself\n",
"\n",
"Here's our sample \"3\":"
"Here's a sample \"3\":"
]
},
{
@ -1659,7 +1661,7 @@
"- Take the mean of the *absolute value* of differences (_absolute value_ is the function that replaces negative values with positive values). This is called the *mean absolute difference* or *L1 norm*\n",
"- Take the mean of the *square* of differences (which makes everything positive) and then take the *square root* (which *undoes* the squaring). This is called the *root mean squared error (RMSE)* or *L2 norm*.\n",
"\n",
"> important: in this book we generally assume that you have completed high school maths, and remember at least some of it... But everybody forgets some things! It all depends on what you happen to have had reason to practice in the meantime. Perhaps you have forgotten what a _square root_ is, or exactly how they work. No problem! Any time you come across a maths concept that is not explained fully in this book, don't just keep moving on, but instead stop and look it up. Make sure you understand the basic idea of what that maths concept is, how it works, and why we might be using it. One of the best places to refresh your understanding is Khan Academy. For instance, Khan Academy has a great [introduction to square roots](https://www.khanacademy.org/math/algebra/x2f8bb11595b61c86:rational-exponents-radicals/x2f8bb11595b61c86:radicals/v/understanding-square-roots)."
"> important: in this book we generally assume that you have completed high school maths, and remember at least some of it... But everybody forgets some things! It all depends on what you happen to have had reason to practice in the meantime. Perhaps you have forgotten what a _square root_ is, or exactly how they work. No problem! Any time you come across a maths concept that is not explained fully in this book, don't just keep moving on, but instead stop and look it up. Make sure you understand the basic idea of what that the maths concept is, how it works, and why we might be using it. One of the best places to refresh your understanding is Khan Academy. For instance, Khan Academy has a great [introduction to square roots](https://www.khanacademy.org/math/algebra/x2f8bb11595b61c86:rational-exponents-radicals/x2f8bb11595b61c86:radicals/v/understanding-square-roots)."
]
},
{
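As a sketch, the two distance measures just described could be written like this for our sample `a_3` and the ideal `mean3` (names as used in the surrounding text):

```python
# mean absolute difference (L1 norm)
dist_3_abs = (a_3 - mean3).abs().mean()

# root mean squared error (L2 norm): square, take the mean, then undo the square
dist_3_sqr = ((a_3 - mean3) ** 2).mean().sqrt()
```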
@ -1724,7 +1726,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> s: Intuitively, the difference between L1 norm and mean squared error (_MSE_) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes)."
"> S: Intuitively, the difference between L1 norm and mean squared error (*MSE*) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes)."
]
},
{
@ -1758,14 +1760,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: When I first came across this \"L1\" thingie, I looked it up to see what on Earth it meant, found on Google that it is a _vector norm_ using _absolute value_, so looked up _vector norm_ and started reading: _Given a vector space V over a field F of the real or complex numbers, a norm on V is a nonnegative-valued any function p: V → \\[0,+∞) with the following properties: For all a ∈ F and all u, v ∈ V, p(u + v) ≤ p(u) + p(v)..._ Then I stopped reading. \"Ugh, I'll never understand math!\" I thought, for the thousandth time. Since then I've learned that every time these complex mathy bits of jargon come up in practice, it turns out I can replace them with a tiny bit of code! Like the _L1 loss_ is just equal to `(a-b).abs().mean()`, where `a` and `b` are tensors. I guess mathy folks just think differently to me... I'll make sure, in this book, every time some mathy jargon comes up, I'll give you the little bit of code it's equal to as well, and explain in common sense terms what's going on."
"> J: When I first came across this \"L1\" thingie, I looked it up to see what on Earth it meant, found on Google that it is a *vector norm* using *absolute value*, so looked up *vector norm* and started reading: *Given a vector space V over a field F of the real or complex numbers, a norm on V is a nonnegative-valued any function p: V → \\[0,+∞) with the following properties: For all a ∈ F and all u, v ∈ V, p(u + v) ≤ p(u) + p(v)...* Then I stopped reading. \"Ugh, I'll never understand math!\" I thought, for the thousandth time. Since then I've learned that every time these complex mathy bits of jargon come up in practice, it turns out I can replace them with a tiny bit of code! Like the _L1 loss_ is just equal to `(a-b).abs().mean()`, where `a` and `b` are tensors. I guess mathy folks just think differently to me... I'll make sure, in this book, every time some mathy jargon comes up, I'll give you the little bit of code it's equal to as well, and explain in common sense terms what's going on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above code we completed various mathematical operations on *PyTorch tensors*. If you've done some numeric programming in Pytorch before, you may recognize these as being similar to *Numpy arrays*. Let's have a look at those two very important classes."
"In the above code we completed various mathematical operations on *PyTorch tensors*. If you've done some numeric programming in PyTorch before, you may recognize these as being similar to *Numpy arrays*. Let's have a look at those two very important classes."
]
},
{
@ -1790,15 +1792,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A numpy array is multidimensional table of data, with all items of the same type. Since that can be any type at all, they could even be arrays of arrays, with the innermost arrays potentially being different sizes — this is called a \"jagged array\". By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or a \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as an integer or a float then numpy will store them as a compact C data structure in memory. This is where numpy shines. Numpy has a wide variety of operators and methods which can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.\n",
"Python is slow compared to many languages. Anything fast in Python, NumPy or PyTorch is likely to be a wrapper to a compiled object written (and optimised) in another language - specifically C. In fact, **NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than using pure Python.**\n",
"\n",
"In fact, **arrays and tensors can finish computations many thousands of times faster than using pure Python.**\n",
"A NumPy array is a multidimensional table of data, with all items of the same type. Since that can be any type at all, they could even be arrays of arrays, with the innermost arrays potentially being different sizes — this is called a \"jagged array\". By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or a \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as an integer or a float then NumPy will store them as a compact C data structure in memory. This is where NumPy shines. Numpy has a wide variety of operators and methods which can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.\n",
"\n",
"A PyTorch tensor is nearly the same thing as a numpy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all componentss. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n",
"A PyTorch tensor is nearly the same thing as a numpy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all components. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n",
"\n",
"The vast majority of methods and operators supported by numpy on these structures are also supported by PyTorch. But PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimised for the GPU, and can run much faster. In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you'll see, it would be impossible to do deep learning in practice without this capability.\n",
"The vast majority of methods and operators supported by NumPy on these structures are also supported by PyTorch. But PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimised for the GPU, and can run much faster (given lots of values to work on). In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you'll see, it would be impossible to do deep learning in practice without this capability.\n",
"\n",
"> s: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n",
"> S: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n",
"\n",
"Perhaps the most important new coding skill for a Python programmer to learn is how to effectively use the array/tensor APIs. We will be showing lots more tricks later in this book, but here's a summary of the key things you need to know for now."
]
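Here is a small side-by-side sketch using the standard NumPy and PyTorch creation APIs:

```python
import numpy as np
import torch

data = [[1, 2, 3], [4, 5, 6]]
arr = np.array(data)      # NumPy array
tns = torch.tensor(data)  # PyTorch tensor

# elementwise operations run in optimized C (or CUDA), not Python loops
tns * 2 + 1
# a tensor can also be moved to the GPU, if one is available:
# tns = tns.cuda()
```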
@ -1867,7 +1869,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"All the operations below are shown on tensors - the syntax and results for NumPy arrays is idential.\n",
"All the operations below are shown on tensors - the syntax and results for NumPy arrays is identical.\n",
"\n",
"You can select a row:"
]
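For instance, a quick sketch of the basic indexing operations (using the `tns` tensor defined above):

```python
tns[1]       # select a row
tns[:, 1]    # select a column
tns[1, 1:3]  # combine the two: row 1, columns 1 through 2
```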
@ -2049,9 +2051,9 @@
"source": [
"Recall that a metric is a number which is calculated from the predictions of our model, and the correct labels in our dataset, in order to tell us how good our model is. For instance, we could use either of the functions we saw in the previous section, mean squared error, or mean absolute error, and take the average of them over the whole dataset. However, neither of these are numbers that are very understandable to most people; in practice, we normally use *accuracy* as the metric for classification models.\n",
"\n",
"As we've discussed, we need to use a *validation set* to calculate our metric. That means we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"As we've discussed, we will want to calculate our metric over a a *validation set*. To get a validation set we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"\n",
"So to start with, let's create tensors for our threes and sevens from that directory."
"So to start with, let's create tensors for our threes and sevens from that directory. These are the tensors we will use calculate a metric measuring the quality of our first try model, which measures distance from an ideal image."
]
},
{
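Building those validation tensors might look like the following sketch, mirroring the steps used for the training data (it assumes fastai's `tensor` helper and PIL's `Image` are in scope):

```python
valid_3_tens = torch.stack([tensor(Image.open(o))
                            for o in (path/'valid'/'3').ls()]).float() / 255
valid_7_tens = torch.stack([tensor(Image.open(o))
                            for o in (path/'valid'/'7').ls()]).float() / 255
valid_3_tens.shape, valid_7_tens.shape
```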
@ -2084,7 +2086,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need a function that decides if a digit is a 3 or a 7. We need to know which of our \"ideal digits\" its closer to. First, we need a function that calculates the distance from a dataset to an ideal image. It turns out we can do that very simply, in this case calculating the mean absolute error:"
"It's good to get in the habit of checking shapes as you go. Here we see two tensors, one representing the threes validation set of 1,010 images of size 28x28, and one representing the sevens validation set of 1,028 images of size 28x28.\n",
"\n",
"Now we ultimately want to write a function `is_3` that will decide if an arbitrary image is a 3 or a 7. It will do this by deciding which of our two \"ideal digits\" such an arbitrary image is closer to. For that we need to define a notion of distance, that is, a function which calculates the distance between two images.\n",
"\n",
"We can do that very simply, writing a function that calculates the mean absolute error, using an experssion very similar to the one we wrote in the last section:"
]
},
{
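A sketch of that distance function (`mnist_distance` is the name the text uses; the `(-1,-2)` axes argument is explained a little further on):

```python
def mnist_distance(a, b):
    # mean absolute error, averaged over the last two (pixel) axes
    return (a - b).abs().mean((-1, -2))

mnist_distance(a_3, mean3)  # distance between our sample 3 and the ideal 3
```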
@ -2112,7 +2118,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Something very interesting happens when we run this function on the whole set of threes in the validation set:"
"This is the same value we previously calculated for the distance between these two images, the ideal three `mean_3` and the arbitrary sample three `a_3`, which are both single-image tensors with a shape of `[28,28]`.\n",
"\n",
"But in order to calculate a metric for overall accuracy, we will need to calculate the distance to the ideal three for _every_ image in the validation set. So how do we do that calculation? One could write a loop over all of the single-image tensors that are stacked within our validation set tensor, `valid_3_tens`, which has a shape `[1010,28,28]` representing 1,010 images. But there is a better way.\n",
"\n",
"Something very interesting happens when we take this exact same distance function, designed for comparing two single images, but pass in as an argument `valid_3_tens`, the tensor which represents the threes validation set:"
]
},
{
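The call described above might look like this (a sketch using the names from the text):

```python
valid_3_dist = mnist_distance(valid_3_tens, mean3)
valid_3_dist.shape  # one distance per validation image: torch.Size([1010])
```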
@ -2141,11 +2151,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's returned the distance for every single image, as a vector (i.e. a rank 1 tensor) of length 1010 (the number of threes in our validation set). How did that happen? Have a look again at our function `mnist_distance`, and you'll see we have there `(a-b)`.\n",
"Instead of complaining about shapes not matching, it returned the distance for every single image, as a vector (i.e. a rank 1 tensor) of length 1010 (the number of threes in our validation set). How did that happen?\n",
"\n",
"The magic trick is that PyTorch, when it sees two tensors of different ranks, will *broadcast* the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
"Have a look again at our function `mnist_distance`, and you'll see we have there the subtraction `(a-b)`.\n",
"\n",
"Thanks to broadcasting, when PyTorch sees an operation on two tensors of the same rank, it completes the operation on each corresponding element of the two tensors, and returns the tensor result. For instance:"
"The magic trick is that PyTorch, when it tries to perform a simple operation subtraction between two tensors of different ranks, will use *broadcasting*. Broadcasting means PyTorch will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
"\n",
"After broadcasting expands the two argument tensors to have the same rank, PyTorch applies its usual logic for two tensors of the same rank, which is to perform the operation on each corresponding element of the two tensors, and returns the tensor result. For instance:"
]
},
{
@ -2201,16 +2213,16 @@
"source": [
"We are calculating the difference between the \"ideal 3\" and each of 1010 threes in the validation set, for each of `28x28` images, resulting in the shape `1010,28,28`.\n",
"\n",
"There's a couple of really cool things to know about this operation we just did:\n",
"There's a couple of important points about how broadcasting is implemented, which mean it is valuable not just for expressivity but also for performance:\n",
"\n",
"- PyTorch doesn't *actually* copy `mean3` 1010 times. Instead, it just *pretends* as if it was a tensor of that shape, but doesn't actually allocate any additional memory\n",
"- It does the whole calculation in C (or, if you're using a GPU, in CUDA, the equivalent of C on the GPU), tens of thousands of times faster than pure Python (up to millions of times faster on a GPU!)\n",
"\n",
"This is true of all broadcasting and elementwise operations and functions done in PyTorch. **It's the most important technique for you to know to create efficient PyTorch code.**\n",
"This is true of all broadcasting and elementwise operations and functions done in PyTorch. **It's the most important technique for you to know to create efficient PyTorch code.** \n",
"\n",
"Next in `mnist_distance` we see `abs()`. You might be able to guess now what this does when applied to a tensor... It applies the method to each individual element in the tensor, and returns a tensor of the results (that is, it applies the method \"elementwise\"). So in this case, we'll get back 1010 absolute values.\n",
"Next in `mnist_distance` we see `abs()`. You might be able to guess now what this does when applied to a tensor. It applies the method to each individual element in the tensor, and returns a tensor of the results (that is, it applies the method \"elementwise\"). So in this case, we'll get back 1010 absolute values.\n",
"\n",
"Finally, our function calls `mean((-1,-2))`. In Python, `-1` refers to the last element, and `-2` refers to the second last. So in this case, this tells PyTorch that we want to take the mean of the last two axes of the tensor. After taking the mean over the last two axes, we are left with just the first axis, which is why our final size was `(1010)`.\n",
"Finally, our function calls `mean((-1,-2))`. The tuple `(-1,-2)` represents a range of axes. In Python, `-1` refers to the last element, and `-2` refers to the second last. So in this case, this tells PyTorch that we want to take the mean ranging over the values indexed by the last two axes of the tensor. The last two axes are the horizontal and vertical dimensions of an image. So after taking the mean over the last two axes, we are left with just the first tensor axis, which indexes over our images, which is why our final size was `(1010)`. In other words, for every image, we averaged the intensity of all the pixels in that image.\n",
"\n",
"We'll be learning lots more about broadcasting throughout this book, especially in <<chapter_foundations>>, and will be practising it regularly too.\n",
"\n",
@ -2230,7 +2242,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's test it on our example case (note also that when we convert the boolean response to a float, we get a `1.0` for true and `0.0` for false):"
"Let's test it on our example case.\n",
"\n",
"Note also that when we convert the boolean response to a float, we get a `1.0` for true and `0.0` for false:"
]
},
{
@ -2257,7 +2271,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And testing it on the full validation set of threes:"
"Thanks to broadcasting, we can also test it on the full validation set of threes:"
]
},
{
@ -2314,18 +2328,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks like a pretty good start! We're getting over 90% accuracy on both threes and sevens.\n",
"This looks like a pretty good start! We're getting over 90% accuracy on both threes and sevens. And we've seen how to define a metric conveniently using broadcasting.\n",
"\n",
"But let's be honest: threes and sevens are very different looking digits. And we're only classifying two out of the ten possible digits so far. So we're going to need to do better!\n",
"\n",
"To do better, perhaps it is time to try a system that doess some learning -- that is, that can automatically modify itself to improve its performance. In other words, it's time to talk about the training process, and SGD."
"To do better, perhaps it is time to try a system that does some real learning -- that is, that can automatically modify itself to improve its performance. In other words, it's time to talk about the training process, and SGD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stochastic Gradient descent (SGD)"
"## Stochastic Gradient Descent (SGD)"
]
},
{
@ -2334,11 +2348,11 @@
"source": [
"Do you remember the way that Arthur Samuel described machine learning, which we quoted in <<chapter_intro>>:\n",
"\n",
"> : _Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programed would \"learn\" from its experience._\n",
"> : _Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would \"learn\" from its experience._\n",
"\n",
"As we discussed, this is the key to allowing us to have something which can get better and better — to learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can't really improve our pixel similarity approach by modifying a set of parameters (which will be the SGD part, as we will see). In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.\n",
"\n",
"Instead of trying to find the similarity between an image and a \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n",
"Instead of trying to find the similarity between an image and an \"ideal image\" we could instead look at each individual pixel, and come up with a set of weights for each pixel, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels towards the bottom right are not very likely to be activated for a seven, so they should have a low weight for a seven, but are more likely to be activated for an eight, so they should have a high weight for an eight. This can be represented as a function for each possible category, for instance the probability of being the number eight:\n",
"\n",
"```\n",
"def pr_eight(x,w) = (x*w).sum()\n",
@ -2484,12 +2498,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"These seven steps, illustrated in <<gradient_descent>> are the key to the training of all deep learning models and we'll be using the seven terms in the above diagram throughout this book. That deep learning turns out to rely entirely on these steps is extremely surprising and counter-intuitive. It's amazing that this process can solve such complex problems. But, as you'll see, it really does!\n",
"These seven steps, illustrated in <<gradient_descent>> are the key to the training of all deep learning models. That deep learning turns out to rely entirely on these steps is extremely surprising and counter-intuitive. It's amazing that this process can solve such complex problems. But, as you'll see, it really does!\n",
"\n",
"There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details which make a big difference for deep learning practitioners. But it turns out that the general approach to each one generally follows some basic principles:\n",
"\n",
"- **Initialize**:: we initialise the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
"- **Loss**:: This is the thing Arthur Samuel refered to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Initialize**:: we initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
"- **Loss**:: This is the thing Arthur Samuel referred to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes, by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
"- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
]
@ -2587,7 +2601,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can change our weight by a little in the direction of the slop, calculate our loss and adjustment again, and repeat this a few times. Eventually, we will get to the lowest point on our curve:"
"We can change our weight by a little in the direction of the slope, calculate our loss and adjustment again, and repeat this a few times. Eventually, we will get to the lowest point on our curve:"
]
},
{
@ -2648,7 +2662,7 @@
"source": [
"Notice the special method `requires_grad_`? That's the magical incantation we use to tell PyTorch that we want to calculate gradients with respect to that variable at that value. It is essentially tagging the variable, so PyTorch will remember to keep track of how to compute gradients of the other, direct calculations on it which you will ask for.\n",
"\n",
"> a: This API might throw you if you're coming from math or physics. In those contexts the \"gradient\" of a function is just another function (i.e., its derivative), so you might expect gradient-related API to give you a new function. But in deep learning, \"gradients\" usually means the _value_ of a function's derivative at a particular argument value. PyTorch API also puts the focus on that argument, not the function you're actually computing gradients of. It may feel backwards at first but it's just a different perspective.\n",
"> A: This API might throw you if you're coming from math or physics. In those contexts the \"gradient\" of a function is just another function (i.e., its derivative), so you might expect gradient-related APIs to give you a new function. But in deep learning, \"gradients\" usually means the _value_ of a function's derivative at a particular argument value. PyTorch API also puts the focus on that argument, not the function you're actually computing gradients of. It may feel backwards at first but it's just a different perspective.\n",
"\n",
"Now we calculate our function with that value. Notice how PyTorch prints not just the value calculated, but also a note that it has a gradient function it'll be using to calculate our gradient when needed:"
]
@ -3150,7 +3164,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use these gradients to improve our parameters. We'll need to pick a learning rate (we'll discuss how to do that in practice in the next chapter; for now we'll just pick `1.0`):"
"We can use these gradients to improve our parameters. We'll need to pick a learning rate (we'll discuss how to do that in practice in the next chapter; for now we'll just pick `1e-5`(0.00001)):"
]
},
{
@ -3367,7 +3381,7 @@
"source": [
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
"\n",
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare to the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
"\n",
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient*. (Actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
]
@ -3391,7 +3405,7 @@
"\n",
"As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n",
"\n",
"> s: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5) so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are zero or infinite, so useless to do an update of gradient descent.\n",
 S: In mathematical terms,">
"> S: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5), so its derivative is nil almost everywhere (and infinite at the threshold). This gives gradients that are zero or infinite, which are useless for making an update with gradient descent.\n",
"\n",
"Instead, we need a loss function which, when our weights result in slightly better predictions, gives us a slightly better loss. So what does a \"slightly better prediction\" look like, exactly? Well, in this case, it means that, if the correct answer is a 3, then the score is a little higher, or if the correct answer is a 7, then the score is a little lower.\n",
"\n",
@ -3401,7 +3415,7 @@
"\n",
"The purpose of the loss function is to measure the difference between predicted values and the true values -- that is, the targets (aka, the labels). So let's make another argument `targets`, a vector (i.e., another rank-1 tensor), indexed over the images, with a value of 0 or 1 which tells whether that image actually is a 3.\n",
"\n",
"So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence that the first was a 3, with slight confidence that the second was a 7, and with fair confidence (and incorrectly!) that the last was a 7. This would mean our loss function would take receive values as its inputs:"
"So, for instance, suppose we had three images which we knew were a 3, a 7, and a 3. And suppose our model predicted with high confidence that the first was a 3, with slight confidence that the second was a 7, and with fair confidence (and incorrectly!) that the last was a 7. This would mean our loss function would take values as its inputs:"
]
},
{
@ -3435,7 +3449,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're using a new function, `torch.where(a,b,c)`. This the same as running the list comprehension `[b[i] if a[i] else c[i] for i in range(len(a))]`, except it works on tensors, at C/CUDA speed. In plain English, this function will measure how distant each prediction is from 1 if it should be 1, and how distant it is from from 0 if it should be 0, and then it will take the mean of all those distances.\n",
"We're using a new function, `torch.where(a,b,c)`. This is the same as running the list comprehension `[b[i] if a[i] else c[i] for i in range(len(a))]`, except it works on tensors, at C/CUDA speed. In plain English, this function will measure how distant each prediction is from 1 if it should be 1, and how distant it is from from 0 if it should be 0, and then it will take the mean of all those distances.\n",
"\n",
"> note: It's important to learn about PyTorch functions like this, because looping over tensors in Python performs at Python speed, not C/CUDA speed!\n",
"\n",
@ -3550,7 +3564,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Pytorch actually already defines this for us, so we dont really need our own version. This is an important function in deep learning, since we often want to ensure values between zero and one. This is what it looks like:"
"Pytorch actually already defines this for us, so we dont really need our own version. This is an important function in deep learning, since we often want to ensure values are between zero and one. This is what it looks like:"
]
},
{
@ -3603,7 +3617,7 @@
"\n",
"Having defined a loss function, now is a good moment to recapitulate why we did this. After all, we already had a *metric*, which was overall accuracy. So why did we define a *loss*?\n",
"\n",
"The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automted learning, the loss must be a function which has a meaningful derivative. It can't have big flat sections, and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. The requirements on loss sometimes it does not really reflect exactly what we are trying to achieve, but is something that is a compromise between our real goal, and a function that can be optimised using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch these are all averaged, and the overall mean is reported for the epoch.\n",
"The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automated learning, the loss must be a function which has a meaningful derivative. It can't have big flat sections, and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. The requirements on loss sometimes do not really reflect exactly what we are trying to achieve, but are rather a compromise between our real goal, and a function that can be optimised using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch these are all averaged, and the overall mean is reported for the epoch.\n",
"\n",
"Metrics, on the other hand, are the numbers that we really care about. These are the things which are printed at the end of each epoch, and tell us how our model is really doing. It is important that we learn to focus on these metrics, rather than the loss, when judging the performance of a model."
]
@ -3621,13 +3635,13 @@
"source": [
"Now that we have a loss function which is suitable to drive SGD, we can consider some of the details involved in the next phase of the learning process, which is *step* (i.e., change or update) the weights based on the gradients. This is called an optimisation step.\n",
"\n",
"In order to take an optimiser step we need to calculate the loss over one or more data items. How many should we use? We could calculate it for the whole dataset, and take the average, or we could calculate it for a single data item. But neither of these is ideal. Calculating it for the whole dataset would take a very long time. Calculating it for a single item would not use much inofmration, and so it would result in a very imprecise and unstable gradient. That is, you'd be going to the trouble of updating the weights but taking into account only how that would improve the model's performance on that single item.\n",
"In order to take an optimiser step we need to calculate the loss over one or more data items. How many should we use? We could calculate it for the whole dataset, and take the average, or we could calculate it for a single data item. But neither of these is ideal. Calculating it for the whole dataset would take a very long time. Calculating it for a single item would not use much information, and so it would result in a very imprecise and unstable gradient. That is, you'd be going to the trouble of updating the weights but taking into account only how that would improve the model's performance on that single item.\n",
"\n",
"So instead we take a compromise between the two: we calculate the average loss for a few data items at a time. This is called a *mini-batch*. The number of data items in the mini batch is called the *batch size*. A larger batch size means that you will get a more accurate and stable estimate of your datasets gradient on the loss function, but it will take longer, and you will get less mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately. We will talk about how to make this choice throughout this book.\n",
"\n",
"Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!.\n",
"Another good reason for using mini-batches rather than calculating the gradient on individual data items is that, in practice, we nearly always do our training on an accelerator such as a GPU. These accelerators only perform well if they have lots of work to do at a time. So it is helpful if we can give them lots of data items to work on at a time. Using mini-batches is one of the best ways to do this. However, if you give them too much data to work on at once, they run out of memory--making GPUs happy is also tricky!\n",
"\n",
"As we've seen, in the discussion of data augmentation, we get better generalisation if we can very things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our data set in order for every epoch, instead what we normally do in practice is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n",
"As we've seen, in the discussion of data augmentation, we get better generalisation if we can vary things during training. A simple and effective thing we can vary during training is what data items we put in each mini batch. Rather than simply enumerating our data set in order for every epoch, instead what we normally do is to randomly shuffle it on every epoch, before we create mini batches. PyTorch and fastai provide a class that will do the shuffling and mini batch collation for you, called `DataLoader`.\n",
"\n",
"A `DataLoader` can take any Python collection, and turn it into an iterator over many batches, like so:"
]
@ -3688,7 +3702,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we pass a Dataset to a DataLoader we will get back many batches which are themselves tuples of independent and dependent variable many batches:"
"When we pass a Dataset to a DataLoader we will get back many batches which are themselves tuples of independent and dependent variables:"
]
},
{
@ -3977,7 +3991,7 @@
"source": [
"Whilst we could use a python for loop to calculate the prediction for each image, that would be very slow. Because Python loops don't run on the GPU, and because Python is a slow language for loops in general, we need to represent as much of the computation in a model as possible using higher-level functions.\n",
"\n",
"In this case, there's an extremely convenient mathematical operation that calculates `w*x` for every row of a matrix--it's called *matrix multiplication*. <<matmul>> show what matrix multiplication looks like (diagram from Wikipedia)."
"In this case, there's an extremely convenient mathematical operation that calculates `w*x` for every row of a matrix--it's called *matrix multiplication*. <<matmul>> shows what matrix multiplication looks like (diagram from Wikipedia)."
]
},
{
@ -4741,7 +4755,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So far we have a general procedure for optimising the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add a non-linearity between two linear classifiers, and this is what will gived us a neural network.\n",
"So far we have a general procedure for optimising the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add a non-linearity between two linear classifiers, and this is what gives us a neural network.\n",
"\n",
"Here is the entire definition of a basic neural network:"
]
@ -4815,14 +4829,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: There is an enormous amount of jargon in deep learning, such as: _rectified linear unit_. The vast vast majority of this jargon is no more complicated than can be implemented in a short line of code and Python, as we saw in this example. The reality is that for academics to get their papers published they need to make them sound as impressive and sophisticated as possible. One of the ways that they do that is to introduce jargon. Unfortunately, this has the result that the field ends up becoming far more intimidating and difficult to get into than it should be. You do have to learn the jargon, because otherwise papers and tutorials are not going to mean much to you. But that doesn't mean you have to find the jargon intimidating. Just remember, when you come across a word or phrase that you haven't seen before, it will almost certainly turn out that it is a very simple concept that it is referring to."
 J: There is an enormous">
"> J: There is an enormous amount of jargon in deep learning, such as: _rectified linear unit_. The vast majority of this jargon is no more complicated than can be implemented in a short line of Python code, as we saw in this example. The reality is that for academics to get their papers published they need to make them sound as impressive and sophisticated as possible. One of the ways that they do that is to introduce jargon. Unfortunately, this has the result that the field ends up becoming far more intimidating and difficult to get into than it should be. You do have to learn the jargon, because otherwise papers and tutorials are not going to mean much to you. But that doesn't mean you have to find the jargon intimidating. Just remember, when you come across a word or phrase that you haven't seen before, it will almost certainly turn out that it is a very simple concept that it is referring to."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic idea is that by using more linear layers, we can have our model do more computation, and therefore model more complex functions. But there's no point just putting one linear layout directly after another one, because when we multiply things together and then at them up multiple times, that can be replaced by multiplying different things together and adding them up just once! That is to say, a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters.\n",
"The basic idea is that by using more linear layers, we can have our model do more computation, and therefore model more complex functions. But there's no point just putting one linear layout directly after another one, because when we multiply things together and then add them up multiple times, that can be replaced by multiplying different things together and adding them up just once! That is to say, a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters.\n",
"\n",
"But if we put a non-linear function between them, such as max, then this is no longer true. Now, each linear layer is actually somewhat decoupled from the other ones, and can do its own useful work. The max function is particularly interesting, because it operates as a simple \"if\" statement. For any arbitrarily wiggly function, we can approximate it as a bunch of lines joined together; to make it more close to the wiggly function, we just have to use shorter lines."
]
@ -4831,7 +4845,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> s: Mathematically, we say the composition of two linear functions is another linear function. So we can stack as many linear classifiers on top or each other, without non-linear functions between them, it will jsut be the same as one linear classifier."
 S: Mathematically,">
"> S: Mathematically, we say the composition of two linear functions is another linear function. So we can stack as many linear classifiers on top of each other as we like; without non-linear functions between them, it will just be the same as one linear classifier."
]
},
{
@ -5357,8 +5371,8 @@
"[options=\"header\"]\n",
"|=====\n",
"| Term | Meaning\n",
"|**ReLU** | Funxtion that returns 0 for negatives numbers and doesn't change positive numbers\n",
"|**mini-batch** | A few inputs and labels gathered together in two big arrays\n",
"|**ReLU** | Function that returns 0 for negative numbers and doesn't change positive numbers\n",
"|**mini-batch** | A few inputs and labels gathered together in two arrays. A gradient descent step is updated on this batch (rather than a whole epoch).\n",
"|**forward pass** | Applying the model to some input and computing the predictions\n",
"|**loss** | A value that represents how well (or badly) our model is doing\n",
"|**gradient** | The derivative of the loss with respect to some parameter of the model\n",
@ -5430,7 +5444,7 @@
"1. What is \"ReLU\"? Draw a plot of it for values from `-2` to `+2`.\n",
"1. What is an \"activation function\"?\n",
"1. What's the difference between `F.relu` and `nn.ReLU`?\n",
"1. The universal approximation theorem shows that any function can be approximately as closely as needed using just one nonlinearity. So why do we normally use more?"
"1. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?"
]
},
{
@ -5457,9 +5471,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
@ -5467,5 +5478,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -116,7 +116,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that this dataset provides us with \"images\" and \"annotations\" directories. The website for this dataset tells us that the annotations directory contains information about where the pets are rather than what they are. In this chapter we will be doing classification, not localisation, which is to say that we care about what the pets are not where they are. Therefore we will ignore the annotations directory for now. So let's have a look inside the images directory:"
"We can see that this dataset provides us with \"images\" and \"annotations\" directories. The website for this dataset tells us that the annotations directory contains information about where the pets are rather than what they are. In this chapter, we will be doing classification, not localization, which is to say that we care about what the pets are not where they are. Therefore we will ignore the annotations directory for now. So let's have a look inside the images directory:"
]
},
{
@ -161,7 +161,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The most powerful and flexible way to extract information from strings like this is to use a *regular expression*, also known as a *regex*. A regular expression is a special string, written in the regular expression language, which specifies a general rule for for deciding if another string passes a test (i.e., \"matches\" the regular expression), and also possibly for plucking a particular part or parts out of that other string. \n",
"The most powerful and flexible way to extract information from strings like this is to use a *regular expression*, also known as a *regex*. A regular expression is a special string, written in the regular expression language, which specifies a general rule for deciding if another string passes a test (i.e., \"matches\" the regular expression), and also possibly for plucking a particular part or parts out of that other string. \n",
"\n",
"In this case, we need a regular expressin that extracts the pet breed from the file name.\n",
"\n",
@ -1159,7 +1159,7 @@
"\n",
"The really interesting thing here is that this actually works just as well with more than two columns. To see this, consider what would happen if we added a activation column above for every digit (zero through nine), and then `targ` contained a number from zero to nine. As long as the activation columns sum to one (as they will, if we use softmax), then we'll have a loss function that shows how well we're predicting each digit.\n",
"\n",
"We're only picking the loss from the column containing the correct label. We don't to consider the other columns, because by the definition of softmax, they add up to one minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible, must mean we're also decreasing the activations of the remaining columns.\n",
"We're only picking the loss from the column containing the correct label. We don't need to consider the other columns, because by the definition of softmax, they add up to one minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible, must mean we're also decreasing the activations of the remaining columns.\n",
"\n",
"PyTorch provides a function that does exactly the same thing as `sm_acts[range(n), targ]` (except it takes the negative, because when applying the log afterward, we will have negative numbers), called `nll_loss` (*NLL* stands for *negative log likelihood*):"
]
@ -2525,25 +2525,10 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -9733,9 +9733,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",

View File

@ -266,7 +266,7 @@
"\n",
"A good example of `decode` is found in the `Normalize` transform that we saw in <<chapter_sizing_and_tta>>: to be able to plot the images its `decode` method undoes the normalization (i.e. it multiplies by the std and adds back the mean). On the other hand, data augmentation transforms do not have a `decode` method, since we want to show the effects on images, to make sure the data augmentation is working as we want.\n",
"\n",
"The second special behavior of `Transform`s is that they always get applied over tuples: in general, our data is always a tuple `(input,target)` (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as `Resize`, we don't want to resize the tuple, but resize the input (if applicable) and the target (if applicable). It's the same for the batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.\n",
"A special behavior of `Transform`s is that they always get applied over tuples: in general, our data is always a tuple `(input,target)` (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as `Resize`, we don't want to resize the tuple, but resize the input (if applicable) and the target (if applicable). It's the same for the batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.\n",
"\n",
"We can see this behavior if we pass a tuple of texts to `tok`:"
]
@ -279,8 +279,8 @@
{
"data": {
"text/plain": [
"((#374) ['xxbos','xxmaj','well',',','\"','cube','\"','(','1997',')'...],\n",
" (#207) ['xxbos','xxmaj','conrad','xxmaj','hall','went','out','with','a','bang'...])"
"((#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...],\n",
" (#238) ['xxbos','i','stopped','watching','this','film','half','way','through','.'...])"
]
},
"execution_count": null,
@ -303,7 +303,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to write a custom transform to apply to your data, the easiest way is to write a function:"
"If you want to write a custom transform to apply to your data, the easiest way is to write a function. As you can see in this example, a `Transform` will only be applied to a matching type, if a type is provided (otherwise it will always be applied). In the following code, the `:int` in the function signature means that the `f` only gets applied to ints. That's why `tfm(2.0)` returns `2.0`, but `tfm(2)` returns `3` here:"
]
},
{
@ -314,7 +314,7 @@
{
"data": {
"text/plain": [
"3"
"(3, 2.0)"
]
},
"execution_count": null,
@ -323,16 +323,47 @@
}
],
"source": [
"def f(x): return x+1\n",
"def f(x:int): return x+1\n",
"tfm = Transform(f)\n",
"tfm(2)"
"tfm(2),tfm(2.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`tfm` will automatically convert `f` to a `Transform` with no setup and no decode method. If you need either of those, you will need to subclass `Transform`. When writing this subclass, you need to implement the actual function in `encodes`, then (optionally), the setup behavior in `setups` and the decoding behavior in `decodes`:"
"Here `f` is converted to a `Transform` with no `setup` and no `decode` method.\n",
"\n",
"Python has a special syntax for passing a function (like `f`) to another function (or something that behaves like a function, known as a `callable` in Python), which is a *decorator*. A decorator is used by prepending a callable with `@`, and placing it before a function definition (there's lots of good online tutorials for Python decorators, so take a look if this is a new concept for you). The following is identical to the previous code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 2.0)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@Transform\n",
"def f(x:int): return x+1\n",
"f(2),f(2.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need either `setup` or `decode`, you will need to subclass `Transform`. When writing this subclass, you need to implement the actual function in `encodes`, then (optionally), the setup behavior in `setups` and the decoding behavior in `decodes`:"
]
},
{
@ -894,7 +925,7 @@
"source": [
"...except that now, you know how to customize every single piece of it!\n",
"\n",
"Let's practice what we just learned on this mid-level API for data preprocessing on a computer vision example now, with a Siamese Model input pipeline."
"Let's practice what we just learned on this mid-level API for data preprocessing on a computer vision example now, with a *Siamese Model* input pipeline."
]
},
{
@ -908,14 +939,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A Siamese model takes two images and has to determine if they are of the same classe or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. We will explain here how to prepare the data for such a model, then we will train that model in <<chapter_arch_details>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firs things first, let's get all the images in our dataset."
"A *Siamese model* takes two images and has to determine if they are of the same class or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. We will explain here how to prepare the data for such a model, then we will train that model in <<chapter_arch_details>>.\n",
"\n",
"Firs things first, let's get the images in our dataset."
]
},
{
@ -935,7 +961,7 @@
"source": [
"If we didn't care about showing our objects at all, we could directly create one transform to completely preprocess that list of files. We will want to look at those images though, so we need to create a custom type. When you call the `show` method on a `TfmdLists` or a `Datasets` object, it will decode items until it reaches a type that contains a `show` method and use it to show the object. That `show` method gets passed a `ctx`, which could be a matplotlib axes for images, or the row of a dataframe for texts.\n",
"\n",
"Here we create a `SiameseImage` object that subclasses tuples and is inteneded to be contain three things: two images and a boolean to know if they are the same or not. We implement the `show` method that concatenates the two images with a black line in the middle. You can skip the part that is in the if test (which is to show the `SiameseImage` when the images are pillow images and not tensors), the important part is in the last three lines."
"Here we create a `SiameseImage` object that subclasses `Tuple` and is intended to contain three things: two images, and a boolean that's `True` if they are the breed. We also implement the special `show` method, such that it concatenates the two images, with a black line in the middle. Don't worry too much about the part that is in the `if` test (which is to show the `SiameseImage` when the images are Pillow images, and not tensors), the important part is in the last three lines."
]
},
{
@ -1231,25 +1257,10 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1568,7 +1568,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this picture, our input $x_{t}$ enters on the bottom with the previous hidden state ($h_{t-1}$) and cell state ($x_{t-1}$). The four orange boxes represent four layers with the activation being either sigmoid (for $\\sigma$) or tanh. tanh is just a sigmoid rescaled to the range -1 to 1. Its mathematical expression can be written like this:\n",
"In this picture, our input $x_{t}$ enters on the bottom with the previous hidden state ($h_{t-1}$) and cell state ($c_{t-1}$). The four orange boxes represent four layers with the activation being either sigmoid (for $\\sigma$) or tanh. tanh is just a sigmoid rescaled to the range -1 to 1. Its mathematical expression can be written like this:\n",
"\n",
"$$\\tanh(x) = \\frac{e^{x} + e^{-x}}{e^{x}-e^{-x}} = 2 \\sigma(2x) - 1$$\n",
"\n",
@ -2347,18 +2347,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -47,7 +47,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular data set preprocessing added date features to the Bulldozers dataset. What kind of features might we be able to create from images?"
"One of the most powerful tools that machine learning practitioners have at their disposal is *feature engineering*. A *feature* is a transformation of the data which is designed to make it easier to model. For instance, the `add_datepart` function that we used for our tabular dataset preprocessing added date features to the Bulldozers dataset. What kind of features might we be able to create from images?"
]
},
{
@ -79,7 +79,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The grey grid to the left is our *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel, to each element of a 3x3 block of the image. The results of these multiplications are then added together. The diagram above shows an example of applying a kernel to a single location in the image, the 3x3 block around cell 18.\n",
"The 7x7 grid to the left is our *image* we're going to apply the kernel to. The convolution operation multiplies each element of the kernel, to each element of a 3x3 block of the image. The results of these multiplications are then added together. The diagram above shows an example of applying a kernel to a single location in the image, the 3x3 block around cell 18.\n",
"\n",
"Let's do this with code. First, we create a little 3x3 matrix like so:"
]
@ -1526,7 +1526,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We won't implement this function from scratch, using PyTorch's implementation (that is way faster than anything we could do in python) instead."
"We won't implement this convolution function from scratch, but use PyTorch's implementation instead (it is way faster than anything we could do in Python)."
]
},
{
@ -1540,7 +1540,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Convolution is such an important and widely-used operation that PyTorch has it builtin. It's called `F.conv2d`. The PyTorch docs tell us that it includes these parameters:\n",
"Convolution is such an important and widely-used operation that PyTorch has it builtin. It's called `F.conv2d` (Recall F is a fastai import from torch.nn.functional as recommended by PyTorch). The PyTorch docs tell us that it includes these parameters:\n",
"\n",
"- **input**: input tensor of shape `(minibatch, in_channels, iH, iW)`\n",
"- **weight**: filters of shape `(out_channels, in_channels, kH, kW)`\n",
@ -1764,7 +1764,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"With a 5x5 input, and 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map, as we can see in <<four_by_five_conv>>/"
"With a 5x5 input, and 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map, as we can see in <<four_by_five_conv>>"
]
},
{
@ -2027,7 +2027,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing to note here is that we didn't need to specify \"28x28\" as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel. So it needs to know how many pixels there are. But a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels, and the kernel size, as we say in the previous section.\n",
"One thing to note here is that we didn't need to specify \"28x28\" as the input size. That's because a linear layer needs a weight in the weight matrix for every pixel. So it needs to know how many pixels there are. But a convolution is applied over each pixel automatically. The weights only depend on the number of input and output channels, and the kernel size, as we saw in the previous section.\n",
"\n",
"Have a think about what the output shape is going to be.\n",
"\n",
@ -2446,14 +2446,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When writing this particular chapter, we had a lot of questions we needed answer to, to be able to explain to you those CNNs as best we could. Believe it or not, we found most of the answers on twitter. "
"When writing this particular chapter, we had a lot of questions we needed answers for, to be able to explain to you those CNNs as best we could. Believe it or not, we found most of the answers on Twitter. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A note about twitter"
"### A note about Twitter"
]
},
{
@ -2462,7 +2462,7 @@
"source": [
"We are not, to say the least, big users of social networks in general. But our goal of this book is to help you become the best deep learning practitioner you can, and we would be remiss not to mention how important Twitter has been in our own deep learning journeys.\n",
"\n",
"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, which is the part of Twitter where deep learning researchers and practitioners talk shop every day. As we were writing the section above, Jeremy wanted to double-check to ensure that what we were saying about stride 2 convolutions was accurate, so he asked on twitter:"
"You see, there's another part of Twitter, far away from Donald Trump and the Kardashians, which is the part of Twitter where deep learning researchers and practitioners talk shop every day. As we were writing the section above, Jeremy wanted to double-check to ensure that what we were saying about stride 2 convolutions was accurate, so he asked on Twitter:"
]
},
{
@ -2626,7 +2626,7 @@
"source": [
"We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolution layer will take an image with a certain number of channels (3 for the first layer for regular RGB color images) and output an image with a different number of channels. Like our hidden size that represented the numbers of neurons in a linear layer, we can decide to have has many filters as we want, and each of them will be able to specialize, some to detect horizontal edges, other to detect vertical edges and so forth, to give something like we studied in <<chapter_production>>.\n",
"\n",
"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the example fiven by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the example given by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
]
},
{
@ -2660,7 +2660,7 @@
"\n",
"Additionally, we may want to have a bias for each filter. In the example above, the final result for our convolutional layer would be $y_{R} + y_{G} + y_{B} + b$ in that case. Like in a linear layer, there are as many bias as we have kernels, so the bias is a vector of size `ch_out`.\n",
"\n",
"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer as 3 inputs.\n",
"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer has 3 inputs.\n",
"\n",
"There are lots of ways of processing color images. For instance, you can change them to black and white, or change from RGB to HSV (Hue, Saturation, and Value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
"\n",
@ -2765,7 +2765,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Always a good idea to look at your data before you use it:"
"It is always a good idea to look at your data before you use it:"
]
},
{
@ -2827,7 +2827,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last chapter, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last Section, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
"\n",
"As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer then every layer after that will end up twice as big as the previous version as well.\n",
"\n",
@ -3123,7 +3123,7 @@
"\n",
"Then, once we have found a nice smooth area for our parameters, we then want to find the very best part of that area, which means we have to bring out learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models, and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
"\n",
"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rate, we use less momentum, and we use more again in the annealing phase.\n",
"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase.\n",
"\n",
"We can use 1cycle training in fastai by calling `fit_one_cycle`:"
]
@ -3496,7 +3496,7 @@
"source": [
"This is just what we hope to see: a smooth development of activations, with no \"crashes\". Batchnorm has really delivered on its promise here! In fact, batchnorm has been so successful that we see it (or something very similar) today in nearly all modern neural networks.\n",
"\n",
"An interesting observation about models containing batch normalisation layers is that they tend to generalise better than models that don't contain them. Although we haven't as yet seen a rigourous analysis of what's going on here, most researchers believe that the reason for this is that batch normalisation add some extra randomness to the training process. Each mini batch will have a somewhat different mean and standard deviation to each other mini batch. Therefore, the activations will be normalised by different values each time. In order for the model to make accurate predictions, it will have to learn to become insensitive to these variations. In general, adding additional randomisation to the training process often helps.\n",
"An interesting observation about models containing batch normalisation layers is that they tend to generalise better than models that don't contain them. Although we haven't as yet seen a rigourous analysis of what's going on here, most researchers believe that the reason for this is that batch normalisation add some extra randomness to the training process. Each mini batch will have a somewhat different mean and standard deviation to other mini batches. Therefore, the activations will be normalised by different values each time. In order for the model to make accurate predictions, it will have to learn to become robust with these variations. In general, adding additional randomisation to the training process often helps.\n",
"\n",
"Since things are going so well, let's train for a few more epochs and see how it goes. In fact, let's even *increase* the learning rate, since the abstract of the batchnorm paper claimed we should be able to \"train at much higher learning rates\":"
]
@ -3757,20 +3757,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -30,9 +30,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter, we will build on top of the CNNs introduced in the previous chapter and explain to you the ResNet (for residual network) architecture. It was introduced in 2015 in [this article](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
"In this chapter, we will build on top of the CNNs (Convolutional Neural Networks) introduced in the previous chapter and explain to you the ResNet (for residual network) architecture. It was introduced in 2015 in [this article](https://arxiv.org/abs/1512.03385) and is by far the most used model architecture nowadays. More recent developments in image models almost always use the same trick of residual connections, and most of the time, they are just a tweak of the original ResNet.\n",
"\n",
"We will frist show you the basic ResNet as it was first designed, then explain to you what modern tweaks to it make it more performamt. But first, we will need a problem a little bit more difficult that the MNIST dataset, since we are already close to 100% accuracy with a regular CNN on it."
"We will first show you the basic ResNet as it was first designed, then explain to you what modern tweaks make it more performamt. But first, we will need a problem a little bit more difficult than the MNIST dataset, since we are already close to 100% accuracy with a regular CNN on it."
]
},
{
@ -103,14 +103,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images, and later on would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"When we looked at MNIST we were dealing with 28 x 28 pixel images. For Imagenette we are going to be training with 128 x 128 pixel images. Later on we would like to be able to use larger images as well — at least as big as 224 x 224 pixels, the ImageNet standard. Do you recall how we managed to get a single vector of activations for each image out of the MNIST convolutional neural network?\n",
"\n",
"The approach we used was to ensure that there was enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
"The approach we used was to ensure that there were enough stride two convolutions such that the final layer would have a grid size of one. Then we just flattened out the unit axes that we ended up with, to get a vector for each image (so a matrix of activations for a mini batch). We could do the same thing for Imagenette, but that's going to cause two problems:\n",
"\n",
"- We are going to need lots of stride two layers to make our grid one by one at the end — perhaps more than we would otherwise choose\n",
"- The model will not work on images of any size other than the size we originally trained on.\n",
"\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous, still sometimes used today, is the 2013 ImageNet winner VGG. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"One approach to dealing with the first of these issues would be to flatten the final convolutional layer in a way that handles a grid size other than one by one. That is, we could simply flatten a matrix into a vector as we have done before, by laying out each row after the previous row. In fact, this is the approach that convolutional neural networks up until 2013 nearly always did. The most famous is the 2013 ImageNet winner VGG, still sometimes used today. But there was another problem with this architecture: not only does it not work with images other than those of the same size as the training set, but it required a lot of memory, because flattening out the convolutional create resulted in many activations being fed into the final layers. Therefore, the weight matrices of the final layers were enormous.\n",
"\n",
"This problem was solved through the creation of *fully convolutional networks*. The trick in fully convolutional networks is to take the average of activations across a convolutional grid. In other words, we can simply use this function:"
]
@ -172,9 +172,9 @@
"source": [
"Once we are done with our convolutional layers, we will get activations of size `bs x ch x h x w` (batch size, a certain number of channels, height and width). We want to convert this to a tensor of size `bs x ch`, so we take the average over the last two dimensions and flatten the trailing `1 x 1` dimension like we did in our previous model. \n",
"\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
"This is different from regular pooling in the sense that those layers will generally take the average (for average pooling) or the maximum (for max pooling) of a window of a given size: for instance max pooling layers of size 2 that were very popular in older CNNs reduce the size of our image by half on each dimension by taking the maximum of each 2 by 2 window (with a stride of 2).\n",
"\n",
"As before, we can define a `Learner` with our custom model and the data we grabbed before then train it:"
"As before, we can define a `Learner` with our custom model and then train it on the data we grabbed before:"
]
},
{
@ -326,7 +326,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy Imagenette compared to our previous model, before building a version with all the recent tweaks."
"We now have all the pieces needed to build the models we have been using in each computer vision task since the beginning of this book: ResNets. We'll introduce the main idea behind them and show how it improves accuracy on Imagenette compared to our previous model, before building a version with all the recent tweaks."
]
},
{
@ -340,7 +340,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In 2015 the authors of ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
"In 2015 the authors of the ResNet paper noticed something that they found curious. Even after using batchnorm, they saw that a network using more layers was doing less well than a network using less layers — and there were no other differences between the models. Most interestingly, the difference was observed not only in the validation set, but also in the training set; so, it wasn't just a generalisation issue, but a training issue. As the paper explains:\n",
"\n",
"> : Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as [previously reported] and thoroughly verified by our experiments.\n",
"\n",
@ -666,9 +666,9 @@
"source": [
"Now we're making good progress!\n",
"\n",
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be strong practitioner, not just a theoretician.\n",
"The authors of the ResNet paper went on to win the 2015 ImageNet challenge. At the time, this was by far the most important annual event in computer vision. We have already seen another ImageNet winner: the 2013 winners, Zeiler and Fergus. It is interesting to note that in both cases the starting point for the breakthroughs were experimental observations. Observations about what layers actually learn, in the case of Zeiler and Fergus, and observations about which kind of networks can be trained, in the case of the ResNet authors. This ability to design and analyse thoughtful experiments, or even just to see an unexpected result say \"hmmm, that's interesting\" — and then, most importantly, to figure out what on earth is going on, with great tenacity, is at the heart of many scientific discoveries. Deep learning is not like pure mathematics. It is a heavily experimental field, so it's important to be a strong practitioner, not just a theoretician.\n",
"\n",
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids us falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
"Since the ResNet was introduced, there's been many papers studying it and applying it to many domains. One of the most interesting, published in 2018, is [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). It shows that using skip connections help smoothen the loss function, which makes training easier as it avoids falling into a very sharp area. <<resnet_surface>> shows a stunning picture from the paper, showing the bumpy terrain that SGD has to navigate to optimize a regular CNN (left) versus the smooth surface of a ResNet (right)."
]
},
{
@ -710,7 +710,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our the implementation we had before in that it begins with a few convolutional layers followed by a max pooling layer, instead of just starting with ResNet blocks. This is what the first layers look like:"
"So, as we scale up to the full ResNet, we won't show the original one, but the tweaked one, since it's substantially better. It differs a little bit from our previous implementation, in that instead of just starting with ResNet blocks, it begins with a few convolutional layers followed by a max pooling layer. This is what the first layers look like:"
]
},
{
@ -831,7 +831,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `_make_layer` function is just there to create a series of `nl` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section' for now, it'll be `1`, so it doesn't do anything.)\n",
"The `_make_layer` function is just there to create a series of `n_layers` blocks. The first one is is going from `ch_in` to `ch_out` with the indicated `stride` and all the others are blocks of stride 1 with `ch_out` to `ch_out` tensors. Once the blocks are defined, our model is purely sequential, which is why we define it as a subclass of `nn.Sequential`. (Ignore the `expansion` parameter for now--we'll discuss it in the next section. For now, it'll be `1`, so it doesn't do anything.)\n",
"\n",
"The various versions of the models (ResNet 18, 34, 50, etc) just change the number of blocks in each of those groups. This is the definition of a ResNet18:"
]
@ -928,7 +928,7 @@
"source": [
"Even although we have more channels (and our model is therefore even more accurate), our training is just as fast as before, thanks to our optimized stem.\n",
"\n",
"To make our model deeper without taking too much compute or memory, the ResNet paper introduced anotehr kind of blocks for ResNets with a depth of 50 or more, using something called a bottleneck. "
"To make our model deeper without taking too much compute or memory, the ResNet paper introduced another kind of block for ResNets with a depth of 50 or more, using something called a bottleneck. "
]
},
{
@ -956,7 +956,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is that useful? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first resnet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"Why is that useful? 1x1 convolutions are much faster, so even if this seems to be a more complex design, this block executes faster than the first ResNet block we saw. This then lets us use more filters: as we see on the illustration, the number of filters in and out is 4 times higher (256) and the 1 by 1 convs are here to diminish then restore the number of channels (hence the name bottleneck). The overall impact is that we can use more filters in the same amount of time.\n",
"\n",
"Let's try replacing our ResBlock with this bottleneck design:"
]
@ -1219,7 +1219,7 @@
"1. Why do skip connections allow us to train deeper models?\n",
"1. What does <<resnet_depth>> show? How did that lead to the idea of skip connections?\n",
"1. What is an identity mapping?\n",
"1. What is the basic equation for a resnet block (ignoring batchnorm and relu layers)?\n",
"1. What is the basic equation for a ResNet block (ignoring batchnorm and relu layers)?\n",
"1. What do ResNets have to do with \"residuals\"?\n",
"1. How do we deal with the skip connection when there is a stride 2 convolution? How about when the number of filters changes?\n",
"1. How can we express a 1x1 convolution in terms of a vector dot product?\n",
@ -1227,8 +1227,8 @@
"1. Explain what is shown in <<resnet_surface>>.\n",
"1. When is top-5 accuracy a better metric than top-1 accuracy?\n",
"1. What is the stem of a CNN?\n",
"1. Why use plain convs in the CNN stem, instead of resnet blocks?\n",
"1. How does a bottleneck block differ from a plain resnet block?\n",
"1. Why use plain convs in the CNN stem, instead of ResNet blocks?\n",
"1. How does a bottleneck block differ from a plain ResNet block?\n",
"1. Why is a bottleneck block faster?\n",
"1. How do fully convolution nets (and nets with adaptive pooling in general) allow for progressive resizing?"
]
@ -1255,9 +1255,7 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s"
]
"source": []
}
],
"metadata": {
@ -1268,20 +1266,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -394,7 +394,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a dataset and a collation function, we're ready to create `DataLoader`. We'll add two more things here: optional `shuffle` for the training set, and `ProcessPoolExecutor` to do our preprocessing in parallel. A parallel data loader is very important, because opening and decoding a jpeg image is a slow process. One CPU is not enough to decode images fast enough to keep a modern GPU busy."
"Now that we have a dataset and a collation function, we're ready to create `DataLoader`. We'll add two more things here: optional `shuffle` for the training set, and `ProcessPoolExecutor` to do our preprocessing in parallel. A parallel data loader is very important, because opening and decoding a jpeg image is a slow process. One CPU core is not enough to decode images fast enough to keep a modern GPU busy."
]
},
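As a rough sketch of the idea (the `decode_image` helper and batching logic here are our own illustration, not the notebook's code):

```python
from concurrent.futures import ProcessPoolExecutor
from PIL import Image

def decode_image(path):
    # Opening and decoding a JPEG is CPU-bound, so we farm it out to workers.
    return Image.open(path).resize((128, 128))

def batches(paths, bs=64, n_workers=8):
    # Decode each batch of images across several processes, so a single
    # CPU core is not the bottleneck feeding the GPU.
    with ProcessPoolExecutor(n_workers) as ex:
        for i in range(0, len(paths), bs):
            yield list(ex.map(decode_image, paths[i:i + bs]))
```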
{
@ -1888,18 +1888,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -18,7 +18,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You've made it! If you have worked through all of the notebooks to this point, then you have joined a small, but growing group of people that are able to harness the power of deep learning to solve real problems. You may not feel that way; in fact you probably do not feel that way. We have seen again and again that students that complete the fast.AI courses dramatically underestimate how effective they are as deep learning practitioners. We've also seen that these people are often underestimated by those that have come out of a classic academic background. So for you to rise above your own expectations and the expectations of others what you do next, after closing this book, is even more important than what you've done to get to this point.\n",
"Congratulations! You've made it! If you have worked through all of the notebooks to this point, then you have joined a small, but growing group of people that are able to harness the power of deep learning to solve real problems. You may not feel that way; in fact you probably do not feel that way. We have seen again and again that students that complete the fast.ai courses dramatically underestimate how effective they are as deep learning practitioners. We've also seen that these people are often underestimated by those that have come out of a classic academic background. So for you to rise above your own expectations and the expectations of others what you do next, after closing this book, is even more important than what you've done to get to this point.\n",
"\n",
"The most important thing is to keep the momentum going. In fact, as you know from your study of optimisers, momentum is something which can build upon itself! So think about what it is you can do now to maintain and accelerate your deep learning journey. <<do_next>> can give you a few ideas."
]
@ -48,7 +48,7 @@
"\n",
"Hopefully, by this point, you have a few little projects that you put together, and experiments that you've run. Our recommendation is generally to pick one of these and make it as good as you can. Really polish it up into the best piece of work that you can — something you are really proud of. This will force you to go much deeper into a topic, which will really test out your understanding, and give you the opportunity to see what you can do when you really put your mind to it.\n",
"\n",
"Also, you may want to take a look at the fast.AI free online course which covers the same material as this book. Sometimes, seeing the same material in two different ways, can really help to crystallise the ideas. In fact, human learning researchers have found that this is one of the best ways to learn material — to see the same thing from different angles, described in different ways.\n",
"Also, you may want to take a look at the fast.ai free online course which covers the same material as this book. Sometimes, seeing the same material in two different ways, can really help to crystallise the ideas. In fact, human learning researchers have found that this is one of the best ways to learn material — to see the same thing from different angles, described in different ways.\n",
"\n",
"Your final mission, should you choose to accept it, is to take this book, and give it to somebody that you know — and let somebody else start their way down their own deep learning journey!"
]
@ -69,20 +69,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -172,7 +172,7 @@
"\n",
"You can find more shortcuts by typing <kbd>h</kbd> (for help).\n",
"\n",
"You may need to use a terminal in a Jupyter Notebook environment (for example to git pull on a repository). That is very easy to do, just press 'New' in your Home directory and 'Terminal'. Don't know how to use the Terminal? We made a tutorial for that as well. You can find it [here](http://course-v3.fast.ai/terminal_tutorial.html).\n",
"You may need to use a terminal in a Jupyter Notebook environment (for example to git pull on a repository). That is very easy to do, just press 'New' in your Home directory and 'Terminal'. Don't know how to use the Terminal? We made a tutorial for that as well. You can find it [here](https://course.fast.ai/terminal_tutorial.html).\n",
"\n",
"![Terminal](images/chapter1_terminal.png)"
]
@ -526,9 +526,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",

View File

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -89,7 +89,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -178,7 +178,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -187,7 +187,7 @@
"2"
]
},
"execution_count": 3,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -198,7 +198,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -208,7 +208,7 @@
"<PIL.Image.Image image mode=RGB size=151x192 at 0x7F67678F7E90>"
]
},
"execution_count": 4,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -227,7 +227,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -252,7 +252,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {
"hide_input": true
},
@ -265,7 +265,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -966,7 +966,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {

View File

@ -1003,25 +1003,10 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -247,9 +247,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",

View File

@ -1865,7 +1865,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stochastic Gradient descent (SGD)"
"## Stochastic Gradient Descent (SGD)"
]
},
{
@ -3952,9 +3952,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
@ -3962,5 +3959,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -1706,25 +1706,10 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -8287,9 +8287,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",

View File

@ -185,8 +185,8 @@
{
"data": {
"text/plain": [
"((#374) ['xxbos','xxmaj','well',',','\"','cube','\"','(','1997',')'...],\n",
" (#207) ['xxbos','xxmaj','conrad','xxmaj','hall','went','out','with','a','bang'...])"
"((#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...],\n",
" (#238) ['xxbos','i','stopped','watching','this','film','half','way','through','.'...])"
]
},
"execution_count": null,
@ -213,7 +213,7 @@
{
"data": {
"text/plain": [
"3"
"(3, 2.0)"
]
},
"execution_count": null,
@ -222,9 +222,31 @@
}
],
"source": [
"def f(x): return x+1\n",
"def f(x:int): return x+1\n",
"tfm = Transform(f)\n",
"tfm(2)"
"tfm(2),tfm(2.0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 2.0)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@Transform\n",
"def f(x:int): return x+1\n",
"f(2),f(2.0)"
]
},
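The `int` annotation is doing the work here: a fastcore `Transform` only applies its function to items matching the annotated type, and passes everything else through unchanged. A small sketch of the same behavior (the function name is ours):

```python
from fastcore.transform import Transform

@Transform
def times_ten(x: int): return x * 10

# Ints are transformed; a string doesn't match the annotation,
# so it passes through untouched.
times_ten(3), times_ten('a')   # -> (30, 'a')
```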
{
@ -814,25 +836,10 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1571,18 +1571,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -1815,7 +1815,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### A note about twitter"
"### A note about Twitter"
]
},
{
@ -2607,20 +2607,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -830,9 +830,7 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s"
]
"source": []
}
],
"metadata": {
@ -843,20 +841,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -1300,18 +1300,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,

View File

@ -23,20 +23,8 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

View File

@ -259,9 +259,6 @@
}
],
"metadata": {
"jupytext": {
"split_at_heading": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",