This commit is contained in:
Jeremy Howard 2020-03-05 12:12:45 -08:00
parent e477ef2d85
commit 91206f880d
6 changed files with 283 additions and 313 deletions

View File

@ -10,35 +10,6 @@
"from utils import *"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"[preface]\n",
"== Introduction for early release"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thanks a lot for reading the early release of our notebooks! The cell above is an \"asciidoc\" cell--you can ignore them since they're not relevant for the notebooks. There are also some other special cells that will appear differently once we create PDF and/or paper versions of these notebooks. Notes, warnings and tips that get their own special blocks like this one:\n",
"\n",
"> note: This is an example of note\n",
"\n",
"There are also jargon cells (for the first time a new obscure term is mentioned):\n",
"\n",
"> jargon: Here we will introduce a new term\n",
"\n",
"We have asides from each of us that look like this:\n",
"\n",
"> s: This is an aside from Sylvain!\n",
"\n",
"You may see bits in the text like this: \"TK: figure showing bla here\" or \"TK: expand introduction\". \"TK\" is used to make places where we know something is missing and we will add them. This does not alter any of the core content as those are usually small parts/figures that are relatively independent form the flow and self-explanatory.\n",
"\n",
"Throughout the book, the version of the fastai library used is version 2. That version is not yet officially released and is for now separate from the main project. You can find it [here](https://github.com/fastai/fastai2)."
]
},
{
"cell_type": "raw",
"metadata": {},
@ -98,7 +69,7 @@
"source": [
"Deep learning has power, flexibility, and simplicity. That's why we believe it should be applied across many disciplines. These include the social and physical sciences, the arts, medicine, finance, scientific research, and much more. To give a personal example, despite having no background in medicine, Jeremy started Enlitic, a company that uses deep learning algorithms to diagnose illness and disease. Within months of starting the company, it was announced that their algorithm could identify malignent tumors [more accurately than radiologists](https://www.nytimes.com/2016/02/29/technology/the-promise-of-artificial-intelligence-unfolds-in-small-steps.html).\n",
"\n",
"Here's a list of some of the thousands of tasks that deep learning (or methods heavily using deep learning) is now the best in the world at:\n",
"Here's a list of some of the thousands of tasks where deep learning, or methods heavily using deep learning, is now the best in the world:\n",
"\n",
"- NLP:: answering questions; speech recognition; summarizing documents; classifying documents; finding names, dates, etc. in documents; searching for articles mentioning a concept\n",
"- Computer vision:: satellite and drone imagery interpretation (e.g. for disaster resilience); face recognition; image captioning; reading traffic signs; locating pedestrians and vehicles in autonomous vehicles\n",
@ -106,7 +77,7 @@
"- Biology:: folding proteins; classifying proteins; many genomics tasks, such as tumor-normal sequencing and classifying clinically actionable genetic mutations; cell classification; analyzing protein/protein interactions\n",
"- Image generation:: Colorizing images; increasing image resolution; removing noise from images; converting images to art in the style of famous artists\n",
"- Recommendation systems:: web search; product recommendations; home page layout\n",
"- Playing games (better than humans and better than any other computer algorithm): Chess, Go, most Atari videogames, many real-time strategy games\n",
"- Playing games:: Better than humans and better than any other computer algorithm at Chess, Go, most Atari videogames, many real-time strategy games\n",
"- Robotics:: handling objects that are challenging to locate (e.g. transparent, shiny, lack of texture) or hard to pick up\n",
"- Other applications:: financial and logistical forecasting; text to speech; much much more..."
]
@ -115,7 +86,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Deep learning is based on a type of models called neural networks. Before we explain all about them to you, let's start with a bit of history."
"What is remarkable is that deep learning has such varied application yet nearly all of deep learning is based on a single type of model, the neural network.\n",
"\n",
"But neural networks are not in fact completely new. In order to have a wider perspective on the field, it is worth it to start with a bit of history."
]
},
{
@ -165,7 +138,9 @@
"\n",
"> : _…people are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. …we will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain._ (PDP, chapter 1)\n",
"\n",
"The premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had (at that point) been so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claim that the PDP approach is \"closer than other frameworks\" to how the brain works, and therefore it might be better able to handle these kinds of tasks. The approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined \"Parallel Distributed Processing\" as requiring:\n",
"The premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had (at that point) been so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claim that the PDP approach is \"closer than other frameworks\" to how the brain works, and therefore it might be better able to handle these kinds of tasks.\n",
"\n",
"In fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined \"Parallel Distributed Processing\" as requiring:\n",
"\n",
"1. A set of *processing units*\n",
"1. A *state of activation*\n",
@ -180,7 +155,9 @@
"\n",
"In the 1980's most models were built with a second layer of neurons, thus avoiding the problem that had been identified by Minsky (this was their \"pattern of connectivity among units\", to use the framework above). And indeed, neural networks were widely used during the 80s and 90s for real, practical projects. However, again a misunderstanding of the theoretical issues held back the field. In theory, adding just one extra layer of neurons was enough to allow any mathematical model to be approximated with these neural networks, but in practice such networks were often too big and slow to be useful.\n",
"\n",
"Although researchers showed 30 years ago that to get practical good performance you need to use even more layers of neurons, it is only in the last decade that this has been more widely appreciated. Neural networks are now finally living up to their potential, thanks to the understanding to use more layers as well as improved ability to do so thanks to improvements in computer hardware, increases in data availability, and algorithmic tweaks that allow neural networks to be trained faster and more easily. We now have what Rosenblatt had promised: \"a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control\". And you will learn how to build them in this book."
"Although researchers showed 30 years ago that to get practical good performance you need to use even more layers of neurons, it is only in the last decade that this has been more widely appreciated. Neural networks are now finally living up to their potential, thanks to the understanding to use more layers as well as improved ability to do so thanks to improvements in computer hardware, increases in data availability, and algorithmic tweaks that allow neural networks to be trained faster and more easily. We now have what Rosenblatt had promised: \"a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control\".\n",
"\n",
"This is what you will learn how to build them in this book."
]
},
{
@ -194,7 +171,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"After reading this book, you will know:\n",
"To be exact, after reading this book you will know:\n",
"\n",
"- How to train models that achieve state of the art results in:\n",
" - Computer vision: Image classification (e.g. classify pet photos by breed), and image localization and detection (e.g. find where the animals in an image are)\n",
@ -717,7 +694,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, how do we know if this model is any good? You can see the error rate printed as the last column of the table. This is the proportion of images that were incorrectly identified. As you can see, the model is nearly perfect, even though the training time was only a few seconds (not including the one-time downloading of the dataset and the pretrained model). In fact, the accuracy you've achieved already is far better than anybody had ever achieved just 10 years ago!\n",
"So, how do we know if this model is any good? In the last column of the table you can see the error rate, which is the proportion of images that were incorrectly identified. The error rate serves as our metric -- our measure of model quality, chosen to be intuitive and comprehensible. As you can see, the model is nearly perfect, even though the training time was only a few seconds (not including the one-time downloading of the dataset and the pretrained model). In fact, the accuracy you've achieved already is far better than anybody had ever achieved just 10 years ago!\n",
"\n",
"Finally, let's check that this model actually works. Go and get a photo of a dog, or a cat; if you don't have one handy, just search Google images and download an image that you find there. Now execute the cell with `uploader` defined. It will output a button you can click, so you can select the image you want to classify."
]
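To make the relationship between the error rate and accuracy concrete, here is a minimal sketch; the prediction and label tensors below are invented stand-ins, not outputs from the model trained above:

```python
import torch

# Hypothetical predictions and true labels for 8 validation images
# (0 = cat, 1 = dog); in practice these come from the trained model.
preds  = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
labels = torch.tensor([1, 0, 1, 0, 0, 0, 1, 0])

error_rate = (preds != labels).float().mean()  # proportion incorrectly identified
accuracy   = 1 - error_rate                    # proportion correctly identified
print(error_rate.item(), accuracy.item())      # 0.125 0.875
```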
@ -808,7 +785,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's take a step back and have a look at what we actually did when running those lines of code."
"Congratulations on your first classifier!\n",
"\n",
"But what does this mean? But what did we actually do? In order to explain this, let's zoom out again to take in the big picture. "
]
},
{
@ -822,9 +801,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Well that was impressive--we trained a model! But... what does that actually *mean*? What did we actually *do*?\n",
"Your classifier is a deep learning model. As was already mentioned, deep learning models use neural networks, which originally date from the 1950s and have become powerful very recently thanks to recent advancements.\n",
"\n",
"To answer those questions, we need to zoom out a level from *deep learning* and discuss the more general *machine learning*. *Machine learning* is, like regular programming, a way to get computers to complete a specific task. But how would you use regular programming to do what we just did in the last section: recognize dogs vs cats in photos? We would have to write down for the computer the exact steps necessary to complete the task.\n",
"Another key piece of context is that deep learning is just a modern area in the more general discipline of *machine learning*. To understand the essence of what you did when you trained your own classication model, you don't need to understand deep learning. It is enough to see how your model and your training process are examples of the concepts that apply to machine learning in general.\n",
"\n",
"So in this section, we will describe what machine learning is. We will introduce the key concepts, and see how they can be traced back to the original essay that introduced the concept.\n",
"\n",
"*Machine learning* is, like regular programming, a way to get computers to complete a specific task. But how would you use regular programming to do what we just did in the last section: recognize dogs vs cats in photos? We would have to write down for the computer the exact steps necessary to complete the task.\n",
"\n",
"Normally, it's easy enough for us to write down the steps to complete a task when we're writing a program. We just think about the steps we'd take if we had to do the task by hand, and then we translate them into code. For instance, we can write a function that sorts a list. In general, we write a function that looks something like <<basic_program>> (where *inputs* might be an unsorted list, and *results* a sorted list)."
]
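To make the traditional shape of <<basic_program>> concrete, here is a minimal sketch of such a sorting function, where we spell out every step ourselves (an illustrative example, not code from the book's library):

```python
def sort_list(inputs):
    # Traditional programming: we write down the exact steps
    # (here a simple insertion sort) that turn inputs into results.
    results = []
    for x in inputs:
        i = 0
        while i < len(results) and results[i] < x:
            i += 1
        results.insert(i, x)  # place x at its sorted position
    return results

print(sort_list([3, 1, 2]))  # [1, 2, 3]
```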
@ -1050,7 +1033,7 @@
"\n",
"Finally, he says we need *a mechanism for altering the weight assignment so as to maximize the performance*. For instance, we could look at the difference in weights between the winning model and the losing model, and adjust the weights a little further in the winning *direction*.\n",
"\n",
"We can now see why he said that such a procedure *could be made entirely automatic and... a machine so programed would \"learn\" from its experience*. Learning would become entirely automatic when the adjustment of the weight was also automatic -- when instead of us improving a model by adjusting its weights, we had and automated mechanism that produced adjustments based on performance.\n",
"We can now see why he said that such a procedure *could be made entirely automatic and... a machine so programed would \"learn\" from its experience*. Learning would become entirely automatic when the adjustment of the weights was also automatic -- when instead of us improving a model by adjusting its weights manually, we relied on an automated mechanism that produced adjustments based on performance.\n",
"\n",
"<<training_loop>> shows the full picture of Samuel's idea of training a machine learning model."
]
@ -1279,14 +1262,14 @@
"\n",
"But what about that process? One could imagine that you might need to find a new \"mechanism\" for automatically updating weight for every problem. This would be laborious. What we'd like here as well is a completely general way to update the weights of a neural network, to make it improve at any given task. Conveniently, this also exists!\n",
"\n",
"This is called *stochastic gradient descent* (SGD). We'll see how neural networks and SGD work in detail later in this book, as well as explaining the universal approximation theorem. For now, however, we will instead use Samuel's own words: *We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programed would \"learn\" from its experience.*"
"This is called *stochastic gradient descent* (SGD). We'll see how neural networks and SGD work in detail in <<chapter_mnist_basics>>, as well as explaining the universal approximation theorem. For now, however, we will instead use Samuel's own words: *We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programed would \"learn\" from its experience.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> J: Don't worry, neither SGD nor neural nets are mathematically complex. In fact, I'll tell you *exactly* how they work right now! In a neural net, we take the input (e.g. the pixels of an image), multiply it by some (initially random) numbers (the \"weights\" or \"parameters\"), and add them up. We do that a few times with different weights to get a few values. We then replace all the negative numbers with zeros. Those two steps are called a *layer*. Then we repeat those two steps a few times, creating more *layers*. Finally, we add up the values. That's it: a neural net! Then we compare the value that comes out to our target (e.g. we might decide \"dog\" is `1` and \"cat\" is `0`), and calculate the *derivative* of the error with regards to the models weights (except we don't have to do it ourselves; it's entirely automated by PyTorch). This tells us how much each weight impacted the loss. We multiply that by a small number (around 0.01, normally), and subtract it from the weights. We repeat this process a few times for every input. That's it: the entirety of creating a training a neural net! In the rest of this book we'll learn about *how* and *why* this works, along with some tricks to speed it up and make it more reliable, and how to implement it in fastai and PyTorch. *(TK AG: Jeremy, I think we should cut this aside entirely. There are already probably too many parenthetical notes in this chapter which risk obscuring the thread of explanation, and this one is so terse I fear it is more likely to confuse than to reassure.)*"
"> J: Don't worry, neither SGD nor neural nets are mathematically complex. Both SGD and neural nets nearly entirely rely on addition and multiplication to do their work (but they do a *lot* of addition and multiplication!) The main reaction we hear from students when they see the details is: \"is that all it is?\""
]
},
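For readers who want to see Jeremy's claim as code, here is a minimal sketch in plain PyTorch of that kind of computation: multiply, add up, zero the negatives, compare to a target, and nudge the weights. The layer sizes, initial scaling, and learning rate are arbitrary choices for illustration, not fastai's settings:

```python
import torch

x = torch.randn(784)                                # input: the pixels of one image
y = torch.tensor(1.0)                               # target: 1 for "dog", 0 for "cat"
w1 = (torch.randn(784, 30) * 0.1).requires_grad_()  # initially random weights
w2 = (torch.randn(30, 1) * 0.1).requires_grad_()

for _ in range(10):
    h = (x @ w1).clamp(min=0)  # multiply and add up, then zero the negatives: a layer
    out = (h @ w2).sum()       # another layer, then add up the values
    loss = (out - y) ** 2      # compare the result to the target
    loss.backward()            # PyTorch calculates the derivatives for us
    with torch.no_grad():
        w1 -= 0.01 * w1.grad; w1.grad.zero_()  # adjust each weight by a small step
        w2 -= 0.01 * w2.grad; w2.grad.zero_()
```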
{
@ -1295,11 +1278,13 @@
"source": [
"In other words, to recap, a neural network is a particular kind of machine learning model, which fits right in to Samuel's original conception. Neural networks are special because they are highly flexible, which means they can solve an unusually range of problems just by finding the right weights. This is powerful, because stochastic gradient descent provides us a way to find those weight values automatically.\n",
"\n",
"Let's now try to fit our image classification problem into Samuel's framework.\n",
"Having zoomed out, let's now zoom back in and revisit our image classification problem into Samuel's framework.\n",
"\n",
"Our inputs, those are the images. Our weights, those are the weights in the neural net. Our model is a neural net. Our results those are the values that are calculated by the neural net.\n",
"Our inputs, those are the images. Our weights, those are the weights in the neural net. Our model is a neural net. Our results -- those are the values that are calculated by the neural net, like \"dog\" or \"cat\".\n",
"\n",
"So now we just need some *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*. Well that's easy enough: we can see how accurate our model is at predicting the correct answers! So put this all together, and we have an image recognizer."
"What about the next piece, an *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*? Determining \"actual performance\" is easy enough: we can simply define our model's performance as its accuracy at predicting the correct answers.\n",
"\n",
"Putting this all together, and assuming that SGD is our mechanism for updating the weight assginments, we can see how our image classifier is a machine learning model, much like Samuel envisioned."
]
},
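Here is a toy sketch of Samuel's scheme as a loop in code. The data and the one-weight threshold "model" are invented for illustration, and the update mechanism is naive random search rather than SGD; the point is only the shape of the loop:

```python
import torch

# Toy data: the label is 1 exactly when the input exceeds 0.5
inputs = torch.rand(100)
labels = (inputs > 0.5).float()

def model(x, w): return (x > w).float()  # results from inputs and a weight

def performance(w):                      # automatic test of a weight assignment
    return (model(inputs, w) == labels).float().mean()

w = torch.tensor(0.0)                    # an arbitrary initial weight
for _ in range(50):                      # mechanism for altering the weight
    candidate = w + torch.randn(1) * 0.1
    if performance(candidate) >= performance(w):
        w = candidate                    # keep changes that improve performance

print(w.item(), performance(w).item())
```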
{
@ -1313,11 +1298,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our picture is almost complete.\n",
"\n",
"All that remains is to add this last concept, of measuring a model's performance by comparing with the correct answer, and to update some of its terminology to match the usage of 2020 instead of 1961.\n",
"\n",
"Here is the modern deep learning terminology for all the pieces we have discussed:\n",
"Samuel was working in the 1960s but terminology has changed. Here is the modern deep learning terminology for all the pieces we have discussed:\n",
"\n",
"- The functional form of the *model* is called its *architecture* (but be careful--sometimes people use *model* as a synonym of *architecture*, so this can get confusing) ;\n",
"- The *weights* are called *parameters* ;\n",
@ -1452,7 +1433,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now see some fundamental things about training a deep learning model:\n",
"### Limitations inherent to machine learning\n",
"\n",
"From this picture we can now see some fundamental things about training a deep learning model:\n",
"\n",
"- A model cannot be created without data ;\n",
"- A model can only learn to operate on the patterns seen in the input data used to train it ;\n",
@ -1482,7 +1465,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have seen the base of the theory, let's go back to our code example and see how the code corresponds to the process we just described."
"Now that we have seen the base of the theory, let's go back to our code example and see in detail how the code corresponds to the process we just described."
]
},
{
@ -1865,7 +1848,8 @@
"|**pretrained model** | A model that has already been trained, generally using a large dataset, and will be fine-tuned\n",
"|**fine tune** | Update a pretrained model for a different task\n",
"|**epoch** | One complete pass through the input data\n",
"|**metric** | A measurement of how good the model is, using the validation set\n",
"|**loss** | A measure of how good the model is, chosen to drive training via SGD\n",
"|**metric** | A measurement of how good the model is, using the validation set, chosen for human consumption\n",
"|**validation set** | A set of data held out from training, used only for measuring how good the model is\n",
"|**training set** | The data used for fitting the model; does not include any data from the validation set\n",
"|**overfitting** | Training a model in such a way that it _remembers_ specific features of the input data, rather than generalizing well to data not seen during training\n",
@ -1880,13 +1864,13 @@
"source": [
"With this vocabulary in hand, we are now in a position to bring together all the key concepts so far. Take a moment to review those definitions and read the following summary. If you can follow the explanation, then you have laid down the basic coordinates for understanding many discussions to come.\n",
"\n",
"*Machine learning* is a discipline where we define a program not by writing it entirely ourselves, but by learning from data. *Deep learning* is a specialty within machine learning with uses *neural networks* using multiple *layers*. *Image classification* is a representative example (also known as *image recognition*). We start with *labeled data*, that is, a set of images where we have assigned a *label* to each image indicating what it represents. Our goal is to produce a program, called a *model*, which, given a new image, will make an accurate *prediction* regarding what that new image represents.\n",
"*Machine learning* is a discipline where we define a program not by writing it entirely ourselves, but by learning from data. *Deep learning* is a specialty within machine learning which uses *neural networks* using multiple *layers*. *Image classification* is a representative example (also known as *image recognition*). We start with *labeled data*, that is, a set of images where we have assigned a *label* to each image indicating what it represents. Our goal is to produce a program, called a *model*, which, given a new image, will make an accurate *prediction* regarding what that new image represents.\n",
"\n",
"Every model starts with a choice of *architecture*, a general template for how that kind of model works internally. The process of *training* (or *fitting*) the model is the process of finding a set of *parameter values* (or *weights*) which specializes that general architecture into a model that works well for our particular kind of data. In order to define how well a model does on a single prediction, we need to define a *loss function*, which defines how we score a prediction as good or bad.\n",
"Every model starts with a choice of *architecture*, a general template for how that kind of model works internally. The process of *training* (or *fitting*) the model is the process of finding a set of *parameter values* (or *weights*) which specializes that general architecture into a model that works well for our particular kind of data. In order to define how well a model does on a single prediction, we need to define a *loss function*, which defines how we score a prediction as good or bad, in order to support training.\n",
"\n",
"In order to make the training process go faster, we might start with a *pretrained model*, a model which has already been trained on someone else's data. We then adapt it to our data by training it a bit more on our data, a process called *fine tuning*.\n",
"\n",
"When we train a model, a key concern is to ensure that our model *generalizes* -- that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called *overfitting*. In order to avoid this, we always divide our data into two parts, the *training set* and the *validation set*. We train the model by showing it only the *training set* and then we evaluate how well the model is doing by seeing how well it predicts on items from the *validation set* . In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order to assess how well the model is doing on the validation set overall, we define a *metric* . During the training process, when the model has seen every item in the training set, we call that an *epoch*.\n",
"When we train a model, a key concern is to ensure that our model *generalizes* -- that is, that it learns general lessons from our data which also apply to new items it will encounter, so that it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called *overfitting*. In order to avoid this, we always divide our data into two parts, the *training set* and the *validation set*. We train the model by showing it only the *training set* and then we evaluate how well the model is doing by seeing how well it predicts on items from the *validation set* . In this way, we check if the lessons the model learns from the training set are lessons that generalize to the validation set. In order for a person to assess how well the model is doing on the validation set overall, we define a *metric* . During the training process, when the model has seen every item in the training set, we call that an *epoch*.\n",
"\n",
"All these concepts apply to machine learning in general. That is, they apply to all sorts of schemes for defining a model by training it with data. What makes deep learning distinctive is a particular class of architectures, the architectures based on *neural networks*. In particular, tasks like image classification rely heavily on *convolutional neural networks*, which we will discuss shortly."
]
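The loss/metric distinction from the vocabulary table can be seen directly in code; the numbers below are invented model outputs, not results from a real model:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs for 4 images, and the true classes (0 = cat, 1 = dog)
logits  = torch.tensor([[2.0, -1.0], [0.6, 0.3], [-1.2, 2.2], [1.5, 0.2]])
targets = torch.tensor([0, 1, 1, 0])

loss   = F.cross_entropy(logits, targets)                  # smooth; drives SGD
metric = (logits.argmax(dim=1) == targets).float().mean()  # accuracy; for humans
print(loss.item(), metric.item())                          # the metric is 0.75 here
```

The loss changes smoothly as the outputs change, which is what an automated training procedure needs; the metric jumps in steps of 1/4, which is fine for a human reading a table.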
@ -2243,7 +2227,8 @@
"from fastai2.text.all import *\n",
"\n",
"dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')\n",
"learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)\n",
"learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, \n",
" metrics=accuracy)\n",
"learn.fine_tune(4, 1e-2)\n",
"```\n",
"\n",
@ -2271,41 +2256,9 @@
"doc(learn.predict)\n",
"```\n",
"\n",
"This will make a small window pop with content like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"<h4 id=\"Learner.predict\" class=\"doc_header\"><code>Learner.predict</code><a href=\"https://github.com/fastai/fastai2/tree/master/fastai2/learner.py#L330\" class=\"source_link\" style=\"float:right\">[source]</a></h4>\n",
"\n",
"> <code>Learner.predict</code>(**`item`**, **`rm_type_tfms`**=*`None`*)\n",
"\n",
"Return the prediction on `item`, fully decoded, loss function decoded and probabilities\n",
"\n",
"<a href=\"https://dev.fast.ai/learner#Learner.predict\" target=\"_blank\" rel=\"noreferrer noopener\">Show in docs</a>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#hide_input\n",
"from IPython.display import display, HTML, Markdown\n",
"md = show_doc(learn.predict, disp=False)\n",
"md += f'\\n\\n<a href=\"https://dev.fast.ai/learner#Learner.predict\" target=\"_blank\" rel=\"noreferrer noopener\">Show in docs</a>'\n",
"display(Markdown(md))"
"This will make a small window pop with content like this:\n",
"\n",
"<img src=\"images/doc_ex.png\" width=\"600\">"
]
},
{
@ -2941,31 +2894,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@ -29,7 +29,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The five lines of code we saw in <<chaptter_intro>> are just one small part of the process of using deep learning in practice. In this chapter, we're going to use a computer vision example to look at the end-to-end process of creating a deep learning application. More specifically: we're going to build a bear classifier! In the process, we'll discuss the capabilities and constraints of deep learning, learn about how to create datasets, look at possible gotchas when using deep learning in practice, and more. Many of the key points will apply equally well to other deep learning problems, such as we showed in <<chaptter_intro>>. If you work through a problem similar in key respects to our example problems, we expect you to get excellent results with little code, quickly.\n",
"The five lines of code we saw in <<chapter_intro>> are just one small part of the process of using deep learning in practice. In this chapter, we're going to use a computer vision example to look at the end-to-end process of creating a deep learning application. More specifically: we're going to build a bear classifier! In the process, we'll discuss the capabilities and constraints of deep learning, learn about how to create datasets, look at possible gotchas when using deep learning in practice, and more. Many of the key points will apply equally well to other deep learning problems, such as we showed in <<chapter_intro>>. If you work through a problem similar in key respects to our example problems, we expect you to get excellent results with little code, quickly.\n",
"\n",
"Let's start with how you should frame your problem."
]
@ -274,7 +274,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> important: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use one particular provider, *Bing Image Search*, using the service they have as this book as written. We'll be providing more options and more up to date information on the [book website](https://book.fast.ai), so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
"> important: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use one particular provider, _Bing Image Search_, using the service they have as this book as written. We'll be providing more options and more up to date information on the http://book.fast.ai[book website], so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning."
]
},
{
@ -467,7 +467,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: I just love this about working in Jupyter notebooks! It's so easy to gradually build what I want, and check my work every step of the way. I make a *lot* of mistakes, so this is really helpful to me..."
"> j: I just love this about working in Jupyter notebooks! It's so easy to gradually build what I want, and check my work every step of the way. I make a _lot_ of mistakes, so this is really helpful to me..."
]
},
{
@ -622,7 +622,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: DataLoaders: a fastai class which stores whatever `DataLoader` objects you pass to it, and makes them available as properties."
"> jargon: DataLoaders: A fastai class which stores whatever `DataLoader` objects you pass to it, and makes them available as properties."
]
},
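The essence of that definition can be sketched in a few lines. This is a simplification of the real fastai class, shown only to make the "stores DataLoader objects, exposes them as properties" idea concrete:

```python
class DataLoaders:
    "Minimal sketch: stores the loaders it is given, in order."
    def __init__(self, *loaders): self.loaders = loaders
    @property
    def train(self): return self.loaders[0]  # first loader: training data
    @property
    def valid(self): return self.loaders[1]  # second loader: validation data

# Usage sketch; the lists stand in for real DataLoader objects
dls = DataLoaders(["train batches"], ["valid batches"])
print(dls.train, dls.valid)
```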
{
@ -1181,7 +1181,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: After cleaning the dataset using the above steps, we generally are seeing 100% accuracy on this task. We even see that result when we download a lot less images than the 150 per class we're using here. As you can see, the common complaint *you need massive amounts of data to do deep learning* can be a very long way from the truth!"
"\n",
"> note: After cleaning the dataset using the above steps, we generally are seeing 100% accuracy on this task. We even see that result when we download a lot less images than the 150 per class we're using here. As you can see, the common complaint _you need massive amounts of data to do deep learning_ can be a very long way from the truth!"
]
},
{
@ -1707,20 +1708,6 @@
"6. Click \"Launch\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@ -1820,7 +1807,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: I started a company 20 years ago called *Optimal Decisions* which used machine learning and optimisation to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described above to manage the potential downsides of something that might go wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end to end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms in production, but every rollout was successful."
"> j: I started a company 20 years ago called _Optimal Decisions_ which used machine learning and optimisation to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described above to manage the potential downsides of something that might go wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end to end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms in production, but every rollout was successful."
]
},
{
@ -1834,13 +1821,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the biggest challenges in rolling out a model is that your model may change the behaviour of the system it is a part of. For instance, consider YouTube's recommendation system. A couple of years ago Google talked about how they had introduced reinforcement learning (closely related to deep learning, but where your loss function represents a result which could be a long time after an action occurs) to improve their recommendation system. They described how they used an algorithm which made recommendations such that watch time would be optimised.\n",
"One of the biggest challenges in rolling out a model is that your model may change the behaviour of the system it is a part of. For instance, consider a \"predictive policing\" algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. In the Royal Statiscal Society paper [To predict and serve](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x), Kristian Lum and William Isaac write: \"predictive policing is aptly named: it is predicting future policing, not future crime\".\n",
"\n",
"However, human beings tend to be drawn towards controversial content. This meant that videos about things like conspiracy theories started to get recommended more and more by the recommendation system. Furthermore, it turns out that the kinds of people that are interested in conspiracy theories are also people that watch a lot of online videos! So, they started to get drawn more and more towards YouTube. The increasing number of conspiracy theorists watching YouTube resulted in the algorithm recommending more and more conspiracy theories and other extremist content, which resulted in more extremists watching videos on YouTube, and more people watching YouTube developing extremist views, which led to the algorithm recommending more extremist content... The system became so out of control that in February 2019 it led the New York Times to run the headline \"YouTube Unleashed a Conspiracy Theory Boom. Can It Be Contained?\"footnote:[https://www.nytimes.com/2019/02/19/technology/youtube-conspiracy-stars.html]\n",
"\n",
"One of our reviewers for this book, Aurélien Géron, led YouTube's video classification team from 2013 to 2016. He pointed out that it's not just feedback loops involving humans that are a problem. There can also be feedback loops without humans! He told us about an example from YouTube:\n",
"\n",
"> \"One important signal to classify the main topic of a video is the channel it comes from. For example, a video uploaded to a cooking channel is very likely to be a cooking video. But how do we know what topic a channel is about? Well… in part by looking at the topics of the videos it contains! Do you see the loop? For example, many videos have a description which indicates what camera was used to shoot the video. As a result, some of these videos might get classified as videos about “photography”. If a channel has such as misclassified video, it might be classified as a “photography” channel, making it even more likely for future videos on this channel to be wrongly classified as “photography”. This could even lead to runaway virus-like classifications! One way to break this feedback loop is to classify videos with and without the channel signal. Then when classifying the channels, you can only use the classes obtained without the channel signal. This way, the feedback loop is broken.\"\n",
"Part of the issue in this case is that in the presence of *bias* (which we'll discuss in depth in the next chapter), feedback loops can result in negative implications of that bias getting worse and worse. For instance, there are concerns that this is already happening in the US, where there is significant bias in arrest rates on racial grounds. [According to the ACLU](https://www.aclu.org/issues/smart-justice/sentencing-reform/war-marijuana-black-and-white), \"despite roughly equal usage rates, Blacks are 3.73 times more likely than whites to be arrested for marijuana\". The impact of this bias, along with the roll-out of predictive policing algorithms in many parts of the US, led Bärí Williams to [write in the NY Times](https://www.nytimes.com/2017/12/02/opinion/sunday/intelligent-policing-and-my-innocent-children.html): \"The same technology thats the source of so much excitement in my career is being used in law enforcement in ways that could mean that in the coming years, my son, who is 7 now, is more likely to be profiled or arrested — or worse — for no reason other than his race and where we live.\"\n",
"\n",
"A helpful exercise prior to rolling out a significant machine learning system is to consider this question: \"what would happen if it went really, really well?\" In other words, what if the predictive power was extremely high, and its ability to influence behaviour was extremely significant? In that case, who would be most impacted? What would the most extreme results potentially look like? How would you know what was really going on?\n",
"\n",

View File

@ -18,7 +18,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Acknowledgement: Dr Rachel Thomas**"
"### Sidebar: Acknowledgement: Dr Rachel Thomas"
]
},
{
@ -28,6 +28,13 @@
"This chapter was co-authored by Dr Rachel Thomas, the co-founder of fast.ai, and founding director of the Center for Applied Data Ethics at the University of San Francisco. It largely follows a subset of her syllabus for the \"Introduction to Data Ethics\" course that she developed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End sidebar"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -195,8 +202,15 @@
"\n",
"On the other hand, if a project you are involved in turns out to make a huge positive impact on even one person, this is going to make you feel pretty great!\n",
"\n",
"Okay, so hopefully we have convinced you that you ought to care. Now the question is: can you actually do anything can you make an impact beyond just maximising the predictive power of your models? Consider the pipeline are steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything. It's especially tricky if you're involved in data collection, which is even earlier in the pipeline.\n",
"Okay, so hopefully we have convinced you that you ought to care. But what should you do? As data scientists, we're naturally inclined to focus on making our model better at optimizing some metric. But optimizing that metric may not actually lead to better outcomes. And even if optimizing that metric *does* help create better outcomes, it almost certainly won't be the only thing that matters. Consider the pipeline of steps that occurs between the development of a model or an algorithm by a researcher or practitioner, and the point at which this work is actually used to make some decision. This entire pipeline needs to be considered *as a whole* if we're to have a hope of getting the kinds of outcomes we want.\n",
"\n",
"Normally there is a very long chain from one end to the other. This is especially true if you are a researcher where you don't even know if your research will ever get used for anything, or if you're involved in data collection, which is even earlier in the pipeline. But no-one is better placed to "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data often ends up being used for different purposes than why it was originally collected. IBM began selling to Nazi Germany well before the Holocaust, including helping with Germanys 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany. US census data was used to round up Japanese-Americans (who were US citizens) for internment during World War II. It is important to recognize how data and images collected can be weaponized later. Columbia professor [Tim Wu wrote](https://www.nytimes.com/2019/04/10/opinion/sunday/privacy-capitalism.html) that “You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”"
]
},
@ -291,7 +305,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have already explained in <<chapter_intro>> how an algorithm can interact with its enviromnent to create a feedback loop, making prediction that reinforces actions taken in the field, which lead to predictions even more pronounced in the same direciton. The New York Times published another article on YouTube's recommendation system, titled [On YouTubes Digital Playground, an Open Gate for Pedophiles](https://www.nytimes.com/2019/06/03/world/americas/youtube-pedophiles.html). The article started with this chilling story:"
"We have already explained in <<chapter_intro>> how an algorithm can interact with its enviromnent to create a feedback loop, making prediction that reinforces actions taken in the field, which lead to predictions even more pronounced in the same direction. \n",
"As an example, we'll discuss YouTube's recommendation system. A couple of years ago Google talked about how they had introduced reinforcement learning (closely related to deep learning, but where your loss function represents a result which could be a long time after an action occurs) to improve their recommendation system. They described how they used an algorithm which made recommendations such that watch time would be optimised.\n",
"\n",
"However, human beings tend to be drawn towards controversial content. This meant that videos about things like conspiracy theories started to get recommended more and more by the recommendation system. Furthermore, it turns out that the kinds of people that are interested in conspiracy theories are also people that watch a lot of online videos! So, they started to get drawn more and more towards YouTube. The increasing number of conspiracy theorists watching YouTube resulted in the algorithm recommending more and more conspiracy theories and other extremist content, which resulted in more extremists watching videos on YouTube, and more people watching YouTube developing extremist views, which led to the algorithm recommending more extremist content... The system became so out of control that in February 2019 it led the New York Times to run the headline \"YouTube Unleashed a Conspiracy Theory Boom. Can It Be Contained?\"footnote:[https://www.nytimes.com/2019/02/19/technology/youtube-conspiracy-stars.html]\n",
"\n",
"The New York Times published another article on YouTube's recommendation system, titled [On YouTubes Digital Playground, an Open Gate for Pedophiles](https://www.nytimes.com/2019/06/03/world/americas/youtube-pedophiles.html). The article started with this chilling story:"
]
},
{
@ -329,7 +348,9 @@
"source": [
"Russia Today's coverage of the Mueller report was an extreme outlier in how many channels were recommending it. This suggests the possibility that Russia Today, a state-owned Russia media outlet, has been successful in gaming YouTube's recommendation algorithm. The lack of transparency of systems like this make it hard to uncover the kinds of problems that we're discussing.\n",
"\n",
"Another example of a feedback loop is a predictive policing algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. University of Utah computer science processor Suresh Venkatasubramanian says about this: \"Predictive policing is aptly named: it is predicting future policing, not future crime.”\n",
"One of our reviewers for this book, Aurélien Géron, led YouTube's video classification team from 2013 to 2016 (well before the events discussed above). He pointed out that it's not just feedback loops involving humans that are a problem. There can also be feedback loops without humans! He told us about an example from YouTube:\n",
"\n",
"> : \"One important signal to classify the main topic of a video is the channel it comes from. For example, a video uploaded to a cooking channel is very likely to be a cooking video. But how do we know what topic a channel is about? Well… in part by looking at the topics of the videos it contains! Do you see the loop? For example, many videos have a description which indicates what camera was used to shoot the video. As a result, some of these videos might get classified as videos about “photography”. If a channel has such as misclassified video, it might be classified as a “photography” channel, making it even more likely for future videos on this channel to be wrongly classified as “photography”. This could even lead to runaway virus-like classifications! One way to break this feedback loop is to classify videos with and without the channel signal. Then when classifying the channels, you can only use the classes obtained without the channel signal. This way, the feedback loop is broken.\"\n",
"\n",
"There are positive examples of people and organizations attempting to combat these problems. Evan Estola, lead machine learning engineer at Meetup, [discussed the example](https://www.youtube.com/watch?v=MqoRzNhrTnQ) of men expressing more interest than women in tech meetups. Meetups algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop, but explicitly not using gender for that part of their model. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact. \"You need to decide which feature not to use in your algorithm… the most optimal algorithm is perhaps not the best one to launch into production\", he said.\n",
"\n",
@ -347,7 +368,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It is extremely important to keep in mind this kind of behavior can happen, and to either anticipate a feedback loop or take positive action to break it when you can the first signs of it in your own projects. Another thing to keep in mind is bias."
"It is extremely important to keep in mind this kind of behavior can happen, and to either anticipate a feedback loop or take positive action to break it when you can the first signs of it in your own projects. Another thing to keep in mind is *bias*, which, as we discussed in the previous chapter, can interact with feedback loops in very troublesome ways."
]
},
{
@ -880,9 +901,7 @@
"Ethical behavior in industry is necessary as well, since:\n",
"\n",
"- Law will not always keep up\n",
"- Edge cases will arise in which practitioners must use their best judgement.\n",
"\n",
"TK expand"
"- Edge cases will arise in which practitioners must use their best judgement."
]
},
{

View File

@ -31,9 +31,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that weve seen what it looks like to actually train a variety of models, lets now dig under the hood and see exactly what is going on. Well start with computer vision, and will use that to introduce many of the key concepts of deep learning. In future chapters well do deep dives into other applications as well, and well see how to use these insights to both improve our models accuracy, speed up its training, and turn it into a real working web application.\n",
"Now that weve seen what it looks like to actually train a variety of models, lets now dig under the hood and see exactly what is going on. Well start with computer vision, and will use that to introduce many key tools and concepts of deep learning. We'll discuss the role of arrays and tensors, and of brodcasting, a powerful technique for using them expressively. We'll explain stochastic gradient descent (SGD), the mechanism for learning by updating weights automatically. We'll discuss the choice of loss function for for such a classification task, and the role of mini-batches. And we'll put these pieces together, to see how they all fit.\n",
"\n",
"First, let's start by how images are represented in a computer, then we will make our way up to how to classify different type of images."
"In future chapters well do deep dives into other applications as well, and see how these concepts and tools generalize.\n",
"\n",
"But for now, let's start by considering how images are represented in a computer, then we will make our way up to how to classify different type of images."
]
},
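As a tiny preview of broadcasting (which is covered properly later in this chapter), here is the kind of expression it makes possible; the tensors are arbitrary examples:

```python
import torch

m = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])  # shape (3, 2)
v = torch.tensor([10., 20.])                      # shape (2,)

# Broadcasting: v behaves as if it were copied along the first axis,
# without the copies ever being allocated.
print(m + v)  # tensor([[11., 22.], [13., 24.], [15., 26.]])
```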
{
@ -136,7 +138,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The MNIST dataset shows a very common layout for machine learning datasets: separate folders for the *training set*, which is used to train a model, and the *validation set* (and/or *test set*), which is used to evaluate the model (we'll be talking a lot of these concepts very soon!) Let's see what's inside the training set:"
"The MNIST dataset shows a very common layout for machine learning datasets: separate folders for the *training set*, which is used to train a model, and the *validation set* (and/or *test set*), which is used to evaluate the model (we'll be talking a lot about these concepts very soon!) Let's see what's inside the training set:"
]
},
{
@ -1371,7 +1373,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, here is a first idea: how about we find the average pixel value for every pixel of the threes and do the same for each of the sevens. Then, to classify a digit see which of these two group averages it is most similar to. This certainly seems like it should be better than nothing, so it will make a good baseline."
"So, here is a first idea: how about we find the average pixel value for every pixel of the threes and do the same for each of the sevens. This will give us two group averages, defining what we might call the \"ideal\" 3 and 7. Then, to classify an image as digit, we see which of these two ideal digits the image is most similar to. This certainly seems like it should be better than nothing, so it will make a good baseline."
]
},
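A minimal sketch of that baseline, using tiny 2x2 "images" as stand-ins for the real 28x28 digits; the logic is the same:

```python
import torch

threes = torch.tensor([[[0.9, 0.1], [0.8, 0.2]],
                       [[0.7, 0.3], [0.9, 0.1]]])  # two tiny "3" images
sevens = torch.tensor([[[0.1, 0.9], [0.2, 0.8]],
                       [[0.3, 0.7], [0.1, 0.9]]])  # two tiny "7" images

ideal3 = threes.mean(dim=0)  # average each pixel position across the threes
ideal7 = sevens.mean(dim=0)

img = torch.tensor([[0.8, 0.2], [0.9, 0.1]])  # an image to classify
dist3 = (img - ideal3).abs().mean()           # how far from the ideal 3?
dist7 = (img - ideal7).abs().mean()           # how far from the ideal 7?
print("3" if dist3 < dist7 else "7")          # the closer ideal wins: "3"
```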
{
@ -1456,7 +1458,7 @@
"source": [
"For every pixel position, we want to compute the average over all the images of the intensity of that pixel. To do this we first combine all the images in this list into a single three-dimensional tensor. The most common way to describe such a tensor is to call it a *rank-3 tensor*. We often need to stack up individual tensors in a collection into a single tensor. Unsurprisingly, PyTorch comes with a function called `stack`.\n",
"\n",
"Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we'll be needing this later, we'll also cast our stcked tensor to `float` now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.\n",
"Some operations in PyTorch, such as taking a mean, require us to cast our integer types to float types. Since we'll be needing this later, we'll also cast our stacked tensor to `float` now. Casting in PyTorch is as simple as typing the name of the type you wish to cast to, and treating it as a method.\n",
"\n",
"Generally when images are floats, the pixels are expected to be be zero and one, so we will also divide by 255 here."
]
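A minimal sketch of those two steps, using random integer tensors as stand-ins for the images loaded from disk:

```python
import torch

# Stand-ins for images already converted to integer tensors (values 0-255)
img_tensors = [torch.randint(0, 256, (28, 28)) for _ in range(3)]

stacked = torch.stack(img_tensors)   # rank-3 tensor of shape (3, 28, 28)
stacked = stacked.float() / 255      # cast to float; scale pixels to 0-1
print(stacked.shape, stacked.dtype)  # torch.Size([3, 28, 28]) torch.float32
```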
@ -1720,7 +1722,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> s: Intuitively, the difference between L1 norm and mean squared error (*MSE*) is that the latter will penalize more heavily bigger mistakes than the former (and be more lenient with small mistakes)."
"> s: Intuitively, the difference between L1 norm and mean squared error (_MSE_) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes)."
]
},
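Sylvain's point is easy to verify numerically; the error values below are made up:

```python
import torch

diff = torch.tensor([-0.5, 0.0, 2.0])  # hypothetical prediction errors

l1  = diff.abs().mean()                # mean absolute difference (L1 norm)
mse = (diff ** 2).mean()               # mean squared error

# MSE punishes the single big mistake much more heavily
print(l1.item(), mse.item())           # 0.833... vs 1.416...
```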
{
@ -1775,21 +1777,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[Numpy](https://numpy.org/) is the most widely used library for scientific and numeric programming in Python, and provides very similar functionality and a very similar API to that provided by PyTorch; however, it does not support using the GPU, or calculating gradients, which are both critical for deep learning. Therefore, in this book we will generally use PyTorch tensors instead of NumPy arrays, where possible. (Note that fastai adds some features to NumPy and PyTorch to make them a bit more similar to each other; if any code in this book doesn't work on your computer, it's possible that you forgot to include a line at the start of your notebook such as: `from fastai.vision.all import *`.)\n",
"[Numpy](https://numpy.org/) is the most widely used library for scientific and numeric programming in Python, and provides very similar functionality and a very similar API to that provided by PyTorch; however, it does not support using the GPU, or calculating gradients, which are both critical for deep learning. Therefore, in this book we will generally use PyTorch tensors instead of NumPy arrays, where possible.\n",
"\n",
"So, what's an array? And what's a tensor?\n",
"(Note that fastai adds some features to NumPy and PyTorch to make them a bit more similar to each other. If any code in this book doesn't work on your computer, it's possible that you forgot to include a line at the start of your notebook such as: `from fastai.vision.all import *`.)\n",
"\n",
"And why should you care?"
"But what are arrays and tensors, and why should you care?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A numpy array is multidimensional table of data, with all items of the same type. Since that can be any type at all, they could even be arrays of arrays, with the innermost array potentially being different sizes — this is called a \"jagged array\". By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or a \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as an integer or a float then numpy will store them as a compact C data structure in memory. This is where numpy shines. Numpy has a wide variety of operators and methods which can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C!\n",
"A numpy array is multidimensional table of data, with all items of the same type. Since that can be any type at all, they could even be arrays of arrays, with the innermost arrays potentially being different sizes — this is called a \"jagged array\". By \"multidimensional table\" we mean, for instance, a list (dimension of one), a table or matrix (dimension of two), a \"table of tables\" or a \"cube\" (dimension of three), and so forth. If the items are all of some simple type such as an integer or a float then numpy will store them as a compact C data structure in memory. This is where numpy shines. Numpy has a wide variety of operators and methods which can run computations on these compact structures at the same speed as optimized C, because they are written in optimized C.\n",
"\n",
"**Arrays and tensors can finish computations many thousands of times faster than using pure Python!**\n",
"A PyTorch tensor is nearly the same thing. It, too, is a multidimensional table of data, with all items of the same type. However, they cannot be just any old type — they have to be a basic numeric type. Therefore, a PyTorch tensor cannot be a jagged array. It is always a regularly shaped multidimensional rectangular structure. The vast majority of methods and operators supported by numpy on these structures are also supported by PyTorch. But PyTorch has the very big benefit that these structures can live on the GPU, in which case this computation will be optimised for the GPU. And furthermore, PyTorch can automatically calculate derivatives of these operations, including combinations of them. As you'll see, it would be impossible to do deep learning in practice without this capability.\n",
"In fact, **arrays and tensors can finish computations many thousands of times faster than using pure Python.**\n",
"\n",
"A PyTorch tensor is nearly the same thing as a numpy array, but with an additional restriction which unlocks some additional capabilities. It's the same in that it, too, is a multidimensional table of data, with all items of the same type. However, the restriction is that a tensor cannot use just any old type — it has to use a single basic numeric type for all componentss. As a result, a tensor is not as flexible as a genuine array of arrays, which allows jagged arrays, where the inner arrays could have different sizes. So a PyTorch tensor cannot be jagged. It is always a regularly shaped multidimensional rectangular structure.\n",
"\n",
"The vast majority of methods and operators supported by numpy on these structures are also supported by PyTorch. But PyTorch tensors have additional capabilities. One major capability is that these structures can live on the GPU, in which case their computation will be optimised for the GPU, and can run much faster. In addition, PyTorch can automatically calculate derivatives of these operations, including combinations of operations. As you'll see, it would be impossible to do deep learning in practice without this capability.\n",
"\n",
"> s: If you don't know what C is, do not worry as you won't need it at all. In a nutshell, it's a low-level (low-level means more similar to the language that computers use internally) language that is very fast compared to Python. To take advantage of its speed while programming in Python, try to avoid as much as possible writing loops and replace them by commands that work directly on arrays or tensors.\n",
"\n",
@ -1998,7 +2003,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Tensors will automatically change from int to float if needed"
"Tensors will automatically change from `int` to `float` if needed"
]
},
{
@ -2033,16 +2038,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Metrics and broadcasting"
"## Computing metrics using broadcasting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A metric is a number which is calculated from the predictions of our model, and the correct labels in our dataset, and tells us something about how good our model is. For instance, we could use either of the functions we saw in the previous section, mean squared error or mean absolute error, and take the average of them over the whole dataset. However, neither of these are numbers that are very understandable to most people; in practice, we normally use *accuracy* as the metric for classification models.\n",
"Recall that a metric is a number which is calculated from the predictions of our model, and the correct labels in our dataset, in order to tell us how good our model is. For instance, we could use either of the functions we saw in the previous section, mean squared error, or mean absolute error, and take the average of them over the whole dataset. However, neither of these are numbers that are very understandable to most people; in practice, we normally use *accuracy* as the metric for classification models.\n",
"\n",
"As we've discussed, we need to use a *validation set* to calculate our metric. That means we need to do is remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"As we've discussed, we need to use a *validation set* to calculate our metric. That means we need to remove some of the data from training entirely, so it is not seen by the model at all. As it turns out, the creators of the MNIST dataset have already done this for us. Do you remember how there was a whole separate directory called \"valid\"? That's what this directory is for!\n",
"\n",
"So to start with, let's create tensors for our threes and sevens from that directory."
]
@ -2064,9 +2069,11 @@
}
],
"source": [
"valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])\n",
"valid_3_tens = torch.stack([tensor(Image.open(o)) \n",
" for o in (path/'valid'/'3').ls()])\n",
"valid_3_tens = valid_3_tens.float()/255\n",
"valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])\n",
"valid_7_tens = torch.stack([tensor(Image.open(o)) \n",
" for o in (path/'valid'/'7').ls()])\n",
"valid_7_tens = valid_7_tens.float()/255\n",
"valid_3_tens.shape,valid_7_tens.shape"
]
@ -2132,7 +2139,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's returned the distance for every single image, as a vector (i.e. rank 1 tensor) of length 1010 (the number of threes in our validation set). How did that happen? Have a look again at our function `mnist_distance`, and you'll see we have there `(a-b)`. The magic trick is that PyTorch, when it sees two tensors of different ranks, will `broadcast` the tensor with the smaller rank to have the same size as the one with the larger rank. Then, when PyTorch sees an operation on two tensors of the same rank, it completes the operation on each corresponding element of the two tensors, and returns the tensor result. For instance:"
"It's returned the distance for every single image, as a vector (i.e. a rank 1 tensor) of length 1010 (the number of threes in our validation set). How did that happen? Have a look again at our function `mnist_distance`, and you'll see we have there `(a-b)`.\n",
"\n",
"The magic trick is that PyTorch, when it sees two tensors of different ranks, will *broadcast* the tensor with the smaller rank to have the same size as the one with the larger rank. Broadcasting is an important capability that makes tensor code much easier to write.\n",
"\n",
"Thanks to broadcasting, when PyTorch sees an operation on two tensors of the same rank, it completes the operation on each corresponding element of the two tensors, and returns the tensor result. For instance:"
]
},
{
@ -2201,7 +2212,7 @@
"\n",
"We'll be learning lots more about broadcasting throughout this book, especially in <<chapter_foundations>>, and will be practising it regularly too.\n",
"\n",
"We can use this `mnist_distance` to figure out whether an image is a three or not by using the logic: if the distance between the digit in question and the ideal 3 is less than the distance to the ideal 7, then it's a 3. This function will automatically do broadcasting and be applied elementwise, just like all PyTorch functions and operators."
"We can use this `mnist_distance` to figure out whether an image is a three or not by using the following logic: if the distance between the digit in question and the ideal 3 is less than the distance to the ideal 7, then it's a 3. This function will automatically do broadcasting and be applied elementwise, just like all PyTorch functions and operators."
]
},
{
@ -2303,7 +2314,9 @@
"source": [
"This looks like a pretty good start! We're getting over 90% accuracy on both threes and sevens.\n",
"\n",
"But let's be honest: threes and sevens are very different looking digits. And we're only classifying two out of the ten possible digits so far. So we're going to need to do better! To do better, perhaps we should try some deep learning."
"But let's be honest: threes and sevens are very different looking digits. And we're only classifying two out of the ten possible digits so far. So we're going to need to do better!\n",
"\n",
"To do better, perhaps it is time to try a system that doess some learning -- that is, that can automatically modify itself to improve its performance. In other words, it's time to talk about the training process, and SGD."
]
},
{
@ -2334,7 +2347,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are assuming that X is the image, represented as a vector. In other words, with all of the rows stacked up end to end into a single long line. And we are assuming that the weights are a vector W. If we have this function, then we just need some way to update the weights to make them a little bit better. With such an approach, we can repeat that step a number of times, making the weights better and better, until they are as good as we can make them.\n",
"Here we are assuming that X is the image, represented as a vector -- in other words, with all of the rows stacked up end to end into a single long line. And we are assuming that the weights are a vector W. If we have this function, then we just need some way to update the weights to make them a little bit better. With such an approach, we can repeat that step a number of times, making the weights better and better, until they are as good as we can make them.\n",
"\n",
"We want to find the specific values for the vector W which causes our function to be high for those images that are actually an eight, and low for those images which are not. Searching for the best vector W is a way to search for the best function for recognising eights. (Because we are not yet using a deep neural network, we are limited by what our function can actually do — we are going to fix that constraint later in this chapter.) \n",
"\n",
@ -2344,7 +2357,7 @@
"1. For each image, use these weights to *predict* whether it appears to be a three or a seven\n",
"1. Based on these predictions, calculate how good the model is (its *loss*)\n",
"1. Calculate the *gradient*, which measures for each weight, how changing that weight would change the loss\n",
"1. *Step* all weights based on that calculation\n",
"1. *Step* (that is, change) all weights based on that calculation\n",
"1. Go back to the second step, and *repeat* the process\n",
"1. ...until you decide to *stop* the training process (for instance because the model is good enough, or you don't want to wait any longer)"
]
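},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of where we are heading, here is a minimal sketch of these steps on a toy problem: fitting a single weight `w` so that `w*x` matches `y = 3*x`. This is not the code we will use for MNIST, just the seven steps in miniature:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = torch.linspace(0, 1, 20)\n",
"y = 3 * x\n",
"w = torch.randn(1, requires_grad=True)  # initialize\n",
"for i in range(100):                    # ...and repeat, until we decide to stop\n",
"    pred = w * x                        # predict\n",
"    loss = ((pred - y)**2).mean()       # loss\n",
"    loss.backward()                     # gradient\n",
"    with torch.no_grad():\n",
"        w -= 0.1 * w.grad               # step\n",
"        w.grad.zero_()\n",
"w                                       # now close to 3"
]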
@ -2474,7 +2487,7 @@
"There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details which make a big difference for deep learning practitioners. But it turns out that the general approach to each one generally follows some basic principles:\n",
"\n",
"- **Initialize**:: we initialise the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initialising them to the percentage of times that that pixel is activated for that category. But since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well\n",
"- **Loss**:: This is the thing Arthur Samuel refered to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good, and vice versa (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Loss**:: This is the thing Arthur Samuel refered to: \"*testing the effectiveness of any current weight assignment in terms of actual performance*\". We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention)\n",
"- **Step**:: A simple way to figure out whether a weight should be increased a bit, or decreased a bit, would be just to try it. Increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount which works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out which direction, and roughly how much, to change each weight, without having to try all these small changes, by calculating *gradients*. This is just a performance optimisation, we would get exactly the same results by using the slower manual process as well\n",
"- **Stop**:: We have already discussed how to choose how many epochs to train a model for. This is where that decision is applied. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time."
]
@ -2483,7 +2496,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at a picture of what this would look like. First we will define a very simple function, the quadratic — let's pretend that this is our loss function:"
"Before applying these steps to our image classification problem, let's illustrate what they look like in a simpler case. First we will define a very simple function, the quadratic — let's pretend that this is our loss function, and `x` is a weight parameter of the function:"
]
},
{
@ -2613,7 +2626,9 @@
"source": [
"One important thing to be aware of: our function has lots of weights that we need to adjust, so when we calculate the derivative we won't get back one number, but lots of them — a gradient for every weight. But there is nothing mathematically tricky here; you can calculate the derivative with respect to one weight, and treat all the other ones as constant. Then repeat that for each weight. This is how all of the gradients are calculated, for every weight.\n",
"\n",
"We mentioned just now that you won't have to calculate any gradients yourselves. How can that be? Amazingly enough, PyTorch is able to automatically compute the derivative of nearly any function! What's more, it does it very fast. Most of the time, it will be at least as fast as any derivative function that you can create by hand. Let's see an example. First, pick a value (which must be a tensor) we want gradients at:"
"We mentioned just now that you won't have to calculate any gradients yourselves. How can that be? Amazingly enough, PyTorch is able to automatically compute the derivative of nearly any function! What's more, it does it very fast. Most of the time, it will be at least as fast as any derivative function that you can create by hand. Let's see an example.\n",
"\n",
"First, pick a tensor value which we want gradients at:"
]
},
{
@ -2629,9 +2644,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the special method `requires_grad_`? That's the magical incantation we use to tell PyTorch that we want to calculate gradients for that value.\n",
"Notice the special method `requires_grad_`? That's the magical incantation we use to tell PyTorch that we want to calculate gradients with respect to that variable at that value. It is essentially tagging the variable, so PyTorch will remember to keep track of how to compute gradients of the other, direct calculations on it which you will ask for.\n",
"\n",
"Now we calculate our function with that value (notice how PyTorch prints not just the value calculated, but also a note that it has a gradient function it'll be using to calculate our gradient when needed):"
"> a: This API might throw you if you're coming from math or physics. In those contexts the \"gradient\" of a function is just another function (i.e., its derivative), so you might expect gradient-related API to give you a new function. But in deep learning, \"gradients\" usually means the _value_ of a function's derivative at a particular argument value. PyTorch API also puts the focus on that argument, not the function you're actually computing gradients of. It may feel backwards at first but it's just a different perspective.\n",
"\n",
"Now we calculate our function with that value. Notice how PyTorch prints not just the value calculated, but also a note that it has a gradient function it'll be using to calculate our gradient when needed:"
]
},
{
@ -2675,7 +2692,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: The \"backward\" here refers to \"back propagation\", which is the name given to the process of calculating the derivative of each layer (we'll see how this is done exactly in chapter <chapter_foundations>, when we calculate the gradients of a deep neural net from scratch). This is called the \"backward pass\" of the network, as opposed to the \"forward pass\", which is where the activations are calculated. Life would probably be easier if `backward` was just called `calculate_grad`, but deep learning folks really do like to add jargon everywhere they can!"
"The \"backward\" here refers to \"back propagation\", which is the name given to the process of calculating the derivative of each layer (we'll see how this is done exactly in chapter <chapter_foundations>, when we calculate the gradients of a deep neural net from scratch). This is called the \"backward pass\" of the network, as opposed to the \"forward pass\", which is where the activations are calculated. Life would probably be easier if `backward` was just called `calculate_grad`, but deep learning folks really do like to add jargon everywhere they can!"
]
},
{
@ -2927,7 +2944,7 @@
"source": [
"We've added a bit of random noise, since measuring things manually isn't precise. This means it's not that easy to answer the question: what was the roller coaster's lowest speed? Using SGD we can try to find a function that matches our observations. We can't consider every possible function, so let's use a guess that it will be quadratic, i.e. a function of the form `a*(time**2)+(b*time)+c`.\n",
"\n",
"We want to distinguish clearly between the function's input (the time when we are measuring the coaster's speed) and its parameters (the values that define *which* quadratic we're trying). So let us collect the parameters in one argument and separate the input, `t`, and the parameters, `params` in the function's signature: "
"We want to distinguish clearly between the function's input (the time when we are measuring the coaster's speed) and its parameters (the values that define *which* quadratic we're trying). So let us collect the parameters in one argument and separate the input, `t`, and the parameters, `params`, in the function's signature: "
]
},
{
@ -2949,7 +2966,7 @@
"\n",
"If we can solve this problem for the three parameters of a quadratic function, we'll be able to apply the same approach for other, more complex functions with more parameters--such as a neural net. So let's find the parameters for `f` first, and then we'll come back and do the same thing for the MNIST dataset with a neural net.\n",
"\n",
"We need to define first what we mean by \"best\". We can define this precisely by choosing a *loss function*, which will return a value based on a prediction and a target, where lower values of the function correspond to \"better\" predictions. For continuous data, it's common to use *mean squared error*:"
"We need to define first what we mean by \"best\". We define this precisely by choosing a *loss function*, which will return a value based on a prediction and a target, where lower values of the function correspond to \"better\" predictions. For continuous data, it's common to use *mean squared error*:"
]
},
{
@ -2965,7 +2982,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's work through our 7 step process.\n",
"Now, let's work through our 7 step process.\n",
"\n",
"Step 1--*Initialize* the parameters to random values, and tell PyTorch that we want to track their gradients, using `requires_grad_`:"
]
@ -3350,7 +3367,7 @@
"\n",
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare to the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
"\n",
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient*. (actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest slope, you should eventually arrive at your destination. We use the gradient to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient*. (Actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
]
},
{
@ -3364,9 +3381,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get back to our MNIST problem. As we've seen, we need gradients in order to improve our model, and in order to calculate gradients we need some *loss function* that represents how good our model is. That is because the gradients are a measure of how that loss function changes with small tweaks to the weights. The obvious approach would be to use the accuracy for this purpose. In this case, we would calculate our prediction for each image, and then calculate the overall accuracy (remember, at first we simply use random weights), and then calculate the gradients of each weight with respect to that accuracy calculation.\n",
"Let's get back to our MNIST problem. As we've seen, we need gradients in order to improve our model, and in order to calculate gradients we need some *loss function* that represents how good our model is. That is because the gradients are a measure of how that loss function changes with small tweaks to the weights.\n",
"\n",
"Unfortunately, we have a significant technical problem here. The gradient of a function is its *slope*, or its steepness, which can be defined as *rise over run* -- that is, how much the value of function goes up or down, divided by how much you changed the input. We can write this in maths: `(y_new-y_old) / (x_new-x_old)`. Specifically, it is defined when x_new is very similar to x_old, meaning that their difference is very small. But accuracy only changes at all when a prediction changes from a 3 to a 7, or vice versa. So the problem is that a small change in weights from from x_old to x_new isn't likely to cause any prediction to change, so `(y_new - y_old)` will be zero. (In other words, the gradient is zero almost everywhere.) As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n",
"The obvious approach would be to use the accuracy as our loss function. In this case, we would calculate our prediction for each image, and then calculate the overall accuracy (remember, at first we simply use random weights), and then calculate the gradients of each weight with respect to that accuracy calculation.\n",
"\n",
"Unfortunately, we have a significant technical problem here. The gradient of a function is its *slope*, or its steepness, which can be defined as *rise over run* -- that is, how much the value of function goes up or down, divided by how much you changed the input. We can write this in maths: `(y_new-y_old) / (x_new-x_old)`. Specifically, it is defined when x_new is very similar to x_old, meaning that their difference is very small. But accuracy only changes at all when a prediction changes from a 3 to a 7, or vice versa. So the problem is that a small change in weights from from x_old to x_new isn't likely to cause any prediction to change, so `(y_new - y_old)` will be zero. In other words, the gradient is zero almost everywhere.\n",
"\n",
"As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n",
"\n",
"> s: In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5) so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are zero or infinite, so useless to do an update of gradient descent.\n",
"\n",
@ -3872,7 +3893,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Parameters: the *weights* and *biases* of a model. The weights are the `w` in the equation `w*x+b`, and the biases are the `b` in that equation."
"> jargon: Parameters: The _weights_ and _biases_ of a model. The weights are the `w` in the equation `w*x+b`, and the biases are the `b` in that equation."
]
},
{
@ -4134,7 +4155,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: Methods in PyTorch that end in an underscore modify their object *in-place*. For instance, `bias.zero_()` sets all elements of the tensor `bias` to zero."
"> note: Methods in PyTorch that end in an underscore modify their object _in-place_. For instance, `bias.zero_()` sets all elements of the tensor `bias` to zero."
]
},
{
@ -5308,7 +5329,24 @@
"\n",
"A neural network contains a number of layers. Each layer is either linear or nonlinear. We generally alternate between these two kinds of layers in a neural network. Sometimes people refer to both a linear layer and its subsequent nonlinearity together as a single *layer*. Yes, this is confusing. Sometimes a nonlinearity is referred to as an activation function.\n",
"\n",
"TK: Table jargon recap"
"<<dljargon1>> contains the concepts related to SGD.\n",
"\n",
"```asciidoc\n",
"[[dljargon1]]\n",
".Deep learning vocabulary\n",
"[options=\"header\"]\n",
"|=====\n",
"| Term | Meaning\n",
"|**ReLU** | Funxtion that returns 0 for negatives numbers and doesn't change positive numbers\n",
"|**mini-batch** | A few inputs and labels gathered together in two big arrays\n",
"|**forward pass** | Applying the model to some input and computing the predictions\n",
"|**loss** | A value that represents how well (or badly) our model is doing\n",
"|**gradient** | The derivative of the loss with respect to some parameter of the model\n",
"|**backard pass** | Computing the gradients of the loss with respect to all model parameters\n",
"|**gradient descent** | Taking a step in the directions opposite to the gradients to make the model parameters a little bit better\n",
"|**learning rate** | The size of the step we take when applying SGD to update the paramters of the model\n",
"|=====\n",
"```"
]
},
{
@ -5406,31 +5444,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@ -31,33 +31,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## The magic of convolutions"
"In <<chapter_mnist_basics>> we learned how to create a neural network recognising images. We were able to achieve a bit over 98% accuracy at recognising threes from sevens. But we also saw that fastai's built in classes were able to get close to 100%. Let's start trying to close the gap.\n",
"\n",
"In this chapter, we will start by digging into what convolutions are and build a CNN from scratch. We will then study a range of techniques to improve training stability and learn all the tweaks the library usually applies for us to get great results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In <<chapter_mnist_basics>> we learned how to create a neural network recognising images. We were able to achieve a bit over 98% accuracy at recognising threes from sevens. But we also saw that fastai's built in classes were able to get close to 100%. Let's start trying to close the gap."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = untar_data(URLs.MNIST_SAMPLE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"Path.BASE_PATH = path"
"## The magic of convolutions"
]
},
{
@ -82,7 +65,7 @@
"\n",
"It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition — two operations which are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n",
"\n",
"A convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3x3 matrix in the top right of this image:"
"A convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3x3 matrix in the top right of <<basic_conv>>."
]
},
{
@ -120,6 +103,25 @@
"(because that's what fancy computer vision researchers call these). And we'll need an image, of course:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = untar_data(URLs.MNIST_SAMPLE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hide\n",
"Path.BASE_PATH = path"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -1520,6 +1522,13 @@
"Look at the shape of the result. If the original image has a height of `h` and a width of `w`, how many 3 by 3 windows can we find? As you see from the example, there are `h-2` by `w-2` windows, so the image we get as a result as a height of `h-2` and a witdh of `w-2`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We won't implement this function from scratch, using PyTorch's implementation (that is way faster than anything we could do in python) instead."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1723,6 +1732,13 @@
"The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallel. That is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these one at a time, we'll often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we'd be millions of times slower!) Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It would be nice to not lose those two pixels on each axis. The way we do that is to add *padding*, which is simply additional pixels added around the outside of our image. Most commonly, pixels of zeros are added. "
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1734,21 +1750,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It would be nice to not lose those two pixels on each axis. The way we do that is to add *padding*, which is simply additional pixels added around the outside of our image. Most commonly, pixels of zeros are added. With appropriate padding, we can ensure that the output activation map is the same size as the original image, which can make things a lot simpler when we construct our architectures."
"With appropriate padding, we can ensure that the output activation map is the same size as the original image, which can make things a lot simpler when we construct our architectures."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/chapter9_padconv.svg\" id=\"pad_conv\" caption=\"Padding with a convolution\" alt=\"Padding with a convolution\" width=\"600\">"
"<img src=\"images/chapter9_padconv.svg\" id=\"pad_conv\" caption=\"A convolution with padding\" alt=\"A convolution with padding\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a 5x5 input, and 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map:"
"With a 5x5 input, and 4x4 kernel, and 2 pixels of padding, we end up with a 6x6 activation map, as we can see in <<four_by_five_conv>>/"
]
},
{
@ -1764,7 +1780,7 @@
"source": [
"If we add a kernel of size `ks` by `ks` (with `ks` an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even number for `ks` would require a different amount of padding on the top/bottom, left/right, but in practice we almost never use an even filter size.\n",
"\n",
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application. This is known as a *stride 2* convolution. The most common kernel size in practice is 3x3, and the most common padding is 1. As you'll see, stride 2 convolutions are useful for decreasing the size of our outputs, and stride 1 convolutions are useful for adding layers without changing the output size."
"So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application as in <<three_by_five_conv>>. This is known as a *stride 2* convolution. The most common kernel size in practice is 3x3, and the most common padding is 1. As you'll see, stride 2 convolutions are useful for decreasing the size of our outputs, and stride 1 convolutions are useful for adding layers without changing the output size."
]
},
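{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make these size rules concrete, here is a quick check on random data (an illustrative sketch, not part of our MNIST pipeline) using PyTorch's `F.conv2d`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"img = torch.randn(1, 1, 28, 28)  # a batch of one 28x28 single-channel image\n",
"k = torch.randn(1, 1, 3, 3)      # one 3x3 kernel\n",
"(F.conv2d(img, k).shape,                       # 26x26: no padding loses ks-1 pixels\n",
" F.conv2d(img, k, padding=1).shape,            # 28x28: padding=ks//2 keeps the size\n",
" F.conv2d(img, k, padding=1, stride=2).shape)  # 14x14: stride 2 halves the grid"
]
},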
{
@ -1785,14 +1801,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### CNNs from different viewpoints"
"Let's now have a look at how the pixel values of the result of our convolutions are computed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fast.ai student Matt Kleinsmith came up with the very clever idea of showing [CNNs from different viewpoints](https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c). In fact, it's so clever, and so helpful, we're going to show it here too!\n",
"### Understanding the convolution equations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To explain the math behing convolutions, fast.ai student Matt Kleinsmith came up with the very clever idea of showing [CNNs from different viewpoints](https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c). In fact, it's so clever, and so helpful, we're going to show it here too!\n",
"\n",
"Here's our 3x3 pixel *image*, with each *pixel* labeled with a letter:"
]
@ -1895,6 +1918,13 @@
"<img alt=\"Convolution as matrix multiplication\" width=\"683\" caption=\"Convolution as matrix multiplication\" id=\"conv_matmul\" src=\"images/att_00038.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we understand what a convolution is, let's use them to build a neural net."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1906,14 +1936,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Learning kernels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is no reason to believe that these particular edge filters are the most useful kernels for image recognition. Furthermore, we've seen that in later layers convolutional kernels become complex transformations of features from lower levels — we do not have a good idea of how to manually construct these.\n",
"There is no reason to believe that some particular edge filters are the most useful kernels for image recognition. Furthermore, we've seen that in later layers convolutional kernels become complex transformations of features from lower levels — we do not have a good idea of how to manually construct these.\n",
"\n",
"Instead, it would be best to learn the values of the kernels. We already know how to do this — SGD! In effect, the model will learn the features that are useful for classification.\n",
"\n",
@ -1931,7 +1954,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the basic neural network we had in <<chapter_mnist_basics>>:"
"Let's go back to the basic neural network we had in <<chapter_mnist_basics>>. It was defined like this:"
]
},
{
@ -2368,6 +2391,13 @@
"So what happened here is that our stride 2 conv halved the *grid size* from `14x14` to `7x7`, and we doubled the *number of filters* from 8 to 16, resulting in no overall change in the amount of computation. If we left the number of channels the same in each stride 2 layer, the amount of computation being done in the net would get less and less as it gets deeper. But we know that the deeper layers have to compute semantically rich features (such as eyes, or fur), so we wouldn't expect that doing *less* compute would make sense."
]
},
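{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can sanity-check this claim with some quick arithmetic: a 3x3 conv layer does roughly (number of output cells) x (number of filters) x (3x3 x input channels) multiplications. Here is that count for a hypothetical pair of consecutive layers matching the numbers above (assuming 4 input channels for the earlier layer):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mults_before = 14*14 * 8  * (3*3*4)  # 14x14 grid, 4 channels in, 8 filters out\n",
"mults_after  = 7*7   * 16 * (3*3*8)  # after stride 2: 7x7 grid, 8 in, 16 out\n",
"mults_before, mults_after            # (56448, 56448): the same amount of work"
]
},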
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to think of this is based on *receptive fields*."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -2379,7 +2409,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to think of this is based on *receptive fields*. The \"receptive field\" is the area of an image that is involved in the calculation of a layer. On the book website, you'll find an Excel spreadsheet called `conv-example.xlsx` that shows the calculation of two stride 2 convolutional layers using an MNIST digit. Each layer has a single kernel. If we click on one of the cells in the *conv2* section, which shows the output of the second convolutional layer, and click *trace precendents*, we see this:"
"The \"receptive field\" is the area of an image that is involved in the calculation of a layer. On the book website, you'll find an Excel spreadsheet called `conv-example.xlsx` that shows the calculation of two stride 2 convolutional layers using an MNIST digit. Each layer has a single kernel. If we click on one of the cells in the *conv2* section, which shows the output of the second convolutional layer, and click *trace precendents*, we see this:"
]
},
{
@ -2412,6 +2442,13 @@
"As you see from this example, the deeper we are in the network (specifically, the more stride 2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer. So we know now that in the deeper layers of the network, we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of seeing the same thing we saw in the previous section: when we introduce a stride 2 conv in our network, we should also increase the number of channels."
]
},
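{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd prefer to compute this without a spreadsheet, a standard recurrence works: starting from a single output cell, each layer expands the field to `rf*stride + (ks-stride)`. Here is a small sketch of that calculation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def receptive_field(layers):\n",
"    \"Side length of the input area seen by one activation; `layers` is a list of (ks, stride)\"\n",
"    rf = 1\n",
"    for ks,stride in reversed(layers): rf = rf*stride + (ks-stride)\n",
"    return rf\n",
"\n",
"receptive_field([(3,2), (3,2)])  # 7: one cell of conv2 sees a 7x7 area of the input"
]
},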
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When writing this particular chapter, we had a lot of questions we needed answer to, to be able to explain to you those CNNs as best we could. Believe it or not, we found most of the answers on twitter. "
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -2488,6 +2525,13 @@
"Twitter is the main way we both stay up to date with interesting papers, software releases, and other deep learning news. For making connections with the deep learning community, we recommend getting involved both in the [fast.ai forums](https://forums.fast.ai) and Twitter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Up until now, we have only shown you examples of pictures in black and white, with only one value per pixel. In practice, most colored images have three values per pixel to define is color."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -2582,7 +2626,7 @@
"source": [
"We saw what the convolution operation was for one filter on one channel of the image (our examples were done on a square). A convolution layer will take an image with a certain number of channels (3 for the first layer for regular RGB color images) and output an image with a different number of channels. Like our hidden size that represented the numbers of neurons in a linear layer, we can decide to have has many filters as we want, and each of them will be able to specialize, some to detect horizontal edges, other to detect vertical edges and so forth, to give something like we studied in <<chapter_production>>.\n",
"\n",
"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the following example, the result of our conv layer on that window is $y_{R} + y_{G} + y_{B}$."
"On one sliding window, we have a certain number of channels and we need as many filters (we don't use the same kernel for all the channels). So our kernel doesn't have a size of 3 by 3, but `ch_in` (for channel in) by 3 by 3. On each channel, we multiply the elements of our window by the elements of the coresponding filter then sum the results (as we saw before) and sum over all the filters. In the example fiven by <<rgbconv>>, the result of our conv layer on that window is red + green + blue."
]
},
{
@ -2598,7 +2642,7 @@
"source": [
"So, in order to apply a convolution to a colour picture we require a kernel tensor with a matching size as the first axis. At each location, the corresponding parts of the kernel and the image patch are multiplied together.\n",
"\n",
"These are then all added together, to produce a single number, for each grid location, for each output feature:"
"These are then all added together, to produce a single number, for each grid location, for each output feature, as shown in <<rgbconv2>>."
]
},
{
@ -2618,7 +2662,7 @@
"\n",
"There are no special mechanisms required when setting up a CNN for training with color images. Just make sure your first layer as 3 inputs.\n",
"\n",
"There are lots of ways of processing color images. For instance, you can change them to black and white, or change from RGB to HSV color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
"There are lots of ways of processing color images. For instance, you can change them to black and white, or change from RGB to HSV (Hue, Saturation, and Value) color space, and so forth. In general, it turns out experimentally that changing the encoding of colors won't make any difference to your model results, as long as you don't lose information in the transformation. So transforming to black and white is a bad idea, since it removes the color information entirely (and this can be critical; for instance a pet breed may have a distinctive color); but converting to HSV generally won't make any difference.\n",
"\n",
"Now you know what those pictures in <<chapter_intro>> of \"what a neural net learns\" from the Zeiler and Fergus paper mean! This is their picture of some of the layer 1 weights which we showed:"
]
@ -2634,7 +2678,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is taking the 3 slices of the convolutional kernel, for each output feature, and displaying them as images. We can see that even although the creators of the neural net never explicitly created kernels to find edges, for instance, the neural net automatically discovered these features using SGD."
"This is taking the 3 slices of the convolutional kernel, for each output feature, and displaying them as images. We can see that even although the creators of the neural net never explicitly created kernels to find edges, for instance, the neural net automatically discovered these features using SGD.\n",
"\n",
"Now let's see how we can train those CNNs, and show you all the techniques fastai uses behind the hood for efficient training."
]
},
{
@ -2781,7 +2827,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last chapter, but with one tweak: we'll use more activations.\n",
"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last chapter, but with one tweak: we'll use more activations. Since we have more numbers to differentiate, it's likely we will need to learn more filters.\n",
"\n",
"As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer then every layer after that will end up twice as big as the previous version as well.\n",
"\n",
@ -2950,7 +2996,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the problems get worse towards the end of the network, as the instability and zero activations compound over layers."
"As expected, the problems get worse towards the end of the network, as the instability and zero activations compound over layers. The first thing we can do to make training more stable is to increase the batch size."
]
},
{
@ -3077,7 +3123,7 @@
"\n",
"Then, once we have found a nice smooth area for our parameters, we then want to find the very best part of that area, which means we have to bring out learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models, and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
"\n",
"Later in this book we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rate, we use less momentum, and we use more again in the annealing phase.\n",
"In <<chapter_accel_sgd>> we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rate, we use less momentum, and we use more again in the annealing phase.\n",
"\n",
"We can use 1cycle training in fastai by calling `fit_one_cycle`:"
]
@ -3173,11 +3219,11 @@
"source": [
"Smith's original 1cycle paper used a linear warm-up and linear annealing. As you see above, we adapted the approach in fastai by combining it with another popular approach: cosine annealing. `fit_one_cycle` provides the following parameters you can adjust:\n",
"\n",
"- `lr_max`: The highest learning rate that will be used (this can also be a list of learning rates for each layer group, or a python `slice` object containing the first and last layer group learning rates)\n",
"- `div`: How much to divide `lr_max` by to get the starting learning rate\n",
"- `div_final`: How much to divide `lr_max` by to get the ending learning rate\n",
"- `pct_start`: What % of the batches to use for the warmup\n",
"- `moms`: A tuple `(mom1,mom2,mom3)` where mom1 is the initial momentum, mom2 is the minimum momentum, and mom3 is the final momentum.\n",
"- `lr_max`:: The highest learning rate that will be used (this can also be a list of learning rates for each layer group, or a python `slice` object containing the first and last layer group learning rates)\n",
"- `div`:: How much to divide `lr_max` by to get the starting learning rate\n",
"- `div_final`:: How much to divide `lr_max` by to get the ending learning rate\n",
"- `pct_start`:: What % of the batches to use for the warmup\n",
"- `moms`:: A tuple `(mom1,mom2,mom3)` where mom1 is the initial momentum, mom2 is the minimum momentum, and mom3 is the final momentum.\n",
"\n",
"Let's take a look at our layer stats again:"
]
@ -3267,6 +3313,13 @@
"<img src=\"images/colorful_summ.png\" id=\"colorful_summ\" caption=\"Summary of 'colorful dimension'\" alt=\"Summary of 'colorful dimension'\" width=\"800\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Add an explanation of the picture"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -3302,7 +3355,7 @@
"source": [
"This shows a classic picture of \"bad training\". We start with nearly all activations at zero--that's what we see at the far left, with nearly all the left hand side dark blue; the bright yellow at the bottom are the near-zero activations. Then over the first few batches we see the number of non-zero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and then it collapses again. After repeating a few times, eventually we see a spread of activations throughout the range.\n",
"\n",
"It's much better if training can be smooth from the start. The cycles of exponential increase and then collapse that we see above tend to result in a lot of near-zero activations, resulting in slow training, and poor final results."
"It's much better if training can be smooth from the start. The cycles of exponential increase and then collapse that we see above tend to result in a lot of near-zero activations, resulting in slow training, and poor final results. One way to solve this problem is to use Batch normalization."
]
},
{
@ -3316,9 +3369,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To fix this, we need to both fix the initial large percentage of near-zero activations, and then try to maintain a good distribution of activations throughout training. In the abstract, they describe just the problem that we've seen:\n",
"To fix the slow training and poor final results we ended up with in the previous section, we need to both fix the initial large percentage of near-zero activations, and then try to maintain a good distribution of activations throughout training.\n",
"\n",
"Sergey Ioffe and Christian Szegedy showed a solution to this problem in the 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). \n",
"Sergey Ioffe and Christian Szegedy showed a solution to this problem in the 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). In the abstract, they describe just the problem that we've seen:\n",
"\n",
"> : \"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization... We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.\"\n",
"\n",
@ -3326,7 +3379,7 @@
"\n",
"> : \"...making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.\"\n",
"\n",
"The paper caused great excitement as soon as it was released, because they showed this chart, which clearly demonstrated that batch normalization could train a model that was even more accurate than the current state of the art (the *inception* architecture), around 5x faster:"
"The paper caused great excitement as soon as it was released, because they showed the chart in <<batchnorm>>, which clearly demonstrated that batch normalization could train a model that was even more accurate than the current state of the art (the *inception* architecture), around 5x faster:"
]
},
{
@ -3615,7 +3668,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK add last section to conclusion"
"TK add some takeways from the last section to the conclusion"
]
},
{

View File

@ -10,7 +10,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -37,7 +37,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, when it comes to blogging, it seems like you have to make a decision: either use a platform that makes it easy, but subjects you and your readers to advertisements, pay walls, and fees, or spend hours setting up your own hosting and weeks learning about all kinds of intricate details. Perhaps the biggest benefit to the \"do-it-yourself\" approach is that you really owning your own posts, rather than being at the whim of a service provider, and their decisions about how to monetize your content in the future.\n",
"Unfortunately, when it comes to blogging, it seems like you have to make a difficult decision: either use a platform that makes it easy, but subjects you and your readers to advertisements, pay walls, and fees, or spend hours setting up your own hosting and weeks learning about all kinds of intricate details. Perhaps the biggest benefit to the \"do-it-yourself\" approach is that you really owning your own posts, rather than being at the whim of a service provider, and their decisions about how to monetize your content in the future.\n",
"\n",
"It turns out, however, that you can have the best of both worlds! You can host on a platform called [GitHub Pages](https://pages.github.com/), which is free, has no ads or pay wall, and makes your data available in a standard way such that you can at any time move your blog to another host. But all the approaches Ive seen to using GitHub Pages have required knowledge of the command line and arcane tools that only software developers are likely to be familiar with. For instance, GitHub's [own documentation](https://help.github.com/en/github/working-with-github-pages/creating-a-github-pages-site-with-jekyll) on setting up a blog requires installing the Ruby programming language, using the git command line tool, copying over version numbers, and more. 17 steps in total!\n",
"\n",
@ -63,7 +63,9 @@
"\n",
"> Important: Note that if you don't use username.github.io as the name, it won't work!\n",
"\n",
"Once youve entered that, and any description you like, click on \"create repository from template\". You have the choice to make the repository \"private\" but since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won't be a problem for you."
"Once youve entered that, and any description you like, click on \"create repository from template\". You have the choice to make the repository \"private\" but since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won't be a problem for you.\n",
"\n",
"Now, let's set up your homepage!"
]
},
{
@ -132,18 +134,14 @@
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true
},
"metadata": {},
"source": [
"### Creating posts"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"Now youre ready to create your first post. All your posts will go in\n",
"the \"\\_posts\" folder. Click on that now, and then click on the \"create\n",
@ -166,9 +164,7 @@
},
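{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, if you ever want to create post filenames programmatically rather than through the web interface, here is a small, hypothetical Python sketch of the date-prefixed naming pattern that files in the \"_posts\" folder use (the title shown is an illustrative assumption):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import date\n",
"\n",
"title = 'my-first-post'  # hypothetical title; use hyphens instead of spaces\n",
"fname = f'{date.today():%Y-%m-%d}-{title}.md'\n",
"print(fname)  # something like 2020-03-05-my-first-post.md"
]
},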
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"As before, you can click on the \"preview\" button to see how your\n",
"markdown formatting will look.\n",
@ -183,9 +179,7 @@
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"Have a look at your blog homepage again, and you will see that this post\n",
"has now appeared! (Remember that you will need to wait a minute or so\n",
@ -203,9 +197,7 @@
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true
},
"metadata": {},
"source": [
"In GitHub, nothing actually changes until you commit— including deleting\n",
"a file! So, after you click the trash icon, scroll down to the bottom\n",
@ -223,6 +215,13 @@
"<img width=\"400\" src=\"images/fast_template/image14.png\" alt=\"Screenshot showing how to upload new files\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how to do all of this directly from your computer."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -256,7 +255,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you haven't used git before, GitHub Desktop and a blog is a great way to get started. As you'll discover, it's a fundamental tool used by most data scientists."
"If you haven't used git before, GitHub Desktop and a blog is a great way to get started. As you'll discover, it's a fundamental tool used by most data scientists. Another tool that we hope you now love too is Jupyter Notebooks. And there is a way to write your blog directly with it!"
]
},
{
@ -293,31 +292,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,