This commit is contained in:
Jeremy Howard 2020-03-03 15:04:23 -08:00
parent 38096d44a7
commit 1202b218f4
9 changed files with 740 additions and 423 deletions


@ -34,7 +34,7 @@
"\n",
"> s: This is an aside from Sylvain!\n",
"\n",
"You will see bits in the text like this: \"TK: figure showing bla here\" or \"TK: expand introduction\". \"TK\" is used to make places where we know something is missing and we will add them. This does not alter any of the core content as those are usually small parts/figures that are relatively independent form the flow and self-explanatory.\n",
"You may see bits in the text like this: \"TK: figure showing bla here\" or \"TK: expand introduction\". \"TK\" is used to make places where we know something is missing and we will add them. This does not alter any of the core content as those are usually small parts/figures that are relatively independent form the flow and self-explanatory.\n",
"\n",
"Throughout the book, the version of the fastai library used is version 2. That version is not yet officially released and is for now separate from the main project. You can find it [here](https://github.com/fastai/fastai2)."
]
@ -57,7 +57,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Add an introduction here. Todo when preface is settled. Maybe the `Deep learning is for everyone` can be this intro."
"Hello, and thank you for letting us join you on your deep learning journey, however far along that you may be! If you are a complete beginner to deep learning and machine learning, then you are most welcome here. Our only expectation is that you already know how to code, preferably in Python.\n",
"\n",
"> note: If you don't have any experience coding, that's OK too! The first three chapters have been explicitly written in a way that will allow executives, product managers, etc to understand the most important things they'll need to know about deep learning. When you see bits of code in the text, try to look over them to get an intuitive sense of what they're doing. We'll explain them line by line. The details of the syntax are not nearly as important as the high level understanding of what's going on.\n",
"\n",
"If you are already a confident deep learning practitioner, then you will also find a lot here. In this book we will be showing you how to achieve world-class results, including techniques from the latest research. As we will show, this doesn't require advanced mathematical training, or years of study. It just requires a bit of common sense and tenacity."
]
},
{
@ -71,12 +75,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Hello, and thank you for letting us join you on your deep learning journey, however far along that you may be! If you are a complete beginner to deep learning and machine learning, then you are most welcome here. Our only expectation is that you already know how to code, preferably in Python.\n",
"\n",
"> note: If you don't have any experience coding, that's OK too! The first three chapters have been explicitly written in a way that will allow executives, product managers, etc to understand the most important things they'll need to know about deep learning. When you see bits of code in the text, try to look over them to get an intuitive sense of what they're doing. We'll explain them line by line. The details of the syntax are not nearly as important as the high level understanding of what's going on.\n",
"\n",
"If you are already a confident deep learning practitioner, then you will also find a lot here. In this book we will be showing you how to achieve world-class results, including techniques from the latest research. As we will show, this doesn't require advanced mathematical training, or years of study. It just requires a bit of common sense and tenacity.\n",
"\n",
"A lot of people assume that you need all kinds of hard-to-find stuff to get great results with deep learning, but, as you'll see in this book, those people are wrong. Here's a list of a few thing you **absolutely don't need** to do world-class deep learning:\n",
"\n",
"```asciidoc\n",
@ -98,7 +96,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Deep learning has power, flexibility, and simplicity. That's why we believe it should be applied across many disciplines. These include the social and physical sciences, the arts, medicine, finance, scientific research, and much more. To give a personal example, despite having no background in medicine, Jeremy started Enlitic, a company that uses deep learning algorithms to diagnose illness and disease. And Enlitic now does better than doctors in certain cases. TK Jeremy: Give an example\n",
"Deep learning has power, flexibility, and simplicity. That's why we believe it should be applied across many disciplines. These include the social and physical sciences, the arts, medicine, finance, scientific research, and much more. To give a personal example, despite having no background in medicine, Jeremy started Enlitic, a company that uses deep learning algorithms to diagnose illness and disease. Within months of starting the company, it was announced that their algorithm could identify malignent tumors [more accurately than radiologists](https://www.nytimes.com/2016/02/29/technology/the-promise-of-artificial-intelligence-unfolds-in-small-steps.html).\n",
"\n",
"Here's a list of some of the thousands of tasks that deep learning (or methods heavily using deep learning) is now the best in the world at:\n",
"\n",
@ -163,13 +161,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Perhaps the most pivotal work in neural networks in the last 50 years is the multi-volume *Parallel Distributed Processing*, released in 1986 by MIT Press. Chapter 1 lays out a similar hope to that shown by Rosenblatt:\n",
"Perhaps the most pivotal work in neural networks in the last 50 years is the multi-volume *Parallel Distributed Processing* (PDP), released in 1986 by MIT Press. Chapter 1 lays out a similar hope to that shown by Rosenblatt:\n",
"\n",
"> : _…people are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. …we will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain._ (Parallel distributed processing, chapter 1)\n",
"> : _…people are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. …we will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain._ (PDP, chapter 1)\n",
"\n",
"TK Jeremy: Tell the reader what the takeaways from this are in your own words, before you dive into the list of requirements.\n",
"\n",
"It defined \"Parallel Distributed Processing\" as requiring:\n",
"The premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had (at that point) been so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claim that the PDP approach is \"closer than other frameworks\" to how the brain works, and therefore it might be better able to handle these kinds of tasks. The approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined \"Parallel Distributed Processing\" as requiring:\n",
"\n",
"1. A set of *processing units*\n",
"1. A *state of activation*\n",
@ -180,7 +176,9 @@
"1. A *learning rule* whereby patterns of connectivity are modified by experience \n",
"1. An *environment* within which the system must operate\n",
"\n",
"We will learn in this book about how modern neural networks handle each of these requirements. In the 1980's most models were built with a second layer of neurons, thus avoiding the problem that had been identified by Minsky (this was their \"pattern of connectivity among units\", to use the framework above). And indeed, neural networks were widely used during the 80s and 90s for real, practical projects. However, again a misunderstanding of the theoretical issues held back the field. In theory, adding just one extra layer of neurons was enough to allow any mathematical model to be approximated with these neural networks, but in practice such networks were often too big and slow to be useful.\n",
"We will see in this book that modern neural networks handle each of these requirements.\n",
"\n",
"In the 1980's most models were built with a second layer of neurons, thus avoiding the problem that had been identified by Minsky (this was their \"pattern of connectivity among units\", to use the framework above). And indeed, neural networks were widely used during the 80s and 90s for real, practical projects. However, again a misunderstanding of the theoretical issues held back the field. In theory, adding just one extra layer of neurons was enough to allow any mathematical model to be approximated with these neural networks, but in practice such networks were often too big and slow to be useful.\n",
"\n",
"Although researchers showed 30 years ago that to get practical good performance you need to use even more layers of neurons, it is only in the last decade that this has been more widely appreciated. Neural networks are now finally living up to their potential, thanks to the understanding to use more layers as well as improved ability to do so thanks to improvements in computer hardware, increases in data availability, and algorithmic tweaks that allow neural networks to be trained faster and more easily. We now have what Rosenblatt had promised: \"a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control\". And you will learn how to build them in this book."
]
@ -292,7 +290,7 @@
"\n",
"Paul Lockhart, a Columbia math PhD, former Brown professor, and K-12 math teacher, imagines in the influential essay A Mathematician's Lament a nightmare world where music and art are taught the way math is taught. Children would not be allowed to listen to or play music until they have spent over a decade mastering music notation and theory, spending classes transposing sheet music into a different key. In art class, students study colours and applicators, but aren't allowed to actually paint until college. Sound absurd? This is how math is taughtwe require students to spend years doing rote memorization, and learning dry, disconnected *fundamentals* that we claim will pay off later, long after most of them quit the subject.\n",
"\n",
"Unfortunately, this is where many teaching resources on deep learning beginasking learners to follow along with the definition of the Hessian and theorems for the Taylor approximation of your loss function, without ever giving examples of actual working code. We're not knocking calculus. We love calculus and have even taught it at the college level, but we don't think it's the best place to start when learning deep learning!\n",
"Unfortunately, this is where many teaching resources on deep learning beginasking learners to follow along with the definition of the Hessian and theorems for the Taylor approximation of your loss functions, without ever giving examples of actual working code. We're not knocking calculus. We love calculus and have even taught it at the college level, but we don't think it's the best place to start when learning deep learning!\n",
"\n",
"In deep learning, it really helps if you have the motivation to fix your model to get it to do better. That's when you start learning the relevant theory. But you need to have the model in the first place. We teach almost everything through real examples. As we build out those examples, we go deeper and deeper, and we'll show you how to make your projects better and better. This means that you'll be gradually learning all the theoretical foundations you need, in context, in a way that you'll see why it matters and how it works.\n",
"\n",
@ -450,7 +448,7 @@
"source": [
"The best choice for GPU servers for use with this book change over time, as companies come and go, and prices change. We keep a list of our recommended options on the [book website](https://book.fast.ai/). So, go there now, and follow the instructions to get connected to a GPU deep learning server. Don't worry, it only takes about two minutes to get set up on most platforms, and many don't even require any payment, or even a credit card to get started.\n",
"\n",
"> A: My two cents: heed this advice! If you like computers you will be tempted to setup your own box. Beware! It is feasible but surprisingly involved and distracting. There is a good reason this book is not titled, _Everything you ever wanted to know about Ubuntu system administration, NVIDIA driver installation, apt-get, conda, pip, and Jupyter notebook configuration_. That would be a book of its own. Having designed and deployed our production machine learning infrastructure at work, I can testify it has its satisfactions but it is as unrelated to understanding models as maintaining an airplane is from flying one.\n",
"> A: My two cents: heed this advice! If you like computers you will be tempted to setup your own box. Beware! It is feasible but surprisingly involved and distracting. There is a good reason this book is not titled, _Everything you ever wanted to know about Ubuntu system administration, NVIDIA driver installation, apt-get, conda, pip, and Jupyter notebook configuration_. That would be a book of its own. Having designed and deployed our production machine learning infrastructure at work, I can testify it has its satisfactions but it is as unrelated to modelling as maintaining an airplane is to flying one.\n",
"\n",
"Each option shown on the book website includes a tutorial; after completing the tutorial, you will end up with a screen looking like <<notebook_init>>."
]
@ -624,6 +622,20 @@
"learn.fine_tune(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You will probably not see exactly the same results that are in the book. There are a lot of sources of small random variation involved in training models. We generally see an error rate of well less than 0.02 in this example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> important: Depending on your network speed, it might take a few minutes to download the pretrained model and dataset. Running `fine_tune` might take a minute or so. Often models in this book take a few minutes to train, as will your own models. So it's a good idea to come up with good techniques to make the most of this time. For instance, keep reading the next section while your model trains, or open up another notebook and use it for some coding experiments."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -705,7 +717,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So, how do we know if this model is any good? You can see the error rate (proportion of images that were incorrectly identified) printed as the second last column of the table. As you can see, the model is nearly perfect, even although the training time was only a few seconds (not including the one-time downloading of the dataset and pretrained model). In fact, the accuracy you've achieved already is far better than anybody had ever achieved just 10 years ago!\n",
"So, how do we know if this model is any good? You can see the error rate printed as the last column of the table. This is the proportion of images that were incorrectly identified. As you can see, the model is nearly perfect, even though the training time was only a few seconds (not including the one-time downloading of the dataset and the pretrained model). In fact, the accuracy you've achieved already is far better than anybody had ever achieved just 10 years ago!\n",
"\n",
"Finally, let's check that this model actually works. Go and get a photo of a dog, or a cat; if you don't have one handy, just search Google images and download an image that you find there. Now execute the cell with `uploader` defined. It will output a button you can click, so you can select the image you want to classify."
]
@ -812,7 +824,7 @@
"source": [
"Well that was impressive--we trained a model! But... what does that actually *mean*? What did we actually *do*?\n",
"\n",
"To answer those questions, we need to step up a level from *deep learning* and discuss the more general *machine learning*. *Machine learning* is (like regular coding) a way to get computers to complete a specific task. But how would you use regular coding to do what we just did in the last section: recognize dogs vs cats in photos? We would have to write down for the computer the exact steps necessary to complete the task.\n",
"To answer those questions, we need to zoom out a level from *deep learning* and discuss the more general *machine learning*. *Machine learning* is, like regular programming, a way to get computers to complete a specific task. But how would you use regular programming to do what we just did in the last section: recognize dogs vs cats in photos? We would have to write down for the computer the exact steps necessary to complete the task.\n",
"\n",
"Normally, it's easy enough for us to write down the steps to complete a task when we're writing a program. We just think about the steps we'd take if we had to do the task by hand, and then we translate them into code. For instance, we can write a function that sorts a list. In general, we write a function that looks something like <<basic_program>> (where *inputs* might be an unsorted list, and *results* a sorted list)."
]
@ -926,7 +938,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand this statement, we need to understand what Samuel means by a *weight assignment*. To do so, we need to change our basic program model of <<basic_program>>, and replace it with something like <<weight_assignment>> (where *inputs* might be the pixels of a photo, and *results* might be the word \"dog\" or \"cat\"):"
"There a number of powerful concepts embedded in this short statement: \n",
"\n",
"- the idea of a \"weight assignment\" \n",
"- the fact that every weight assignment has some \"actual performance\"\n",
"- the requirement that there is an \"automatic means\" of testing that performance, \n",
"- and last, that there is a \"mechanism\" (i.e., another automatic process) for improving the performance by changing the weight assignments.\n",
"\n",
"Let us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.\n",
"\n",
"Weights are just variables, and a weight assignment is a particular choice of values for those variables. The program's inputs are values that it processes in other to products its results -- for instance, taking image pixels as inputs, and returning the classification \"dog\" as a result. But the program's weight assignments are other values which define how the program will operate.\n",
"\n",
"Since they will affect the program they are in a sense another kind of input, so we will update our basic picture of <<basic_program>> and replace it with <<weight_assignment>> in order to take this into account:"
]
},
{
@ -1019,11 +1042,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have not only our inputs, but something else going into our box: the *weights* (as Samuel called them--in this book however we'll be using the term *parameters*, because in deep learning *weights* refers to a particular type of parameter, as you'll learn). And we've changed the name of our box from *program* to *model*. The *model* is a very special kind of program: it's one that can do *many different things*, depending on the *weights*. It can be implemented in many different ways. For instance, in Samuel's checkers program, different values of the weights would result in different checkers-playing strategies. Each specific choice of values for the weights is what Samuel called a *weight assignment*.\n",
"We've changed the name of our box from *program* to *model*. This is to follow modern terminology and to reflect that the *model* is a special kind of program: it's one that can do *many different things*, depending on the *weights*. It can be implemented in many different ways. For instance, in Samuel's checkers program, different values of the weights would result in different checkers-playing strategies. \n",
"\n",
"Next, he said we need an *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*. In the case of his checkers program, that would involve having a model with one set of weights play against another with a different set, and seeing which one won.\n",
"(By the way, what Samuel called *weights* are most generally refered to as model *parameters* these days, in case you have encountered that term. The term *weights* is reserved for a particular type of model parameter.)\n",
"\n",
"Finally, he says we need *a mechanism for altering the weight assignment so as to maximize the performance*. For instance, he could look at the difference in weights between the winning model and the losing model, and adjust the weights a little further in the winning *direction*. We can now see why he said that such a procedure *could be made entirely automatic and... a machine so programmed would \"learn\" from its experience*.\n",
"Next, he said we need an *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*. In the case of his checkers program, the \"actual performance\" of a model would be how well it plays. And you could automatically test the performance of two models by setting them to play against each other, and see which one usually wins.\n",
"\n",
"Finally, he says we need *a mechanism for altering the weight assignment so as to maximize the performance*. For instance, we could look at the difference in weights between the winning model and the losing model, and adjust the weights a little further in the winning *direction*.\n",
"\n",
"We can now see why he said that such a procedure *could be made entirely automatic and... a machine so programed would \"learn\" from its experience*. Learning would become entirely automatic when the adjustment of the weight was also automatic -- when instead of us improving a model by adjusting its weights, we had and automated mechanism that produced adjustments based on performance.\n",
"\n",
"<<training_loop>> shows the full picture of Samuel's idea of training a machine learning model."
]
@ -1140,10 +1167,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, the *results* for a checkers model are the moves that are made, and the *performance*\n",
"is the win or loss (possibly also including the number of moves the game lasted).\n",
"Notice the distinction between the model's *results* (e.g., the moves in a checkers game) and its *performance* (e.g., whether it wins the game, or how quickly it wins). \n",
"\n",
"Note that once the model is trained, we can think of the weights as being *part of the model*, since we're not varying them any more. Therefore actually *using* a model after it's trained looks like <<using_model>>."
"Also note that once the model is trained -- that is, once we've chosen our final, best, favorite weight assignment -- then we can think of the weights as being *part of the model*, since we're not varying them any more.\n",
"\n",
"Therefore actually *using* a model after it's trained looks like <<using_model>>."
]
},
{
@ -1245,23 +1273,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's not too hard to imagine what the model might look like for a checkers program. There might be a range of checkers strategies encoded, and some kind of search mechanism, and then the weights could vary how strategies are selected, what parts of the board are focused on during a search, and so forth. But it's not at all obvious what the model might look like for an image recognition program.\n",
"It's not too hard to imagine what the model might look like for a checkers program. There might be a range of checkers strategies encoded, and some kind of search mechanism, and then the weights could vary how strategies are selected, what parts of the board are focused on during a search, and so forth. But it's not at all obvious what the model might look like for an image recognition program, or for understanding text, or for many other interestings problems we might imagein.\n",
"\n",
"What we need is some kind of function that is so flexible, that it could be used to solve any given problem, just by varying its weights. Amazingly enough, this function actually exists! It's called the *neural network*. A mathematical proof called the *universal approximation theorem* shows that this function can solve any problem to any level of accuracy. In addition, there is a completely general way to update the weights of a neural network, to make it improve at any given task. This is called *stochastic gradient descent* (SGD). We'll see how neural networks and SGD work in detail later in this book, as well as explaining the universal approximation theorem. For now, however, we will instead use Samuel's own words: *We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would \"learn\" from its experience.*"
"What we would like is some kind of function that is so flexible that it could be used to solve any given problem, just by varying its weights. Amazingly enough, this function actually exists! It's the neural network, which we already discussed. That is, if you regard a neural network as a mathematical function, it turns out to be a function which is extremely flexible depending on its weights. A mathematical proof called the *universal approximation theorem* shows that this function can solve any problem to any level of accuracy, in theory. The fact that neural networks are so flexible means that, in practice, they are often a suitable kind of model, and you can focus your effort on the process of training them, that is, of finding good weight assignments.\n",
"\n",
"But what about that process? One could imagine that you might need to find a new \"mechanism\" for automatically updating weight for every problem. This would be laborious. What we'd like here as well is a completely general way to update the weights of a neural network, to make it improve at any given task. Conveniently, this also exists!\n",
"\n",
"This is called *stochastic gradient descent* (SGD). We'll see how neural networks and SGD work in detail later in this book, as well as explaining the universal approximation theorem. For now, however, we will instead use Samuel's own words: *We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programed would \"learn\" from its experience.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> J: Don't worry, neither SGD nor neural nets are mathematically complex. In fact, I'll tell you *exactly* how they work right now! In a neural net, we take the input (e.g. the pixels of an image), multiply it by some (initially random) numbers (the \"weights\" or \"parameters\"), and add them up. We do that a few times with different weights to get a few values. We then replace all the negative numbers with zeros. Those two steps are called a *layer*. Then we repeat those two steps a few times, creating more *layers*. Finally, we add up the values. That's it: a neural net! Then we compare the value that comes out to our target (e.g. we might decide \"dog\" is `1` and \"cat\" is `0`), and calculate the *derivative* of the error with regards to the models weights (except we don't have to do it ourselves; it's entirely automated by PyTorch). This tells us how much each weight impacted the loss. We multiply that by a small number (around 0.01, normally), and subtract it from the weights. We repeat this process a few times for every input. That's it: the entirety of creating a training a neural net! In the rest of this book we'll learn about *how* and *why* this works, along with some tricks to speed it up and make it more reliable, and how to implement it in fastai and PyTorch."
"> J: Don't worry, neither SGD nor neural nets are mathematically complex. In fact, I'll tell you *exactly* how they work right now! In a neural net, we take the input (e.g. the pixels of an image), multiply it by some (initially random) numbers (the \"weights\" or \"parameters\"), and add them up. We do that a few times with different weights to get a few values. We then replace all the negative numbers with zeros. Those two steps are called a *layer*. Then we repeat those two steps a few times, creating more *layers*. Finally, we add up the values. That's it: a neural net! Then we compare the value that comes out to our target (e.g. we might decide \"dog\" is `1` and \"cat\" is `0`), and calculate the *derivative* of the error with regards to the models weights (except we don't have to do it ourselves; it's entirely automated by PyTorch). This tells us how much each weight impacted the loss. We multiply that by a small number (around 0.01, normally), and subtract it from the weights. We repeat this process a few times for every input. That's it: the entirety of creating a training a neural net! In the rest of this book we'll learn about *how* and *why* this works, along with some tricks to speed it up and make it more reliable, and how to implement it in fastai and PyTorch. *(TK AG: Jeremy, I think we should cut this aside entirely. There are already probably too many parenthetical notes in this chapter which risk obscuring the thread of explanation, and this one is so terse I fear it is more likely to confuse than to reassure.)*"
]
},
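{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that recipe concrete, here is a minimal sketch of a few SGD steps in plain PyTorch, on a tiny made-up problem (just to show the shape of the idea; fastai handles all of this for us):\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"weights = torch.randn(3, requires_grad=True)    # initially random parameters\n",
"inputs  = torch.randn(100, 3)                   # some toy data: 100 examples, 3 features\n",
"targets = inputs @ torch.tensor([2., -1., 0.5]) # toy labels we'd like to recover\n",
"\n",
"for _ in range(20):\n",
"    preds = (inputs * weights).sum(dim=1)       # the \"model\": multiply and add up\n",
"    loss = ((preds - targets) ** 2).mean()      # how wrong are we?\n",
"    loss.backward()                             # PyTorch computes the derivatives for us\n",
"    with torch.no_grad():\n",
"        weights -= 0.01 * weights.grad          # step each weight a little downhill\n",
"        weights.grad.zero_()\n",
"```"
]
},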
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now try to fit this into Samuel's framework. Our inputs are the images; our weights are the weights in the neural net; our model is a neural net; our results are the values that are calculated by the neural net. So now we just need some *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*. Well that's easy enough: we can see how accurate our model is at predicting the correct answers! So put this all together, and we have an image recognizer!"
"In other words, to recap, a neural network is a particular kind of machine learning model, which fits right in to Samuel's original conception. Neural networks are special because they are highly flexible, which means they can solve an unusually range of problems just by finding the right weights. This is powerful, because stochastic gradient descent provides us a way to find those weight values automatically.\n",
"\n",
"Let's now try to fit our image classification problem into Samuel's framework.\n",
"\n",
"Our inputs, those are the images. Our weights, those are the weights in the neural net. Our model is a neural net. Ou results those are the values that are calculated by the neural net.\n",
"\n",
"So now we just need some *automatic means of testing the effectiveness of any current weight assignment in terms of actual performance*. Well that's easy enough: we can see how accurate our model is at predicting the correct answers! So put this all together, and we have an image recognizer."
]
},
{
@ -1275,13 +1313,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In deep learning we use specific terminology for these pieces:\n",
"Our picture is almost complete.\n",
"\n",
"- The functional form of the *model* is called its *architecture* ;\n",
"All that remains is to add this last concept, of measuring a model's performance by comparing wit the correct answer, and to update some of its terminology to match the usage of 2020 instead of 1961.\n",
"\n",
"Here is the modern deep learning terminology for all the pieces we have discussed:\n",
"\n",
"- The functional form of the *model* is called its *architecture* (but be careful--sometimes people use *model* as a synonym of *architecture*, so this can get confusing) ;\n",
"- The *weights* are called *parameters* ;\n",
"- The *predictions* are calculated from the *independent variables*, which is the *data* not including the *labels* ; \n",
"- The *results* of the model are called *predictions* ;\n",
"- The measure of *performance* is called the *loss* (or *cost* or *error*);\n",
"- The loss depends not only on the predictions, but also the correct *labels* (or *targets*), e.g. \"dog\" or \"cat\".\n",
"- The measure of *performance* is called the *loss*;\n",
"- The loss depends not only on the predictions, but also the correct *labels* (also known as *targets* or *dependent variable*), e.g. \"dog\" or \"cat\".\n",
"\n",
"After making these changes, our diagram in <<training_loop>> looks like <<detailed_loop>>."
]
@ -1409,11 +1452,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now see some critically important things about training a deep learning model:\n",
"We can now see some fundamental things about training a deep learning model:\n",
"\n",
"- A model can not be created without data;\n",
"- A model can only learn to operate on the patterns seen in the input data used to train it;\n",
"- This learning approach only creates *predictions*, not recommended *actions*;\n",
"- A model cannot be created without data ;\n",
"- A model can only learn to operate on the patterns seen in the input data used to train it ;\n",
"- This learning approach only creates *predictions*, not recommended *actions* ;\n",
"- It's not enough to just have examples of input data; we need *labels* for that data too (e.g. pictures of dogs and cats aren't enough to train a model; we need a label for each one, saying which ones are dogs, and which are cats).\n",
"\n",
"Generally speaking, we've seen that most organizations that think they don't have enough data, actually mean they don't have enough *labeled* data. If any organization is interested in doing something in practice with a model, then presumably they have some inputs they plan to run their model against. And presumably they've been doing that some other way for a while (e.g. manually, or with some heuristic program), so they have data from those processes! For instance, a radiology practice will almost certainly have an archive of medical scans (since they need to be able to check how their patients are progressing over time), but those scans may not have structured labels containing a list of diagnoses or interventions (since radiologists generally create free text natural language reports, not structured data). We'll be discussing labeling approaches a lot in this book, since it's such an important issue in practice.\n",
@ -1558,22 +1601,35 @@
"learn = cnn_learner(dls, resnet34, metrics=error_rate)\n",
"```\n",
"\n",
"The fifth line tells fastai to create a *convolutional neural network* (CNN), and selects what *architecture* to use (i.e. what kind of model to create), what data we want to train it on, and what *metric* to use. A CNN is the current state of the art approach to creating computer vision models. We'll be learning all about how they work in this book. Their structure is inspired by how the human vision system works.\n",
"The fifth line tells fastai to create a *convolutional neural network* (CNN), and selects what *architecture* to use (i.e. what kind of model to create), what data we want to train it on, and what *metric* to use. \n",
"\n",
"Why a CNN? A CNN is the current state of the art approach to creating computer vision models. We'll be learning all about how they work in this book. Their structure is inspired by how the human vision system works.\n",
"\n",
"There are many different architectures in fastai, which we will be learning about in this book, as well as discussing how to create your own. Most of the time, however, picking an architecture isn't a very important part of the deep learning process. It's something that academics love to talk about, but in practice it is unlikely to be something you need to spend much time on. There are some standard architectures that work most of the time, and in this case we're using one called _ResNet_ that we'll be learning a lot about during the book; it is both fast and accurate for many datasets and problems. The \"34\" in `resnet34` refers to the number of layers in this variant of the architecture (other options are \"18\", \"50\", \"101\", and \"152\"). Models using architectures with more layers take longer to train, and are more prone to overfitting (i.e. you can't train them for as many epochs before the accuracy on the validation set starts getting worse). On the other hand, when using more data, they can be quite a bit more accurate.\n",
"\n",
"A *metric* is a function that is called to measure how good the model is, using the validation set, and will be printed at the end of each *epoch*. In this case, we're using `error_rate`, which is a function provided by fastai which does just what it says: tells you what percentage of images in the validation set are being classified incorrectly. Another common metric for classification is `accuracy` (which is just `1.0 - error_rate`). fastai provides many more, which will be discussed throughout this book."
"What is a metric? A *metric* is a function that measures quality of the model's predictions using the validation set, and will be printed at the end of each *epoch*. In this case, we're using `error_rate`, which is a function provided by fastai which does just what it says: tells you what percentage of images in the validation set are being classified incorrectly. Another common metric for classification is `accuracy` (which is just `1.0 - error_rate`). fastai provides many more, which will be discussed throughout this book.\n",
"\n",
"The concept of a metric may remind you of loss, but there is an important distinction. The entire purpose of loss was to define a \"measure of performance\" that the training system could use to update weights automatically. In other words, a good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption. So a good metric is one that is easy for you to understand, and that hews as closely as possible to what you want the model to do. At times, you might decide that the loss function is a suitable metirc, but that is not necessarily the case."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`cnn_learner` also has a parameter `pretrained`, which defaults to `True` (so it's used in this case), which sets the weights in your model to values that have already been trained by experts to recognize a thousand different categories across 1.3 million photos (using the famous *ImageNet* dataset). A model that has weights that have already been trained on some other dataset is called a *pretrained model*. You should nearly always use a pretrained model, because it means that your model, before you've even shown it any of your data, is already very capable. And, as you'll see, in a deep learning model many of these capabilities are things you'll need, almost regardless of the details of your project (such as edge, gradient, and color detection).\n",
"`cnn_learner` also has a parameter `pretrained`, which defaults to `True` (so it's used in this case), which sets the weights in your model to values that have already been trained by experts to recognize a thousand different categories across 1.3 million photos (using the famous *ImageNet* dataset). A model that has weights that have already been trained on some other dataset is called a *pretrained model*. You should nearly always use a pretrained model, because it means that your model, before you've even shown it any of your data, is already very capable. And, as you'll see, in a deep learning model many of these capabilities are things you'll need, almost regardless of the details of your project. For instance, parts of pretrained models will handle edge-, gradient-, and color-detection, which are needed for many tasks.\n",
"\n",
"When using a pretrained model, `cnn_learner` will remove the last layer, since that is always specifically customized to the original training task (i.e. ImageNet dataset classification), and replace it with one or more new layers with randomized weights, of an appropriate size for the dataset you are working with. This last part of the model is known as the `head`.\n",
"\n",
"Using pretrained models is the *most* important method we have to allow us to train more accurate models, more quickly, with less data, and less time and money. You might think that would mean that using pretrained models would be the most studied area in academic deep learning... but you'd be very, very wrong! The importance of pretrained models is generally not recognized or discussed in most courses, books, or software library features, and is rarely considered in academic papers. As we write this at the start of 2020, things are just starting to change, but it's likely to take a while. So be careful: most people you speak to will probably greatly underestimate what you can do in deep learning with few resources, because they probably won't deeply understand how to use pretrained models."
"Using pretrained models is the *most* important method we have to allow us to train more accurate models, more quickly, with less data, and less time and money. You might think that would mean that using pretrained models would be the most studied area in academic deep learning... but you'd be very, very wrong! The importance of pretrained models is generally not recognized or discussed in most courses, books, or software library features, and is rarely considered in academic papers. As we write this at the start of 2020, things are just starting to change, but it's likely to take a while. So be careful: most people you speak to will probably greatly underestimate what you can do in deep learning with few resources, because they probably won't deeply understand how to use pretrained models.\n",
"\n",
"Using a pretrained model for a task different to what it was originally trained for is known as *transfer learning*. Unfortunately, because transfer learning is so under-studied, few domains have pretrained models available. For instance, there are currently few pretrained models available in medicine, making transfer learning challenging to use in that domain. In addition, it is not yet well understood how to use transfer learning for tasks such as time series analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Transfer learning: Using a pretrained model for a task different to what it was originally trained for."
]
},
{
@ -1588,19 +1644,26 @@
"\n",
"This is the key to deep learning — how to fit the parameters of a model to get it to solve your problem. In order to fit a model, we have to provide at least one piece of information: how many times to look at each image (known as number of *epochs*). The number of epochs you select will largely depend on how much time you have available, and how long you find it takes in practice to fit your model. If you select a number that is too small, you can always train for more epochs later.\n",
"\n",
"But why is the method called `fine_tune`, and not `fit`? fastai actually *does* have a method called `fit`, which does indeed fit a model (i.e. look at images in the training set multiple times, each time updating the *parameters* to make the predictions closer and closer to the *target labels*). But in this case, we've started with a pretrained model, and we don't want to throw away all those capabilities that it already has. As we'll learn in this book, there are some important tricks to adapt a pretrained model for a new dataset -- a process called *fine-tuning*. When you use the `fine_tune` method, fastai will use these tricks for you. There are a few parameters you can set (which we'll discuss later), but in the default form shown here, it does two steps:\n",
"\n",
"1. Use one *epoch* to fit just those parts of the model necessary to get the new random *head* to work correctly with your dataset\n",
"1. Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which, as we'll see, generally don't require many changes from the pretrained weights)\n",
"\n",
"The *head* of a model is the part that is newly added to be specific to the new dataset. An *epoch* is one complete pass through the dataset. After calling `fit`, the results after each epoch are printed, showing the epoch number, the training and validation set losses (the \"measure of performance\" used for training the model), and any *metrics* you've requested (error rate, in this case)."
"But why is the method called `fine_tune`, and not `fit`? fastai actually *does* have a method called `fit`, which does indeed fit a model (i.e. look at images in the training set multiple times, each time updating the *parameters* to make the predictions closer and closer to the *target labels*). But in this case, we've started with a pretrained model, and we don't want to throw away all those capabilities that it already has. As we'll learn in this book, there are some important tricks to adapt a pretrained model for a new dataset -- a process called *fine-tuning*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Metric and Loss: A *metric* is a calculation that is made after each epoch and displayed so that you can see how well your model is training. It's not used as part of the actual learning process. The *loss* is the \"measure of performance\" that is used by the learning process to define whether one set of parameters is better or worse than another; the learning process works to make the loss as low as possible."
"> jargon: Fine tuning: A transfer learning technique where the weights of a pretrained model are updated by training for additional epochs using a different task to that used for pretraining."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you use the `fine_tune` method, fastai will use these tricks for you. There are a few parameters you can set (which we'll discuss later), but in the default form shown here, it does two steps:\n",
"\n",
"1. Use one *epoch* to fit just those parts of the model necessary to get the new random *head* to work correctly with your dataset\n",
"1. Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which, as we'll see, generally don't require many changes from the pretrained weights)\n",
"\n",
"The *head* of a model is the part that is newly added to be specific to the new dataset. An *epoch* is one complete pass through the dataset. After calling `fit`, the results after each epoch are printed, showing the epoch number, the training and validation set losses (the \"measure of performance\" used for training the model), and any *metrics* you've requested (error rate, in this case)."
]
},
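{
"cell_type": "markdown",
"metadata": {},
"source": [
"In code, the default behavior is roughly equivalent to the following sketch (simplified; the real `fine_tune` also picks sensible learning rates and has more options):\n",
"\n",
"```python\n",
"epochs = 1                # the number you passed to fine_tune\n",
"learn.freeze()            # step 1: only the new head's parameters can change\n",
"learn.fit_one_cycle(1)    # one epoch to get the random head working with your data\n",
"learn.unfreeze()          # step 2: now allow the whole model to update\n",
"learn.fit_one_cycle(epochs, lr_max=slice(1e-5, 1e-3))  # smaller steps for earlier layers\n",
"```"
]
},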
{
@ -1621,7 +1684,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"At this stage we have an image recogniser that is working very well, but we have no idea what it is actually doing! Although many people complain that deep learning results in impenetrable \"black box\" models (that is, something that gives predictions but that no one can understand), this really couldn't be further from the truth. There is a vast body of research showing how to deeply inspect deep learning models, and get rich insights from them.\n",
"At this stage we have an image recogniser that is working very well, but we have no idea what it is actually doing! Although many people complain that deep learning results in impenetrable \"black box\" models (that is, something that gives predictions but that no one can understand), this really couldn't be further from the truth. There is a vast body of research showing how to deeply inspect deep learning models, and get rich insights from them. Having said that, all kinds of machine learning models (including deep learning, and traditional statistical models) can be challenging to fully understand, especially when considering how they will behave when coming across data that is very different to the data used to train them. We'll be discussing this issue throughout this book.\n",
"\n",
"In 2013 a PhD student, Matt Zeiler, and his supervisor, Rob Fergus, published the paper [Visualizing and Understanding Convolutional Networks](https://arxiv.org/pdf/1311.2901.pdf), which showed how to visualise the neural network weights learned in each layer of a model. They carefully analysed the model that won the 2012 ImageNet competition, and used this analysis to greatly improve the model, such that they were able to go on to win the 2013 competition! <<img_layer1>> is the picture that they published of the first layers' weights."
]
@ -1690,7 +1753,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### What image recognizers can do"
"### Image recognizers can tackle non-image tasks"
]
},
{
@ -1817,7 +1880,7 @@
"source": [
"With this vocabulary in hand, we are now in a position to bring together all the key concepts so far. Take a moment to review those definitions and read the following summary. If you can follow the explanation, then you have laid down the basic coordinates for understanding many discussions to come.\n",
"\n",
"*Deep learning* is a specialty within *machine learning*, a discipline where we define a program not by writing it entirely ourselves but by using data. *Image classification* is a representative example. We start with *labeled data*, that is, a set of images where we have assigned a *label* to each image indicating what it represents. Our goal is to produce a program, called a *model*, which, given a new image, will make an accurate *prediction* regarding what that new image represents.\n",
"*Machine learning* is a discipline where we define a program not by writing it entirely ourselves, but by learning from data. *Deep learning* is a specialty within machine learning with uses *neural networks* using multiple *layers*. *Image classification* is a representative example (also known as *image recognition*). We start with *labeled data*, that is, a set of images where we have assigned a *label* to each image indicating what it represents. Our goal is to produce a program, called a *model*, which, given a new image, will make an accurate *prediction* regarding what that new image represents.\n",
"\n",
"Every model starts with a choice of *architecture*, a general template for how that kind of model works internally. The process of *training* (or *fitting*) the model is the process of finding a set of *parameter values* (or *weights*) which specializes that general architecture into a model that works well for our particular kind of data. In order to define how well a model does on a single prediction, we need to define a *loss function*, which defines how we score a prediction as good or bad.\n",
"\n",
@ -2484,7 +2547,9 @@
"source": [
"This model is predicting movie ratings on a scale of 0.5 to 5.0 to within around 0.6 average error. Since we're predicting a continuous number, rather than a category, we have to tell fastai what range our target has, using the `y_range` parameter.\n",
"\n",
"Although we're not actually using a pretrained model (for the same reason that we didn't for the tabular model), this example shows that fastai let's us use `fine_tune` even in this case (we'll learn how and why this works later in <<chapter_pet_breeds>>). We can use the same `show_results` call we saw earlier to view a few examples of user and movie IDs, actual ratings, and predictions:"
"Although we're not actually using a pretrained model (for the same reason that we didn't for the tabular model), this example shows that fastai let's us use `fine_tune` even in this case (we'll learn how and why this works later in <<chapter_pet_breeds>>). Sometimes it's best to experiment with `fine_tune` versus `fit_one_cycle` to see which works best for your dataset.\n",
"\n",
"We can use the same `show_results` call we saw earlier to view a few examples of user and movie IDs, actual ratings, and predictions:"
]
},
{
@ -2650,7 +2715,7 @@
"source": [
"As we've discussed, the goal of a model is to make predictions about data. But the model training process is fundamentally dumb. If we trained a model with all our data, and then evaluated the model using that same data, we would not be able to tell how well our model can perform on data it hasnt seen. Without this very valuable piece of information to guide us in training our model, there is a very good chance it would become good at making predictions about that data but would perform poorly on new data.\n",
"\n",
"It is in order to avoid this that our first step was to split our dataset into two sets, the training set (which our model sees in training) and the validation set (which is used only for evaluation). This lets us test that the model learns lessons from the training data which generalize to new data, the validation data.\n",
"It is in order to avoid this that our first step was to split our dataset into two sets, the *training set* (which our model sees in training) and the *validation set*, also known as the *development set* (which is used only for evaluation). This lets us test that the model learns lessons from the training data which generalize to new data, the validation data.\n",
"\n",
"One way to understand this situation is that, in a sense, we don't want our model to get good results by \"cheating\". If it predicts well on a data item, that should be because it has learned principles that govern that kind of item, and not because the model has been shaped by *actually having seeing that particular item*.\n",
"\n",
@ -2667,11 +2732,13 @@
"\n",
"The solution to this conundrum is to introduce another level of even more highly reserved data, the \"test set\". Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves. It cannot be used to improve the model; it can only be used to evaluate the model at the very end of our efforts. In effect, we define a hierarchy of cuts of our data, based on how fully we want to hide it from training and modelling processes -- training data is fully exposed, the validation data is less exposed, and test data is totally hidden. This hierarchy parallels the different kinds of modelling and evaluation processes themselves -- the automatic training process with back propagation, the more manual process of trying different hyper-parameters between training sessions, and the assessment of our final result.\n",
"\n",
"Having two levels of \"reserved data\", a validation set and a test set -- with one level representing data which you are virtually hiding from yourself -- may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest.\n",
"The test and validation sets should have enough data to ensure that you get a good estimate of your accuracy. If you're creating a cat detector, for instance, you generally want at least 30 cats in your validation set. That means that if you have a dataset with thousands of items, using the default 20% validation set size may be larger than you need. On the other hand, if you have lots of data, using some of it for the validation probably doesn't have any downsides.\n",
"\n",
"Having two levels of \"reserved data\", a validation set and a test set -- with one level representing data which you are virtually hiding from yourself -- may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set--if you have very little data, you may need to just have a validation set--but generally it's best to use one if at all possible.\n",
"\n",
"This same discipline can be critical if you intend to hire a third-party to perform modelling work on your behalf. A third-party might not understand your requirements accurately, or their incentives might even encourage them to misunderstand them. But a good test set can greatly mitigate these risks and let you evaluate if their work solves your actual problem.\n",
"\n",
"To put it bluntly, if you're a senior decision maker in your organization (or you're advising senior decision makers) then the most important takeaway is this: if you ensure that you really understand what a test set is, and why it's important, then you'll avoid the single biggest source of failures we've seen when organizations decide to use AI. For instance, if you're considering bringing in an external vendor or service, make sure that you hold out some test data that the vendor *never gets to see*. Then *you* check their model on your test data, using a metric that *you* choose based on what actually matters to you in practice, and *you* decide what level of performance is adequate. (It's also a good idea for you to try out some simple baseline yourself, so you know what a really simple model can achieve. Often it'll turn out that your simple model can be just as good as an external \"expert\"!)"
"To put it bluntly, if you're a senior decision maker in your organization (or you're advising senior decision makers) then the most important takeaway is this: if you ensure that you really understand what test and validation sets are, and why they're important, then you'll avoid the single biggest source of failures we've seen when organizations decide to use AI. For instance, if you're considering bringing in an external vendor or service, make sure that you hold out some test data that the vendor *never gets to see*. Then *you* check their model on your test data, using a metric that *you* choose based on what actually matters to you in practice, and *you* decide what level of performance is adequate. (It's also a good idea for you to try out some simple baseline yourself, so you know what a really simple model can achieve. Often it'll turn out that your simple model can be just as good as an external \"expert\"!)"
]
},
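{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this concrete, here is a minimal sketch of how a dataset might be partitioned into training, validation, and test sets before any modelling begins. (This is plain Python rather than anything fastai-specific, and the 60/20/20 proportions are just an illustrative assumption.)\n",
"\n",
"```python\n",
"import random\n",
"\n",
"items = list(range(1000))           # stand-in for your dataset items\n",
"random.seed(42)                     # fix the split so it is reproducible\n",
"random.shuffle(items)\n",
"\n",
"n_valid = int(0.2 * len(items))     # used between training runs to compare models\n",
"n_test  = int(0.2 * len(items))     # held back from everyone until the very end\n",
"\n",
"valid_set = items[:n_valid]\n",
"test_set  = items[n_valid:n_valid + n_test]\n",
"train_set = items[n_valid + n_test:]   # fully exposed to the training process\n",
"```"
]
},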
{
@ -2742,7 +2809,7 @@
"source": [
"After time series, a second common case is when you can easily anticipate ways the data you will be making predictions for in production may be *qualitatively different* from the data you have to train your model with.\n",
"\n",
"In the Kaggle [distracted driver competition](https://www.kaggle.com/c/state-farm-distracted-driver-detection), the independent data are pictures of drivers at the wheel of a car, and the dependent variable is a category such as texting, eating, or safely looking ahead. Lots of pictures were of the same drivers in different positions, as we can see in <<imng_driver>>. If you were the insurance company building a model from this data, note that you would be most interested in how the model performs on drivers you haven't seen before (since you would likely have training data only for a small group of people). This is true of the Kaggle competition as well: the test data consists of people that weren't used in the training set."
"In the Kaggle [distracted driver competition](https://www.kaggle.com/c/state-farm-distracted-driver-detection), the independent variables are pictures of drivers at the wheel of a car, and the dependent variable is a category such as texting, eating, or safely looking ahead. Lots of pictures were of the same drivers in different positions, as we can see in <<imng_driver>>. If you were the insurance company building a model from this data, note that you would be most interested in how the model performs on drivers you haven't seen before (since you would likely have training data only for a small group of people). This is true of the Kaggle competition as well: the test data consists of people that weren't used in the training set."
]
},
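{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to build such a validation set is to split by *person* rather than by image, so that no driver appears in both sets. Here is a rough sketch of that idea; the `driver_id` column and the tiny DataFrame are hypothetical, standing in for whatever metadata your own dataset provides.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical table: one row per photo, recording which driver it shows\n",
"df = pd.DataFrame({\n",
"    'filename':  ['img0.jpg', 'img1.jpg', 'img2.jpg', 'img3.jpg'],\n",
"    'driver_id': ['p002',     'p002',     'p014',     'p014'],\n",
"    'label':     ['texting',  'safe',     'eating',   'safe'],\n",
"})\n",
"\n",
"drivers = df['driver_id'].unique()\n",
"valid_drivers = set(drivers[:max(1, len(drivers) // 5)])   # hold out ~20% of *people*\n",
"\n",
"valid_df = df[df['driver_id'].isin(valid_drivers)]\n",
"train_df = df[~df['driver_id'].isin(valid_drivers)]        # no driver appears in both\n",
"```"
]
},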
{

View File

@ -121,7 +121,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many domains in which deep learning has not been used to analyse images yet, but those where it has been tried have nearly universally shown that computers can recognise what items are in an image at least as well as people can — even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing whereabouts objects in an image are, and can highlight their location and name each found object. This is known as *object detection* (there is also a variant of this we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part of--this is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. If the training data did not contain hand-drawn images then the model will probably do poorly on hand-drawn images. There is no general way to check what types of image are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out of domain* data). TK previous chapter showed a parabola overfitting. possibly a good place to show out of domain data outside the sampled parabola?\n",
"There are many domains in which deep learning has not been used to analyse images yet, but those where it has been tried have nearly universally shown that computers can recognise what items are in an image at least as well as people can — even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing whereabouts objects in an image are, and can highlight their location and name each found object. This is known as *object detection* (there is also a variant of this we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part of--this is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. If the training data did not contain hand-drawn images then the model will probably do poorly on hand-drawn images. There is no general way to check what types of image are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out of domain* data).\n",
"\n",
"One major challenge for object detection systems is that image labelling can be slow and expensive. There is a lot of work at the moment going into tools to try to make this labelling faster and easier, and require less handcrafted labels to train accurate object detection models. One approach which is particularly helpful is to synthetically generate variations of input images, such as by rotating them, or changing their brightness and contrast; this is called *data augmentation* and also works well for text and other types of model. We will be discussing it in detail in this chapter.\n",
"\n",
@ -210,7 +210,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: please add a transition"
"There are many accurate models that are of no use to anyone, and many inaccurate models that are highly useful. To ensure that your modeling work is useful in practice, you need to consider how your work will be used. In 2012 Jeremy, along with Margit Zwemer and Mike Loukides, introduced a method called *The Drivetrain Approach* for thinking about this issue."
]
},
{
@ -224,7 +224,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many accurate models that are of no use to anyone, and many inaccurate models that are highly useful. To ensure that your modeling work is useful in practice, you need to consider how your work will be used. In 2012 Jeremy, along with Margit Zwemer and Mike Loukides, introduced a method called *The Drivetrain Approach* for thinking about this issue, which we will summarize here, and illustrate in <<drivetrain>>. For more information, see the full article on oreilly.com [Designing Great Data Products](https://www.oreilly.com/radar/drivetrain-approach-data-products/).\n",
"The Drivetrain approach, illustrated in <<drivetrain>>, was described in detail in [Designing Great Data Products](https://www.oreilly.com/radar/drivetrain-approach-data-products/). The basic idea is to start with considering your objective, then think about what you can actually do to change that objective (\"levers\"), what data you have that might help you connect potential changes to levers to changes in your objective, and then to build a model of that. You can then use that model to find the best actions (that is, changes to levers) to get the best results in terms of your objective.\n",
"\n",
"Consider a model in an autonomous vehicle, you want to help a car drive safely from point A to point B without human intervention. Great predictive modeling is an important part of the solution, but it doesn't stand on its own; as products become more sophisticated, it disappears into the plumbing. Someone using a self-driving car is completely unaware of the hundreds (if not thousands) of models and the petabytes of data that make it work. But as data scientists build increasingly sophisticated products, they need a systematic design approach.\n",
"\n",

View File

@ -146,14 +146,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: \"Why does this matter?\" as an alternative title."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### So what?"
"### Why does this matter?"
]
},
{
@ -266,14 +259,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy-Rachel: Explain why those topics are important and transition to errors and recourse."
"Let's look at each in turn."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Errors and recourse"
"### Recourse and accountability"
]
},
{
@ -384,14 +377,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: \"Why only four? Tell the reader.\" If you have anything interesting to say about that here, otherwise we can ignore."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll discuss four of these types of bias here (see the paper for details on the others)."
"We'll discuss the four of these types of bias here that we've found most helpful in our own work (see the paper for details on the others)."
]
},
{
@ -504,16 +490,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: \"Tell the reader what the figure shows, what's the takeaway?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we can see that the lower-income soap example is a very long way away from being accurate, with every commercial image recognition service predicting \"food\" as the most likely answer!\n",
"\n",
"As we will discuss shortly, in addition, the vast majority of AI researchers and developers are young white men. Most projects that we have seen do most user testing using friends and families of the immediate product development group. Given this, the kinds of problems we just discussed should not be surprising.\n",
"\n",
"Similar historical bias is found in the texts used as data for natural language processing models. This crops up in downstream machine learning tasks in many ways. For instance, until last year Google Translate showed systematic bias in how it translated the Turkish gender-neutral pronoun \"bir\" into English. For instance, when applied to jobs which are often associated with males, it used \"he\", and when applied to jobs which are often associated with females, it used \"she\":"
"Similar historical bias is found in the texts used as data for natural language processing models. This crops up in downstream machine learning tasks in many ways. For instance, it [was widely reported](https://nypost.com/2017/11/30/google-translates-algorithm-has-a-gender-bias/) that until last year Google Translate showed systematic bias in how it translated the Turkish gender-neutral pronoun \"o\" into English. For instance, when applied to jobs which are often associated with males, it used \"he\", and when applied to jobs which are often associated with females, it used \"she\":"
]
},
{
@ -523,13 +504,6 @@
"<img src=\"images/ethics/image11.png\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Link to the study needed"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -645,14 +619,9 @@
" - People are more likely to assume algorithms are objective or error-free (even if theyre given the option of a human override)\n",
" - Algorithms are more likely to be implemented with no appeals process in place\n",
" - Algorithms are often used at scale\n",
" - Algorithmic systems are cheap."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Takeaway for readers and transition to disinformation."
" - Algorithmic systems are cheap.\n",
"\n",
"Even in the absence of bias, algorithms (and deep learning especially, since it is such an effective and scalable algorithm) can lead to negative societal problems, such as when used for *disinformation*."
]
},
{
@ -690,21 +659,16 @@
"\n",
"Disinformation through auto-generated text is a particularly significant issue, due to the greatly increased capability provided by deep learning. We discuss this issue in depth when we learn to create language models, in <<chapter_nlp>>.\n",
"\n",
"One proposed approach is to develop some form of digital signature, implement it in a seamless way, and to create norms that we should only trust content which has been verified. Head of the Allen Institute on AI, Oren Etzioni, wrote such a proposal in an article titled [How Will We Prevent AI-Based Forgery?](https://hbr.org/2019/03/how-will-we-prevent-ai-based-forgery), \"AI is poised to make high-fidelity forgery inexpensive and automated, leading to potentially disastrous consequences for democracy, security, and society. The specter of AI forgery means that we need to act to make digital signatures de rigueur as a means of authentication of digital content.\""
"One proposed approach is to develop some form of digital signature, implement it in a seamless way, and to create norms that we should only trust content which has been verified. Head of the Allen Institute on AI, Oren Etzioni, wrote such a proposal in an article titled [How Will We Prevent AI-Based Forgery?](https://hbr.org/2019/03/how-will-we-prevent-ai-based-forgery), \"AI is poised to make high-fidelity forgery inexpensive and automated, leading to potentially disastrous consequences for democracy, security, and society. The specter of AI forgery means that we need to act to make digital signatures de rigueur as a means of authentication of digital content.\"\n",
"\n",
"Whilst we can't hope to discuss all the ethical issues that deep learning, and algorithms more generally, bring up, hopefully this brief introduction has been a useful starting point you can build on. We'll now move on to the questions of how to identify ethical issues, and what to do about them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Wrap up section and transition to next. Also change next title to What to do about bla or What to do with foo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What to do"
"## Identifying and addressing ethical issues"
]
},
{
@ -754,7 +718,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Expand--add some additional details and takeaways from the reader. What will they get out of doing this and how should they go about it? Then transition to \"Process to Implement\""
"These questions may be able to help you identify outstanding issues, and possible alternatives that are easier to understand and control. In addition to asking the right questions, it's also important to consider processes to implement."
]
},
{
@ -778,13 +742,6 @@
" - Who might use this product that we didnt expect to use it, or for purposes we didnt initially intend?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Add takeaways and transition to Ethical Lenses"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -802,13 +759,8 @@
" - The Justice Approach:: Which option treats people equally or proportionately?\n",
" - The Utilitarian Approach:: Which option will produce the most good and do the least harm?\n",
" - The Common Good Approach:: Which option best serves the community as a whole, not just some members?\n",
" - The Virtue Approach:: Which option leads me to act as the sort of person I want to be?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" - The Virtue Approach:: Which option leads me to act as the sort of person I want to be?\n",
"\n",
"Markkula's recommendations include a deeper dive into each of these perspectives, including looking at a project based on a focus on its *consequences*:\n",
"\n",
" - Who will be directly affected by this project? Who will be indirectly affected?\n",
@ -816,94 +768,16 @@
" - Are we thinking about all relevant types of harm/benefit (psychological, political, environmental, moral, cognitive, emotional, institutional, cultural)?\n",
" - How might future generations be affected by this project?\n",
" - Do the risks of harm from this project fall disproportionately on the least powerful in society? Will the benefits go disproportionately the well-off?\n",
" - Have we adequately considered dual-use?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" - Have we adequately considered dual-use?\n",
"\n",
"The alternative lens to this is the *deontological* perspective, which focuses on basic *right* and *wrong*:\n",
"\n",
" - What rights of others & duties to others must we respect?\n",
" - How might the dignity & autonomy of each stakeholder be impacted by this project?\n",
" - What considerations of trust & of justice are relevant to this design/project?\n",
" - Does this project involve any conflicting moral duties to others, or conflicting stakeholder rights? How can we prioritize these?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fairness, accountability, and transparency"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The professional society for computer scientists, the ACM, runs a conference on data ethics called the \"Conference on Fairness, Accountability, and Transparency\". \"Fairness, Accountability, and Transparency\" sometimes goes under the acronym *FAT*, although nowadays it's changing to *FAccT*. Microsoft has a group focused on \"Fairness, Accountability, Transparency, and Ethics\" (FATE). The various versions of this lens have resulted in the acronym \"FAT*\" seeing wide usage. In this section, we'll use \"FAccT\" to refer to the concepts of *Fairness, Accountability, and Transparency*.\n",
" - Does this project involve any conflicting moral duties to others, or conflicting stakeholder rights? How can we prioritize these?\n",
"\n",
"FAccT is another lens that you may find useful in considering ethical issues. One useful resource for this is the free online book [Fairness and machine learning; Limitations and Opportunities](https://fairmlbook.org/), which \"gives a perspective on machine learning that treats fairness as a central concern rather than an afterthought.\" It also warns, however, that it \"is intentionally narrow in scope... A narrow framing of machine learning ethics might be tempting to technologists and businesses as a way to focus on technical interventions while sidestepping deeper questions about power and accountability. We caution against this temptation.\" Rather than provide an overview of the FAccT approach to ethics (which is better done in books such as the one linked above), our focus here will be on the limitations of this kind of narrow framing.\n",
"\n",
"One great way to consider whether an ethical lens is complete, is to try to come up with an example where the lens and our own ethical intuitions give diverging results. Os Keyes explored this in a graphic way in their paper [A Mulching Proposal\n",
"Analysing and Improving an Algorithmic System for Turning the Elderly into High-Nutrient Slurry](https://arxiv.org/abs/1908.06166). The paper's abstract says:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> : The ethical implications of algorithmic systems have been much discussed in both HCI and the broader community of those interested in technology design, development and policy. In this paper, we explore the application of one prominent ethical framework - Fairness, Accountability, and Transparency - to a proposed algorithm that resolves various societal issues around food security and population aging. Using various standardised forms of algorithmic audit and evaluation, we drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system. We discuss how this might serve as a guide to other researchers or practitioners looking to ensure better ethical outcomes from algorithmic systems in their line of work."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this paper, the rather controversial proposal (\"Turning the Elderly into High-Nutrient Slurry\") and the results (\"drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system\") are at odds... to say the least!\n",
"\n",
"In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc, which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Add takeaways for the reader and transition to Role of Policy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Role of Policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ethical issues that arise in the use of automated decision systems, such as machine learning, can be complex and far-reaching. To better address them, we will need thoughtful policy, in addition to the ethical efforts of those in industry. Neither is sufficient on its own.\n",
"\n",
"Policy is the appropriate tool for addressing:\n",
"\n",
"- Negative externalities\n",
"- Misaligned economic incentives\n",
"- “Race to the bottom” situations\n",
"- Enforcing accountability.\n",
"\n",
"Ethical behavior in industry is necessary as well, since:\n",
"\n",
"- Law will not always keep up\n",
"- Edge cases will arise in which practitioners must use their best judgement."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Jeremy: Expand this section. What does this mean for the reader? Add transition to The Power of Diversity"
"One of the best ways to help come up with complete and thoughtful answers to questions like these is to ensure that the people asking the questions are *diverse*."
]
},
{
@ -946,6 +820,71 @@
"This leaves a big opportunity for companies that are ready to look beyond status and pedigree, and focus on results!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fairness, accountability, and transparency"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The professional society for computer scientists, the ACM, runs a conference on data ethics called the \"Conference on Fairness, Accountability, and Transparency\". \"Fairness, Accountability, and Transparency\" sometimes goes under the acronym *FAT*, although nowadays it's changing to *FAccT*. Microsoft has a group focused on \"Fairness, Accountability, Transparency, and Ethics\" (FATE). The various versions of this lens have resulted in the acronym \"FAT*\" seeing wide usage. In this section, we'll use \"FAccT\" to refer to the concepts of *Fairness, Accountability, and Transparency*.\n",
"\n",
"FAccT is another lens that you may find useful in considering ethical issues. One useful resource for this is the free online book [Fairness and machine learning; Limitations and Opportunities](https://fairmlbook.org/), which \"gives a perspective on machine learning that treats fairness as a central concern rather than an afterthought.\" It also warns, however, that it \"is intentionally narrow in scope... A narrow framing of machine learning ethics might be tempting to technologists and businesses as a way to focus on technical interventions while sidestepping deeper questions about power and accountability. We caution against this temptation.\" Rather than provide an overview of the FAccT approach to ethics (which is better done in books such as the one linked above), our focus here will be on the limitations of this kind of narrow framing.\n",
"\n",
"One great way to consider whether an ethical lens is complete, is to try to come up with an example where the lens and our own ethical intuitions give diverging results. Os Keyes explored this in a graphic way in their paper [A Mulching Proposal\n",
"Analysing and Improving an Algorithmic System for Turning the Elderly into High-Nutrient Slurry](https://arxiv.org/abs/1908.06166). The paper's abstract says:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> : The ethical implications of algorithmic systems have been much discussed in both HCI and the broader community of those interested in technology design, development and policy. In this paper, we explore the application of one prominent ethical framework - Fairness, Accountability, and Transparency - to a proposed algorithm that resolves various societal issues around food security and population aging. Using various standardised forms of algorithmic audit and evaluation, we drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system. We discuss how this might serve as a guide to other researchers or practitioners looking to ensure better ethical outcomes from algorithmic systems in their line of work."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this paper, the rather controversial proposal (\"Turning the Elderly into High-Nutrient Slurry\") and the results (\"drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system\") are at odds... to say the least!\n",
"\n",
"In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc, which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution.\n",
"\n",
"So far, we've focused on things that you and your organization can do. But sometimes individual or organizational action is not enough. Sometimes, governments also need to consider policy implications."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Role of Policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ethical issues that arise in the use of automated decision systems, such as machine learning, can be complex and far-reaching. To better address them, we will need thoughtful policy, in addition to the ethical efforts of those in industry. Neither is sufficient on its own.\n",
"\n",
"Policy is the appropriate tool for addressing:\n",
"\n",
"- Negative externalities\n",
"- Misaligned economic incentives\n",
"- “Race to the bottom” situations\n",
"- Enforcing accountability.\n",
"\n",
"Ethical behavior in industry is necessary as well, since:\n",
"\n",
"- Law will not always keep up\n",
"- Edge cases will arise in which practitioners must use their best judgement.\n",
"\n",
"TK expand"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1056,6 +995,31 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@ -3346,11 +3346,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from of a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
"\n",
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare to the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
"\n",
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient* (actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know you your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest slope, you should eventually arrive at your destination. We use the gradient to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
"To find how to change the weights to make the loss a bit better, we use calculus to calculate the *gradient*. (actually, we let PyTorch do it for us!) Let's imagine you are lost in the mountains with your car parked at the lowest point. To find your way, you might wander in a random direction but that probably won't help much. Since you know your vehicle is at the lowest point, you would be better to go downhill. By always taking a step in the direction of the steepest slope, you should eventually arrive at your destination. We use the gradient to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the *learning rate* to decide on the step size."
]
},
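{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a tiny sketch of that whole loop in plain PyTorch, with a single weight and a made-up quadratic \"loss\" standing in for a real model and loss function; the learning rate of 0.1 is just an illustrative choice.\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"w = torch.tensor(3.0, requires_grad=True)   # one weight, starting at a bad value\n",
"lr = 0.1                                     # the learning rate we chose\n",
"\n",
"for step in range(20):\n",
"    loss = (w - 1.0) ** 2       # a toy loss: it is lowest when w == 1\n",
"    loss.backward()             # the calculus step (PyTorch does it for us)\n",
"    with torch.no_grad():\n",
"        w -= lr * w.grad        # step downhill, scaled by the learning rate\n",
"        w.grad.zero_()          # reset the gradient for the next step\n",
"\n",
"print(w)                        # close to 1.0 after a few steps\n",
"```"
]
},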
{
@ -3364,7 +3364,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get back to our MNIST problem. As we've seen, if we are going to calculate gradients (which we need), then we need some *loss function* that represents how good our model is. The obvious approach would be to use the accuracy for this purpose. In this case, we would calculate our prediction for each image, and then calculate the overall accuracy (remember, at first we simply use random weights), and then calculate the gradients of each weight with respect to that accuracy calculation.\n",
"Let's get back to our MNIST problem. As we've seen, we need gradients in order to improve our model, and in order to calculate gradients we need some *loss function* that represents how good our model is. That is because the gradients are a measure of how that loss function changes with small tweaks to the weights. The obvious approach would be to use the accuracy for this purpose. In this case, we would calculate our prediction for each image, and then calculate the overall accuracy (remember, at first we simply use random weights), and then calculate the gradients of each weight with respect to that accuracy calculation.\n",
"\n",
"Unfortunately, we have a significant technical problem here. The gradient of a function is its *slope*, or its steepness, which can be defined as *rise over run* -- that is, how much the value of function goes up or down, divided by how much you changed the input. We can write this in maths: `(y_new-y_old) / (x_new-x_old)`. Specifically, it is defined when x_new is very similar to x_old, meaning that their difference is very small. But accuracy only changes at all when a prediction changes from a 3 to a 7, or vice versa. So the problem is that a small change in weights from from x_old to x_new isn't likely to cause any prediction to change, so `(y_new - y_old)` will be zero. (In other words, the gradient is zero almost everywhere.) As a result, a very small change in the value of a weight will often not actually change the accuracy at all. This means it is not useful to use accuracy as a loss function. When we use accuracy as a loss function, most of the time our gradients will actually be zero, and the model will not be able to learn from that number. That is not much use at all!\n",
"\n",

View File

@ -37,7 +37,7 @@
"- make them better;\n",
"- apply them to a wider variety of types of data.\n",
"\n",
"In order to do these two things, we will have to learn all of the pieces of the deep learning puzzle. This includes: different types of layers, regularisation methods, optimisers, putting layers together into architectures, labelling techniques, and much more. We are not just going to dump all of these things out, but we will introduce them only as they are needed to solve an actual problem related to a project we are working on."
"In order to do these two things, we will have to learn all of the pieces of the deep learning puzzle. This includes: different types of layers, regularisation methods, optimisers, putting layers together into architectures, labelling techniques, and much more. We are not just going to dump all of these things out, but we will introduce them progressively as needed, to solve an actual problem related to the project we are working on."
]
},
{
@ -165,9 +165,9 @@
"\n",
"In this case, we need a regular expressin that extracts the pet breed from the file name.\n",
"\n",
"We do not have the space to give you a complete regular expression tutorial here, particularly because there are so many excellent ones online. And we know that many of you will already be familiar with this wonderful tool. If you're not, that is totally fine — this is a great opportunity for you to rectify that! We find that regular expressions are one of the most useful tools in our programming toolkit, and many of our students tell us that it is one of the things they are most excited to learn about. So head over to Google and search for *regular expressions tutorial* now, and then come back here after you've had a good look around.\n",
"We do not have the space to give you a complete regular expression tutorial here, particularly because there are so many excellent ones online. And we know that many of you will already be familiar with this wonderful tool. If you're not, that is totally fine — this is a great opportunity for you to rectify that! We find that regular expressions are one of the most useful tools in our programming toolkit, and many of our students tell us that it is one of the things they are most excited to learn about. So head over to Google and search for *regular expressions tutorial* now, and then come back here after you've had a good look around. The book website also provides a list of our favorites.\n",
"\n",
"> AG: Not only are regular expresssions dead handy, they also have interesting roots. They are \"regular\" becuase they they were originally examples of a \"regular\" language, the lowest rung within the \"Chomsky hierarchy\", a grammar classification due to the same linguist Noam Chomskey who wrote *Syntactic Structures*, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a space ship.\n",
 a: Not only">
"> a: Not only are regular expressions dead handy, they also have interesting roots. They are \"regular\" because they were originally examples of a \"regular\" language, the lowest rung within the \"Chomsky hierarchy\", a grammar classification due to the same linguist Noam Chomsky who wrote _Syntactic Structures_, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a space ship.\n",
"\n",
"When you are writing a regular expression, the best way to start is just to try it against one example at first. Let's use the `findall` method to try a regular expression against the filename of the `fname` object:"
]
@ -216,13 +216,6 @@
"dls = pets.dataloaders(path/\"images\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Presizing"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -234,13 +227,25 @@
"batch_tfms=aug_transforms(size=224, min_scale=0.75)\n",
"```\n",
"\n",
"These lines implement a fastai data augmentation strategy which we call *presizing*. Presizing is a particular way to do image augmentation, which is designed to minimize data destruction while maintaining good performance.\n",
"\n",
"These lines implement a fastai data augmentation strategy which we call *presizing*. Presizing is a particular way to do image augmentation, which is designed to minimize data destruction while maintaining good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Presizing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need our images to have the same dimensions, so that they can collate into tensors to be passed to the GPU. We also want to minimize the number of distinct augmentation computations we perform. So the performance requirement suggests that we should, where possible, compose our augmentation transforms into fewer transforms (to reduce the number of computations, and reduce the number of lossy operations) and transform the images into uniform sizes (to run compute efficiently on the GPU).\n",
"\n",
"The challenge is that, if performed after resizing down to the augmented size, various common data augmentation transforms might introduce spurious empty zones, degrade data, or both. For instance, rotating an image by 45 degrees fills corner regions of the new bounds with emptyness, which will not teach the model anything. Many rotation and zooming operations will require interpolating to create pixels. These interpolated pixels are derived from the original image data but are still of lower quality.\n",
"\n",
"To workaround these challenges, presizing adopts two strategies:\n",
"To workaround these challenges, presizing adopts two strategies that are shown in <<presizing>>:\n",
"\n",
"1. First, resizing images to relatively \"large dimensions\" that is, dimensions significantly larger than the target training dimensions. \n",
"1. Second, composing all of the common augmentation operations (including a resize to the final target size) into one, and performing the combined operation on the GPU only once at the end of processing, rather than performing them individually and interpolating multiple times.\n",
@ -319,7 +324,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see here that the image on the right is less well defined, and has reflection padding artifacts in the bottom left, and the grass in the top left has disappeared entirely. We find that in practice using presizing significantly improves the accuracy of models, and often results in speedups too."
"You can see here that the image on the right is less well defined, and has reflection padding artifacts in the bottom left, and the grass in the top left has disappeared entirely. We find that in practice using presizing significantly improves the accuracy of models, and often results in speedups too.\n",
"\n",
"Checking your data looks right is extremely important before training a model. There are simple ways to do this (and debug if needed) in the fastai library, let's look at them now."
]
},
{
@ -333,7 +340,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can never just assume that our code is working perfectly. We know that it works for at least one filename — but we really need to check that our dataset actually makes sense. To do this, before we do any modelling, we should always use the `show_batch` method:"
"We can never just assume that our code is working perfectly. Writing a `DataBlock` is just like writing a blueprint. You will get an error message if you have a syntax error somewhere in your code but you have no garanty that your template is going to work on your source of data as you intend. The first thing to do before we trying to train a model is to use the `show_batch` method and have a look at your data:"
]
},
{
@ -496,8 +503,10 @@
"\n",
"Collating items in a batch\n",
"Error! It's not possible to collate your items in a batch\n",
"Could not collate the 0-th members of your tuples because got the following shapes\n",
"torch.Size([3, 500, 375]),torch.Size([3, 375, 500]),torch.Size([3, 333, 500]),torch.Size([3, 375, 500])\n",
"Could not collate the 0-th members of your tuples because got the following \n",
"shapes:\n",
"torch.Size([3, 500, 375]),torch.Size([3, 375, 500]),torch.Size([3, 333, 500]),\n",
"torch.Size([3, 375, 500])\n",
"```"
]
},
@ -606,6 +615,18 @@
"## Cross entropy loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Cross entropy loss* is a loss function which is similar to the loss function we used in the previous chapter, but (as we'll see) has two benefits:\n",
"\n",
"- It works even when our dependent variable has more than two categories\n",
"- It results in faster and more reliable training.\n",
"\n",
"In order to understand how cross entropy loss works for dependent variables with more than two categories, we first have to understand what the actual data and activations that are loss function is seen look like."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -617,12 +638,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Cross entropy loss* is a loss function which is similar to the loss function we used in the previous chapter, but (as we'll see) has two benefits:\n",
"\n",
"- It works even when our dependent variable has more than two categories\n",
"- It results in faster and more reliable training.\n",
"\n",
"In order to understand how cross entropy loss works for dependent variables with more than two categories, we first have to understand what the actual data and activations that are loss function is seen look like. To actually get a batch of real data from our DataLoaders, we can use the one_batch method:"
"Let's have a look at the activations of our model. To actually get a batch of real data from our DataLoaders, we can use the `one_batch` method:"
]
},
{
@ -729,6 +745,13 @@
"len(preds[0]),preds[0].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To transform the activations of our model into predictions like this, we used something called the softmax activation function."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -740,7 +763,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"An activation function called *softmax* in the final layer is used to ensure that the activations are between zero and one, and that they sum to one.\n",
"In our classification model, an activation function called *softmax* in the final layer is used to ensure that the activations are between zero and one, and that they sum to one.\n",
"\n",
"Softmax is similar to the sigmoid function, which we saw earlier; sigmoid looks like this:"
]
@ -948,7 +971,9 @@
"source": [
"What does this function do in practice? Taking the exponential ensures all our numbers are positive, and then dividing by the sum ensures we are going to have a bunch of numbers that add up to one. The exponential also has a nice property: if one of the numbers in our activations `x` is slightly bigger than the others, the exponential will amplify this (since it grows, well... exponentially) which means that in the softmax, that number will be closer to 1. \n",
"\n",
"Intuitively, the Softmax function *really* wants to pick one class among the others, so it's ideal for training a classifier when we know each picture has a definite label. (Note that it may be less ideal during inference, as you might want your model to sometimes tell you it doesn't recognize any of the classes is has seen during training, and not pick a class because it has a slightly bigger activation score. In this case, it might be better to train a model using multiple binary output columns, each using a sigmoid activation.)"
"Intuitively, the Softmax function *really* wants to pick one class among the others, so it's ideal for training a classifier when we know each picture has a definite label. (Note that it may be less ideal during inference, as you might want your model to sometimes tell you it doesn't recognize any of the classes is has seen during training, and not pick a class because it has a slightly bigger activation score. In this case, it might be better to train a model using multiple binary output columns, each using a sigmoid activation.)\n",
"\n",
"Softmax is the first part of the cross entropy loss, the second part is log likeklihood. "
]
},
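{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of how these two parts fit together, here is a short sketch; the activations and targets are made up, standing in for the `acts` and `targ` tensors we work with in this chapter, and we will build up each piece properly in the next sections.\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"torch.manual_seed(42)\n",
"acts = torch.randn(4, 3)            # made-up activations: 4 items, 3 classes\n",
"targ = torch.tensor([0, 2, 1, 2])   # made-up correct labels\n",
"\n",
"def softmax(x): return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)\n",
"\n",
"sm = softmax(acts)\n",
"print(sm.sum(dim=1))                # each row of probabilities sums to 1\n",
"\n",
"# cross entropy = softmax, then log, then negative log likelihood\n",
"manual = F.nll_loss(torch.log(sm), targ)\n",
"print(manual, F.cross_entropy(acts, targ))   # the same number\n",
"```"
]
},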
{
@ -1136,7 +1161,7 @@
"\n",
"We're only picking the loss from the column containing the correct label. We don't to consider the other columns, because by the definition of softmax, they add up to one minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible, must mean we're also decreasing the activations of the remaining columns.\n",
"\n",
"PyTorch provides a function that does exactly the same thing as `sm_acts[range(n), targ]` (except it takes the negative, since we want a smaller loss to be better), called `nll_loss` (*NLL* stands for *negative log likelihood*):"
"PyTorch provides a function that does exactly the same thing as `sm_acts[range(n), targ]` (except it takes the negative, because when applying the log afterward, we will have negative numbers), called `nll_loss` (*NLL* stands for *negative log likelihood*):"
]
},
{
@ -1179,6 +1204,13 @@
"F.nll_loss(sm_acts, targ, reduction='none')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Despite the name being negative log likelihood, this PyTorch function does not take the log (we will see why in the next section). First, let's see why taking the logarithm can be useful."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1362,6 +1394,13 @@
"> s: An interesting feature about cross entropy loss appears when we consider its gradient. The gradient of `cross_entropy(a,b)` is just `softmax(a)-b`. Since `softmax(a)` is just the final activation of the model, that means that the gradient is proportional to the difference between the prediction and the target. This is the same as mean squared error in regression (assuming there's no final activation function such as that added by `y_range`), since the gradient of `(a-b)**2` is `2*(a-b)`. Since the gradient is linear, that means that we won't see sudden jumps or exponential increases in gradients, which should lead to smoother training of models."
]
},
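{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would like to check Sylvain's claim for yourself, here is a quick sketch; the single item of activations is made up, and `F.one_hot` is only used to turn the target into the same shape as the activations.\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"a = torch.tensor([[0.3, -1.2, 2.0]], requires_grad=True)   # made-up activations, one item\n",
"b = torch.tensor([2])                                       # its correct class\n",
"\n",
"F.cross_entropy(a, b).backward()\n",
"\n",
"onehot = F.one_hot(b, num_classes=3).float()\n",
"print(a.grad)                                      # gradient computed by PyTorch\n",
"print(torch.softmax(a.detach(), dim=1) - onehot)   # softmax(a) - b, as claimed\n",
"```"
]
},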
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have now seen all the pieces hidden behind our loss function. While it gives us a number on how well (or bad) our model is doing, it does nothing to help us know if it's actually any good. Let's now see some ways to interpret our model predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1444,7 +1483,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are not pet breed experts, it is hard for us to know whether these category errors reflect actual difficulties in recognising breeds. So again, we turn to Google. A little bit of googling tells us that the most common category errors shown here are actually breed differences which even expert breeders sometimes disagree about. So this gives us some comfort that we are on the right track."
"Since we are not pet breed experts, it is hard for us to know whether these category errors reflect actual difficulties in recognising breeds. So again, we turn to Google. A little bit of googling tells us that the most common category errors shown here are actually breed differences which even expert breeders sometimes disagree about. So this gives us some comfort that we are on the right track.\n",
"\n",
"So we seem to have a good baseline. What can we do now ot make it even better?"
]
},
{
@ -1454,6 +1495,15 @@
"## Improving our model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now look a at a range of techniques to improve the training of our model and make it better. While doing so, we will explain a little bit more about transfer learning and how to fine-tune our pretrained model as best as possible, without breaking the pretrained weights.\n",
"\n",
"The first thing we need to set when training a model is the learning rate. We saw in the previous chapter that it needed to be just right to train as efficiently as possible, so how do we pick a good one? fastai provides something called the Learning rate finder for this."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1707,7 +1757,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Something really interesting about this is that it was only discovered in 2015. Neural networks have been under development since the 1950s. Throughout that time finding a good learning rate has been, perhaps, the most important and challenging issue for practitioners. The idea does not require any advanced maths, giant computing resources, huge datasets, or anything else that would make it inaccessible to any curious researcher. Furthermore, the researcher who did develop it, Leslie Smith, was not part of some exclusive Silicon Valley lab, but was working as a naval researcher. All of this is to say: breakthrough work in deep learning absolutely does not require access to vast resources, elite teams, or advanced mathematical ideas. There is lots of work still to be done which requires just a bit of common sense, creativity, and tenacity."
"Something really interesting about the learning rate finder is that it was only discovered in 2015. Neural networks have been under development since the 1950s. Throughout that time finding a good learning rate has been, perhaps, the most important and challenging issue for practitioners. The idea does not require any advanced maths, giant computing resources, huge datasets, or anything else that would make it inaccessible to any curious researcher. Furthermore, Leslie Smith, was not part of some exclusive Silicon Valley lab, but was working as a naval researcher. All of this is to say: breakthrough work in deep learning absolutely does not require access to vast resources, elite teams, or advanced mathematical ideas. There is lots of work still to be done which requires just a bit of common sense, creativity, and tenacity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a good learning rate to train our model, let's look at how we can finetune the weights of a pretrained model."
]
},
{
@ -1727,7 +1784,7 @@
"\n",
"This final linear layer is unlikely to be of any use for us, when we are fine tuning in a transfer learning setting, because it is specifically designed to classify the categories in the original pretraining dataset. So when we do transfer learning we remove it, and throw it away, and replace it with a new linear layer with the correct number of outputs for our desired task (in this case, there would be 37 activations).\n",
"\n",
"This newly added linear layer will have entirely random weights. Therefore, our model prior to fine tuning has entirely random outputs. But that does not mean that it is an entirely random model! All of the layers prior to the last one have been carefully trained to be good at image classification tasks in general. As we saw in the images from the Zeiler and Fergus paper, the first layers encode very general concepts such as finding gradients and edges, and later layers encode concepts that are still very useful for us, such as finding eyeballs and fur.\n",
"This newly added linear layer will have entirely random weights. Therefore, our model prior to fine tuning has entirely random outputs. But that does not mean that it is an entirely random model! All of the layers prior to the last one have been carefully trained to be good at image classification tasks in general. As we saw in the images from the Zeiler and Fergus paper in <<chapter_intro>> (see <<img_layer1>> and followings), the first layers encode very general concepts such as finding gradients and edges, and later layers encode concepts that are still very useful for us, such as finding eyeballs and fur.\n",
"\n",
"We want to train a model in such a way that we allow it to remember all of these generally useful ideas from the pretrained model, use them to solve our particular task (classify pet breeds), and only adjust them as required for the specifics of our particular task.\n",
"\n",
@ -1743,7 +1800,7 @@
"- train the randomly added layers for one epoch, with all other layers frozen ;\n",
"- unfreeze all of the layers, and train them all for the number of epochs requested.\n",
"\n",
"Although this is a reasonable default approach, it is likely that for your particular dataset you may get better results by doing things slightly differently. The `fine_tune` method has a number of parameters you can use to change its behaviour, but it might be easiest for you to just call the underlying methods directly. Remember that you can see the source code for the method by using the following syntax:\n",
"Although this is a reasonable default approach, it is likely that for your particular dataset you may get better results by doing things slightly differently. The `fine_tune` method has a number of parameters you can use to change its behaviour, but it might be easiest for you to just call the underlying methods directly if you want to get some custom behavior. Remember that you can see the source code for the method by using the following syntax:\n",
"\n",
" learn.fine_tune??\n",
"\n",
@ -1960,7 +2017,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This has improved our model a bit, but there's more we can do..."
"This has improved our model a bit, but there's more we can do. The deepest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for those, something called discriminative learning rates."
]
},
{
@ -1978,7 +2035,7 @@
"\n",
"In addition, do you remember the images we saw in <<chapter_intro>>, showing what each layer learns? The first layer learns very simple foundations, like edge and gradient detectors; these are likely to be just as useful for nearly any task. The later layers learn much more complex concepts, like \"eye\" and \"sunset\", which might not be useful in your task at all (maybe you're classifying car models, for instance). So it makes sense to let the later layers fine-tune more quickly than earlier layers.\n",
"\n",
"Therefore, fastai by default does something called *discriminative learning rates*. This was originally developed in the ULMFiT approach to NLP transfer learning that we introduced in <<chapter_intro>>. Like many good ideas in deep learning, it is extremely simple: use a lower learning rate for the early layers of the neural network, and a higher learning rate for the later layers (and especially the randomly added layers). The idea is based on insights developed by Jason Yosinski, who showed in 2014 that when transfer learning different layers of a neural network should train at different speeds:"
"Therefore, fastai by default does something called *discriminative learning rates*. This was originally developed in the ULMFiT approach to NLP transfer learning that we introduced in <<chapter_intro>>. Like many good ideas in deep learning, it is extremely simple: use a lower learning rate for the early layers of the neural network, and a higher learning rate for the later layers (and especially the randomly added layers). The idea is based on insights developed by Jason Yosinski, who showed in 2014 that when transfer learning different layers of a neural network should train at different speeds, as seen in <<yosinski>>."
]
},
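{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview sketch of how this looks in fastai (assuming a `Learner` called `learn`; the numbers are just placeholders): you can pass a Python `slice` object anywhere a learning rate is expected, and roughly speaking the first value is used for the earliest layers, the second value for the final layers, and the layers in between get learning rates spread across that range:\n",
"\n",
"    learn.unfreeze()\n",
"    learn.fit_one_cycle(6, lr_max=slice(1e-6, 1e-4))"
]
},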
{
@ -2199,6 +2256,13 @@
"As you can see, the training loss keeps getting better and better. But notice that eventually the validation loss improvement slows, and sometimes even gets worse! This is the point at which the model is starting to over fit. In particular, the model is becoming overconfident of its predictions. But this does *not* mean that it is getting less accurate, necessarily. Have a look at the table of training results per epoch, and you will often see that the accuracy continues improving, even as the validation loss gets worse. In the end what matters is your accuracy, or more generally your chosen metrics, not the loss. The loss is just the function we've given the computer to help us to optimise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another decision you have to make when training the model is for how long."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -2214,7 +2278,9 @@
"\n",
"On the other hand, you may well see that the metrics you have chosen are really getting worse at the end of training. Remember, it's not just that were looking for the validation loss to get worse, but your actual metrics. Your validation loss will first of all during training get worse because it gets overconfident, and only later will get worse because it is incorrectly memorising the data. We only care in practice about the latter issue. Our loss function is just something, remember, that we used to allow our optimiser to have something it could differentiate and optimise; it's not actually the thing we care about in practice.\n",
"\n",
"Before the days of 1cycle training it was very common to save the model at the end of each epoch, and then select whichever model had the best accuracy, out of all of the models saved in each epoch. This is known as *early stopping*. However, with one cycle training, it is very unlikely to give you the best answer, because those epochs in the middle occur before the learning rate has had a chance to reach the small values, where it can really find the best result. Therefore, if you find that you have overfit, what you should actually do is to retrain your model from scratch, and this time select a total number of epochs based on where your previous best results were found."
"Before the days of 1cycle training it was very common to save the model at the end of each epoch, and then select whichever model had the best accuracy, out of all of the models saved in each epoch. This is known as *early stopping*. However, with one cycle training, it is very unlikely to give you the best answer, because those epochs in the middle occur before the learning rate has had a chance to reach the small values, where it can really find the best result. Therefore, if you find that you have overfit, what you should actually do is to retrain your model from scratch, and this time select a total number of epochs based on where your previous best results were found.\n",
"\n",
"If we've got the time to train for more epochs, we may want to instead use that time to train more parameters, that is use a deeper architecture."
]
},
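{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, if your metrics peaked around epoch 8 of a 12 epoch run, one sketch of that retraining step (assuming the same `dls` and architecture as before, and whichever training call you were using) would simply be:\n",
"\n",
"    learn = cnn_learner(dls, resnet34, metrics=error_rate)\n",
"    learn.fine_tune(8)"
]
},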
{
@ -2228,7 +2294,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we've got the time to train for more epochs, we may want to instead use that time to train more parameters. In general, a model with more parameters can model your data more accurately. (There are lots and lots of caveats to this generalisation, and it depends on the specifics of the architectures you are using, but it is a reasonable rule of thumb for now.) For most of the architectures that we will be seeing in this book you can create larger versions of them by simply adding more layers. However, since we want to use pretrained models, we need to make sure that we choose a number of layers that has been already pretrained for us.\n",
"In general, a model with more parameters can model your data more accurately. (There are lots and lots of caveats to this generalisation, and it depends on the specifics of the architectures you are using, but it is a reasonable rule of thumb for now.) For most of the architectures that we will be seeing in this book you can create larger versions of them by simply adding more layers. However, since we want to use pretrained models, we need to make sure that we choose a number of layers that has been already pretrained for us.\n",
"\n",
"This is why, in practice, architectures tend to come in a small number of variants. For instance, the resnet architecture that we are using in this chapter comes in 18, 34, 50, 101, and 152 layer variants, pre-trained on ImageNet. A larger (more layers and parameters; sometimes described as the \"capacity\" of a model) version of a resnet will always be able to give us a better training loss, but it can suffer more from overfitting, because it has more parameters to over fit with.\n",
"\n",

View File

@ -30,7 +30,9 @@
"source": [
"In the previous chapter we learnt some important practical techniques for training models in practice. Issues like selecting learning rates and the number of epochs are very important to getting good results.\n",
"\n",
"In this chapter we are going to look at other types of computer vision problems, multi-label classification and regression. In the process will study more deeply the output activations, targets, and loss functions in deep learning models."
"In this chapter we are going to look at other types of computer vision problems, multi-label classification and regression. The first one is when you want to predict more than one label per image (or sometimes none at all) and the second one is when your labels are one (or several) number, a quantity instead of a category.\n",
"\n",
"In the process will study more deeply the output activations, targets, and loss functions in deep learning models."
]
},
{
@ -46,9 +48,11 @@
"source": [
"Multi-label classification refers to the problem of identifying the categories of objects in an image, where you may not have exactly one type of object in the image. There may be more than one kind of object, or there may be no objects at all in the classes that you are looking for.\n",
"\n",
"For instance, this would have been a great approach for our bear classifier. One problem with the bear classifier that we rolled out before is that if a user uploaded something that wasn't any kind of bear, the model would still say it was either a grizzly, black, or teddy bear — it had no ability to predict \"not a bear at all\". In fact, after we have completed this chapter, it would be a great exercise for you to go back to your image classifier application, and try to retrain it using the multi-label technique. And then, tested by passing in an image which is not of any of your recognised classes.\n",
"For instance, this would have been a great approach for our bear classifier. One problem with the bear classifier that we rolled out in <<chapter_production>> is that if a user uploaded something that wasn't any kind of bear, the model would still say it was either a grizzly, black, or teddy bear — it had no ability to predict \"not a bear at all\". In fact, after we have completed this chapter, it would be a great exercise for you to go back to your image classifier application, and try to retrain it using the multi-label technique. And then, tested by passing in an image which is not of any of your recognised classes.\n",
"\n",
"In practice, we have not seen many examples of people training multi-label classifiers for this purpose. But we very often see both users and developers complaining about this problem. It appears that this simple solution is not at all widely understood or appreciated. Because in practice it is probably more common to have some images with zero matches or more than one match, we should probably expect in practice that multi-label classifiers are more widely applicable than single label classifiers."
"In practice, we have not seen many examples of people training multi-label classifiers for this purpose. But we very often see both users and developers complaining about this problem. It appears that this simple solution is not at all widely understood or appreciated. Because in practice it is probably more common to have some images with zero matches or more than one match, we should probably expect in practice that multi-label classifiers are more widely applicable than single label classifiers.\n",
"\n",
"First, we'll seee what a multi-label dataset looks like, then we'll explain how to get it ready for our model. Then we'll see that the architecture does not change from last chapter, only the loss function does. Let's start with the data."
]
},
{
@ -285,6 +289,13 @@
"### End sidebar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have seen what the data looks like, let's make it ready for model training."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -300,13 +311,13 @@
"\n",
"As we have seen, PyTorch and fastai have two main classes for representing and accessing a training set or validation set:\n",
"\n",
"- `Dataset`: a collection which returns a tuple of your independent and dependent variable for a single item\n",
"- `DataLoader`: an iterator which provides a stream of mini batches, where each mini batch is a couple of a batch of independent variables and a batch of dependent variables\n",
"- `Dataset`:: a collection which returns a tuple of your independent and dependent variable for a single item\n",
"- `DataLoader`:: an iterator which provides a stream of mini batches, where each mini batch is a couple of a batch of independent variables and a batch of dependent variables\n",
"\n",
"On top of these, fastai provides two classes for bringing your training and validation sets together:\n",
"\n",
"- `Datasets`: an object which contains a training `Dataset` and a validation `Dataset`\n",
"- `DataLoaders`: an object which contains a training `DataLoader` and a validation `DataLoader`\n",
"- `Datasets`:: an object which contains a training `Dataset` and a validation `Dataset`\n",
"- `DataLoaders`:: an object which contains a training `DataLoader` and a validation `DataLoader`\n",
"\n",
"Since a `DataLoader` builds on top of a `Dataset`, and adds additional functionality to it (collating multiple items into a mini batch), its often easiest to start by creating and testing `Datasets`, and then look at `DataLoaders` after thats working.\n",
"\n",
@ -652,6 +663,13 @@
"And remember that if anything goes wrong when you create your `DataLoaders` from your `DataBlock`, or if you want to view exactly what happens with your `DataBlock`, you can use the `summary` method we presented in the last chapter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our data is now ready for training a model. As we will see, nothing is going to change when we create our `Learner`, but behind the scenes, the fastai library will pick a new loss function for us: binary cross entropy."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1172,7 +1190,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we're using the validation set to pick a hyperparameter (the threshold), which is the purpose of the validation set. But sometimes students have expressed their concern that we might be *overfitting* to the validation set, since we're trying lots of values to see which is the best. However, as you see in the plot, changing the threshold in this case results in a smooth curve, so we're clearly not picking some inappropriate outlier. This is a good example of where you have to be careful of the difference between theory (don't try lots of hyperparameter values or you might overfit the validation set) versus practice (if the relationship is smooth, then it's fine to do this)."
"In this case, we're using the validation set to pick a hyperparameter (the threshold), which is the purpose of the validation set. But sometimes students have expressed their concern that we might be *overfitting* to the validation set, since we're trying lots of values to see which is the best. However, as you see in the plot, changing the threshold in this case results in a smooth curve, so we're clearly not picking some inappropriate outlier. This is a good example of where you have to be careful of the difference between theory (don't try lots of hyperparameter values or you might overfit the validation set) versus practice (if the relationship is smooth, then it's fine to do this).\n",
"\n",
"This concludes the part of thic chapter dedicated to multi-label classification. Let's have a look at a regression problem now."
]
},
{
@ -1206,7 +1226,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the [Biwi Kinect Head Pose Dataset](https://data.vision.ee.ethz.ch/cvl/gfanelli/head_pose/head_forest.html#db) for this part. First thing first, let's begin by downloading the dataset as usual."
"We will use the [Biwi Kinect Head Pose Dataset](https://data.vision.ee.ethz.ch/cvl/gfanelli/head_pose/head_forest.html#db) for this section. First thing first, let's begin by downloading the dataset as usual."
]
},
{
@ -1443,7 +1463,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Before doing any modeling, we should look at our data to confirm it seems OK."
"Before doing any modeling, we should look at our data to confirm it seems OK:"
]
},
{
@ -1533,6 +1553,13 @@
"As you can see, we haven't had to use a separate *image regression* application; all we've had to do is label the data, and tell fastai what kind of data the independent and dependent variables represent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's the same for creating our `Learner`. We will use the same function as before, this time with just a new parameter and we will ready to train our model."
]
},
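{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, the new parameter referred to here is `y_range`, which tells fastai the range we expect to see in our targets (here we assume the point coordinates have been rescaled to lie between -1 and 1):\n",
"\n",
"    learn = cnn_learner(dls, resnet18, y_range=(-1, 1))"
]
},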
{
"cell_type": "markdown",
"metadata": {},
@ -1898,31 +1925,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,

View File

@ -24,13 +24,6 @@
"# Collaborative filtering deep dive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to collaborative filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -41,8 +34,27 @@
"\n",
"There is actually a more general class of problems that this approach can solve; not necessarily just things involving users and products. Indeed, for collaborative filtering we more commonly refer to *items*, rather than *products*. Items could be links that you click on, diagnoses that are selected for patients, and so forth.\n",
"\n",
"The key foundational idea is that of *latent factors*. In the above Netflix example, we started with the assumption that you like old action sci-fi movies. But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to their movies table saying which movies are of these types. But there must be some underlying concept of sci-fi, action, and movie age. And these concepts must be relevant for at least some people's movie watching decisions.\n",
"\n",
"The key foundational idea is that of *latent factors*. In the above Netflix example, we started with the assumption that you like old action sci-fi movies. But you never actually told Netflix that you like these kinds of movies. And Netflix never actually needed to add columns to their movies table saying which movies are of these types. But there must be some underlying concept of sci-fi, action, and movie age. And these concepts must be relevant for at least some people's movie watching decisions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's get some data suitable for a collaboratie filtering model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A first look at the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this chapter we are going to work on this movie review problem. We do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset that we can use, called MovieLens. This dataset contains tens of millions of movie rankings (that is a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you're interested, it would be a great learning project to try and replicate this approach on the full 25 million recommendation dataset you can get from their website."
]
},
@ -167,7 +179,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. Here is the same data cross tabulated into a human friendly table:"
"Although this has all the information we need, it is not a particularly helpful way for humans to look at this data. <<movie_xtab>> shows the same data cross tabulated into a human friendly table."
]
},
{
@ -295,6 +307,13 @@
"(user1*casablanca).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we don't know what the latent factors actually are, and we don't know how to score them for each user and movie, we should learn them."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -306,9 +325,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we don't know what the latent factors actually are, and we don't know how to score them for each user and movie, we will learn them. There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
"There is surprisingly little distance from specifying the structure of a model, as we did in the last section, and learning one, since we can just use our general gradient descent approach.\n",
"\n",
"Step one of this approach was to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialise values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, here's what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example:"
"Step one of this approach is to randomly initialise some parameters. These parameters will be a set of latent factors for each user and movie. We will have to decide how many to use. We will discuss how to select this shortly, but for illustrative purposes let's use 5 for now. Because each user will have a set of these factors, and each movie will have a set of these factors, we can show these randomly initialise values right next to the users and movies in our crosstab, and we can then fill in the dot products for each of these combinations in the middle. For example, <<xtab_latent>> shows what it looks like in Microsoft Excel, with the top-left cell formula displayed as an example."
]
},
{
@ -322,13 +341,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Step two of this approach was to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. If for instance the first latent user factor represents how much they like action movies, and the first latent movie factor represents if the movie has a lot of action or not, when the product of those will be particularly high if either the user likes action movie and the movie has a lot of action in it or if the user doesn't like action movie and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't, or the user doesn't like action movies and it is one), the product will be very low.\n",
"Step two of this approach is to calculate our predictions. As we've discussed, we can do this by simply taking the dot product of each movie with each user. If for instance the first latent user factor represents how much they like action movies, and the first latent movie factor represents if the movie has a lot of action or not, when the product of those will be particularly high if either the user likes action movie and the movie has a lot of action in it or if the user doesn't like action movie and the movie doesn't have any action in it. On the other hand, if we have a mismatch (a user loves action movies but the movie isn't, or the user doesn't like action movies and it is one), the product will be very low.\n",
"\n",
"Step three was to calculate our loss. We can use any loss function that we wish; that's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n",
"Step three is to calculate our loss. We can use any loss function that we wish; that's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction.\n",
"\n",
"That's all we need. With this in place, we can optimise our parameters (that is, the latent factors) using stochastic gradient descent, such as to minimise the loss. At each step, the stochastic gradient descent optimiser will calculate the match between each movie and each user using the dot product, and will compare it to the actual rating that each user gave to each movie, and it will then calculate the derivative of this value, and will step the weights by multiplying this by the learning rate. After doing this lots of times, the loss will get better and better, and the recommendations will also get better and better."
]
},
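{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch in plain PyTorch (the input `x` is assumed to be a mini batch with a user index in the first column and a movie index in the second), such a dot product model can be written in just a few lines:\n",
"\n",
"    import torch\n",
"    from torch import nn\n",
"\n",
"    class DotProduct(nn.Module):\n",
"        def __init__(self, n_users, n_movies, n_factors):\n",
"            super().__init__()\n",
"            self.user_factors  = nn.Embedding(n_users,  n_factors)   # one row of factors per user\n",
"            self.movie_factors = nn.Embedding(n_movies, n_factors)   # one row of factors per movie\n",
"        def forward(self, x):\n",
"            users  = self.user_factors(x[:, 0])    # look up the factors for each user in the batch\n",
"            movies = self.movie_factors(x[:, 1])   # look up the factors for each movie in the batch\n",
"            return (users * movies).sum(dim=1)     # dot product = predicted rating"
]
},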
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use the usual `Learner` fit function, we will need to get our data into `DataLoaders`, so let's focus on that now."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -340,7 +366,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We would rather see movie titles than their ids. The table `u.item` contains the coorespondance id to title:"
"When showing the data we would rather see movie titles than their ids. The table `u.item` contains the coorespondance id to title:"
]
},
{
@ -724,7 +750,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which was enough to get our model to work afterward.\n",
"In computer vision, we had a very easy way to get all the information of a pixel through its RGB values: each pixel in a coloured imaged is represented by three numbers. Those three numbers gave us the red-ness, the green-ness and the blue-ness, which is enough to get our model to work afterward.\n",
"\n",
"For the problem at hand, we don't have the same easy way to characterize a user or a movie. There is probably relations with genres: if a given user likes romance, he is likely to put higher scores to romance movie. Or wether the movie is more action-centered vs heavy on dialogue. Or the presence of a specific actor that one use might particularly like. \n",
"\n",
@ -1515,6 +1541,13 @@
"learn.fit_one_cycle(5, 5e-3, wd=0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's have a look at what our model has learned."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1526,7 +1559,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at what our model has learned. It is already useful, in that it can provide us with recommendations for movies for our users — but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:"
"Our model is already useful, in that it can provide us with recommendations for movies for our users — but it is also interesting to see what parameters it has discovered. The easiest to interpret are the biases. Here are the movies with the lowest values in the bias vector:"
]
},
{
@ -1593,7 +1626,7 @@
"source": [
"So, for instance, even if you don't normally enjoy detective movies, you might enjoy LA Confidential!\n",
"\n",
"It is not quite so easy to directly interpret the embedding matrices. There is just too many factors for a human to look at. But there is a technique which can pull out the most important underlying *directions* in such a matrix, called PCA. We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course, Computational Linear Algebra for Coders. Here is what our movies look like based on two of the strongest PCA components:"
"It is not quite so easy to directly interpret the embedding matrices. There is just too many factors for a human to look at. But there is a technique which can pull out the most important underlying *directions* in such a matrix, called *principal component analysis* (PCA). We will not be going into this in detail in this book, because it is not particularly important for you to understand to be a deep learning practitioner, but if you are interested then we suggest you check out the fast.ai course, Computational Linear Algebra for Coders. <<img_pca_movie>> shows what our movies look like based on two of the strongest PCA components."
]
},
{
@ -1618,6 +1651,9 @@
],
"source": [
"#hide_input\n",
"#id img_pca_movie\n",
"#caption Representation of movies on two strongest PCA components\n",
"#alt Representation of movies on two strongest PCA components\n",
"g = ratings.groupby('title')['rating'].count()\n",
"top_movies = g.sort_values(ascending=False).index.values[:1000]\n",
"top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])\n",
@ -1653,7 +1689,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using fastai.collab"
"We defined our model from scratch to teach you what is inside, but you can directly use the fastai library to build it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using fastai.collab"
]
},
{
@ -1800,6 +1843,13 @@
"[dls.classes['title'][i] for i in idxs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An other interesting thing we can do with these learned embeddings is to look at _distance_."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1811,7 +1861,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"An interesting thing we can do with these learned embeddings is to look at _distance_. On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
"On a two-dimensional map we can calculate the distance between two coordinates using the formula of Pythagoras: $\\sqrt{x^{2}+y^{2}}$ (assuming that X and Y are the distances between the coordinates on each axis). For a 50 dimensional embedding we can do exactly the same thing, except that we add up the squares of all 50 of the coordinate distances.\n",
"\n",
"If there were two movies that were nearly identical, then there embedding vectors would also have to be nearly identical, because the users that would like them would be nearly exactly the same. There is a more general idea here: movie similarity can be defined by the similarity of users that like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity. We can use this to find the most similar movie to *Silence of the Lambs*:"
]
@ -1840,6 +1890,13 @@
"dls.classes['title'][idx]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have succesfully trained a model, let's see how to deal whwn we have no data for a new user, to be able to make recommandations to them."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1869,6 +1926,13 @@
"In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify upfront how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It's all about ensuring that there are humans in the loop, that there is careful monitoring, and gradual and thoughtful rollout."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorisation* (PMF). Another approach, which generally works similarly well given the same data, is deep learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1880,8 +1944,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as *probabilistic matrix factorisation* (PMF). Another approach, which generally works similarly well given the same data, is deep learning.\n",
"\n",
"To turn our architecture into a deep learning model the first step is to take the results of the embedding look up, and concatenating those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.\n",
"\n",
"Since we'll be concatenating the embedding matrices, rather than taking their dot product, that means that the two embedding matrices can have different sizes (i.e. different numbers of latent factors). fastai has a function `get_emb_sz` that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:"
@ -2155,6 +2217,13 @@
"Although the results of `EmbeddingNN` are a bit worse than the dot product approach (which shows the power of carefully using an architecture for a domain), it does allow us to do something very important: we can now directly incorporate other user and movie information, time, and other information that may be relevant to the recommendation. That's exactly what `TabularModel` does. In fact, we've now seen that `EmbeddingNN` is just a `TabularModel`, with `n_cont=0` and `out_sz=1`. So we better spend some time learning about `TabularModel`, and how to use it to get great results!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Add a conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},

View File

@ -36,6 +36,20 @@
"# Tabular modelling deep dive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Write introduction mentioning all machine learning techniques introduced in this cahpter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categorical embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -50,13 +64,6 @@
"> jargon: Continuous and categorical variables: \"Continuous variables\" are numerical data, such as \"age\" can be directly fed to the model, since you can add and multiply them directly. \"Categorical variables\" contain a number of discrete levels, such as \"movie id\", for which addition and multiplication don't have meaning (even if they're stored as numbers)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categorical embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -79,7 +86,7 @@
"source": [
"We have already noticed all of these points when we built our collaborative filtering model. We can clearly see that these insights go far beyond just collaborative filtering, however.\n",
"\n",
"The paper also points out that (as we discussed in the last chapter) an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot encoded input layer. They used the following diagram to show this equivalence. Note that \"dense layer\" is another term with the same meaning as \"linear layer\", the one-hot encoding layers represent inputs."
"The paper also points out that (as we discussed in the last chapter) an embedding layer is exactly equivalent to placing an ordinary linear layer after every one-hot encoded input layer. They used the diagram in <<entity_emb>> to show this equivalence. Note that \"dense layer\" is another term with the same meaning as \"linear layer\", the one-hot encoding layers represent inputs."
]
},
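{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can check this equivalence yourself in a couple of lines of PyTorch (a minimal sketch with made-up sizes):\n",
"\n",
"    import torch\n",
"    from torch import nn\n",
"\n",
"    emb = nn.Embedding(10, 3)                  # 10 categories, 3 activations each\n",
"    idx = torch.tensor([4])                    # look up category number 4\n",
"    one_hot = torch.zeros(1, 10)\n",
"    one_hot[0, 4] = 1.\n",
"    assert torch.allclose(emb(idx), one_hot @ emb.weight)   # same result as the embedding lookup"
]
},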
{
@ -97,9 +104,7 @@
"\n",
"Where we analyzed the embedding weights for movie reviews, the authors of the entity embeddings paper analyzed the embedding weights for their sales prediction model. What they found was quite amazing, and illustrates their second key insight. This is that the embedding makes the categorical variables into something which is both continuous and also meaningful.\n",
"\n",
"The images below illustrate these ideas. They are based on the approaches used in the paper, along with some analysis we have added.\n",
"\n",
"On the left in the image below is a plot of the embedding matrix for the possible values of the `State` category. For a categorical variable we call the possible values of the variable its \"levels\" (or \"categories\" or \"classes\"), so here one level is \"Berlin,\" another is \"Hamburg,\" etc.. On the right is a map of Germany. The actual physical locations of the German states were not part of the provided data; yet, the model itself learned where they must be, based only on the behavior of store sales!"
"The images in <<state_emb>> below illustrate these ideas. They are based on the approaches used in the paper, along with some analysis we have added."
]
},
{
@ -113,7 +118,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Do you remember how we talked about *distance* between embeddings? The authors of the paper plotted the distance between embeddings between stores against the actual geographic distance between the stores in practice. They found that they matched very closely!"
"On the left in the image below is a plot of the embedding matrix for the possible values of the `State` category. For a categorical variable we call the possible values of the variable its \"levels\" (or \"categories\" or \"classes\"), so here one level is \"Berlin,\" another is \"Hamburg,\" etc.. On the right is a map of Germany. The actual physical locations of the German states were not part of the provided data; yet, the model itself learned where they must be, based only on the behavior of store sales!\n",
"\n",
"Do you remember how we talked about *distance* between embeddings? The authors of the paper plotted the distance between embeddings between stores against the actual geographic distance between the stores in practice (see <<store_emb>>). They found that they matched very closely!"
]
},
{
@ -127,7 +134,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We've even tried plotted the embeddings for days of the week and months of the year, and found that days and months that are near each other on the calendar ended up close as embeddings too."
"We've even tried plotted the embeddings for days of the week and months of the year, and found that days and months that are near each other on the calendar ended up close as embeddings too, as shown in <<date_emb>>."
]
},
{
@ -147,7 +154,7 @@
"\n",
"Is is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
"\n",
"This concatenation approach is, for instance, how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in this figure from their paper:"
"This concatenation approach is, for instance, how Google do their recommendations on Google Play, as they explained in their paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792), and as shown in <<google_recsys>> from their paper."
]
},
{
@ -161,7 +168,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Interestingly, Google are actually combining both the two approaches we saw in the previous chapter: the *dot product* (which Google call *Cross Product*) and neural network approach."
"Interestingly, Google are actually combining both the two approaches we saw in the previous chapter: the *dot product* (which Google call *Cross Product*) and neural network approach.\n",
"\n",
"But let's pause for a moment. So far, the solution to all of our modelling problems has been: *train a deep learning model*. And indeed, that is a pretty good rule of thumb for complex unstructured data like images, sounds, natural language text, and so forth. Deep learning also works very well for collaborative filtering. But it is not always the best starting point for analysing tabular data."
]
},
{
@ -175,8 +184,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But let's pause for a moment. So far, the solution to all of our modelling problems has been: *train a deep learning model*. And indeed, that is a pretty good rule of thumb for complex unstructured data like images, sounds, natural language text, and so forth. Deep learning also works very well for collaborative filtering. But it is not always the best starting point for analysing tabular data.\n",
"\n",
"Most machine learning courses will throw dozens of different algorithms at you, with a brief technical description of the math behind them and maybe a toy example. You're left confused by the enormous range of techniques shown and have little practical understanding of how to apply them.\n",
"\n",
"The good news is that modern machine learning can be distilled down to a couple of key techniques that are of very wide applicability. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:\n",
@ -213,7 +220,9 @@
"\n",
"Instead, we will be largely relying on a library called scikit-learn (also known as *sklearn*). Scikit-learn is a popular library for creating machine learning models, using approaches that are not covered by deep learning. In addition, we'll need to do some tabular data processing and querying, so we'll want to use the Pandas library. Finally, we'll also need numpy, since that's the main numeric programming library that both sklearn and Pandas rely on.\n",
"\n",
"We don't have time to do a deep dive on all these libraries in this book, so we'll just be touching on some of the main parts of each. For a far more in depth discussion, we strongly suggest Wes McKinney's [Python for Data Analysis, 2nd ed](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=asap_bc?ie=UTF8). Wes is the creator of Pandas, so you can be sure that the information is accurate!"
"We don't have time to do a deep dive on all these libraries in this book, so we'll just be touching on some of the main parts of each. For a far more in depth discussion, we strongly suggest Wes McKinney's [Python for Data Analysis, 2nd ed](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=asap_bc?ie=UTF8). Wes is the creator of Pandas, so you can be sure that the information is accurate!\n",
"\n",
"First, let's gather the data we will use."
]
},
{
@ -227,9 +236,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We will be looking at the Blue Book for Bulldozers Kaggle Competition: \"The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.\"\n",
"For our dataset, we will be looking at the Blue Book for Bulldozers Kaggle Competition: \"The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.\"\n",
"\n",
"This is a very common type of dataset and prediction problem, and similar to what you may see in your project or workplace."
"This is a very common type of dataset and prediction problem, and similar to what you may see in your project or workplace. It's available for download on Kaggle, a website that hosts data science competitions."
]
},
{
@ -360,6 +369,13 @@
"path.ls(file_type='text')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have downloaded our dataset, let's have a look at it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -429,7 +445,7 @@
"source": [
"That's a lot of columns for us to look at! Try looking through the dataset to get a sense of what kind of information is in each one. We'll shortly see how to \"zero in\" on the most interesting bits.\n",
"\n",
"At the point, a good next step is to handle *ordinal columns*. This refers to columns containing strings or similar, but where those strings have a natural ordering. For instance, here are the levels of `ProductSize`:"
"At this point, a good next step is to handle *ordinal columns*. This refers to columns containing strings or similar, but where those strings have a natural ordering. For instance, here are the levels of `ProductSize`:"
]
},
{
@ -505,6 +521,13 @@
"df[dep_var] = np.log(df[dep_var])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to have a look at our first machine learning algorithm for tabular data: decision trees."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -518,6 +541,8 @@
"source": [
"Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data, and on that basis makes a prediction. For instance, the first question might be \"was the equipment manufactured before 1990?\" The second question will depend on the result of the first question (which is why this is a tree); for equipment manufactured for 1990 the second question might be \"was the auction after 2005?\" And so forth…\n",
"\n",
"TK: Adding a figure here might be useful\n",
"\n",
"This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the group of all the other training data items which yielded the same set of answers to the questions. But what good is this? the goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value of this is that we can now assign a prediction value for each of these groups--for regression, we take the target mean of the items in the group.\n",
"\n",
"Let's consider how we find the right questions to ask. Of course, we wouldn't want to have to create all these questions ourselves — that's what computers are for! The basic steps to train a decision tree can be written down very easily:\n",
@ -615,6 +640,13 @@
"' '.join(o for o in df.columns if o.startswith('sale'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a good first step, but we will need to do a bit more cleaning. For this, we will use fastai objects called `TabularPandas` and `TabularProc`."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1259,6 +1291,13 @@
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that all this preprocessing is done, we are ready to create a decision tree."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -7274,9 +7313,7 @@
"source": [
"Much more reasonable!\n",
"\n",
"Building a decision tree is a good way to create a model of our data. It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalises (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).\n",
"\n",
"But, how do we get the best of both worlds?"
"One thing that may have struck you during this process is that we have not done anything special to handle categorical variables."
]
},
{
@ -7290,8 +7327,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing that may have struck you during this process is that we have not done anything special to handle categorical variables.\n",
"\n",
"This is unlike the situation with deep learning networks, where we one-hot encoded the variables and then fed them to an embedding layer. There, the embedding layer helped to discover the meaning of these variable levels, since each level of a categorical variable does not have a meaning on its own (unless we manually specified an ordering using pandas). So how can these untreated categorical variables do anything useful in a decision tree? For instance, how could something like a product code be used?\n",
"\n",
"The short answer is: it just works! Think about a situation where there is one product code that is far more expensive at auction than any other one. In that case, any binary split will result in that one product code being in some group, and that group will be more expensive than the other group. Therefore, our simple decision tree building algorithm will choose that split. Later during training, the algorithm will be able to further split the subgroup which now contains the expensive product code. Over time, the tree will home in on that one expensive product.\n",
@ -7308,6 +7343,15 @@
"> : \"The standard approach for nominal predictors is to consider all (2^(k 1) 1) 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k 1 splits have to be considered for a nominal predictor with k categories.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building a decision tree is a good way to create a model of our data. It is very flexible, since it can clearly handle nonlinear relationships and interactions between variables. But we can see there is a fundamental compromise between how well it generalises (which we can achieve by creating small trees) and how accurate it is on the training set (which we can achieve by using large trees).\n",
"\n",
"But, how do we get the best of both worlds? A solution is to use random forests."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -7319,14 +7363,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In 1994 Berkeley professor Leo Breiman, one year after his retirement, published a small technical report called *Bagging Predictors*, which turned out to be one of the most influential ideas in modern machine learning. The report began:\n",
"In 1994 Berkeley professor Leo Breiman, one year after his retirement, published a small technical report called [*Bagging Predictors*](https://www.stat.berkeley.edu/~breiman/bagging.pdf), which turned out to be one of the most influential ideas in modern machine learning. The report began:\n",
"\n",
"> : \"Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions... The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests… show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.\"\n",
"\n",
@ -7341,7 +7378,7 @@
"\n",
"In 2001 Leo Breiman went on to demonstrate that this approach to building models, when applied to decision tree building algorithms, was particularly powerful. He went even further than just randomly choosing rows for each model's training, but also randomly selected from a subset of columns when choosing each split in each decision tree. He called this method the *random forest*. Today it is, perhaps, the most widely used and practically important machine learning method.\n",
"\n",
"In essence a random forest is a model that averages the predictions of large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters. \"Bagging\" is a particular approach to \"ensembling\", which refers to any approach that combines the results of multiple models together."
"In essence a random forest is a model that averages the predictions of large number of decision trees, which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters. \"Bagging\" is a particular approach to \"ensembling\", which refers to any approach that combines the results of multiple models together. Let's get started on creating our own random forest!"
]
},
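{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we use sklearn's ready-made implementation, here is a minimal sketch of the bagging idea itself (assuming `xs`, `y`, and `valid_xs` are the processed bulldozer data from earlier); a real random forest also randomly restricts the columns considered at each split, and is far more efficient:\n",
"\n",
"    import numpy as np\n",
"    from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"    def bagged_predict(xs, y, valid_xs, n_trees=20):\n",
"        preds = []\n",
"        for _ in range(n_trees):\n",
"            idxs = np.random.choice(len(xs), len(xs))   # bootstrap sample: draw rows with replacement\n",
"            t = DecisionTreeRegressor(min_samples_leaf=25)\n",
"            t.fit(xs.iloc[idxs], y.iloc[idxs])          # each tree sees a different random sample\n",
"            preds.append(t.predict(valid_xs))\n",
"        return np.stack(preds).mean(0)                  # average the predictions of all the trees"
]
},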
{
@ -7425,12 +7462,14 @@
"source": [
"One of the most important properties of random forests is that they aren't very sensitive to the hyperparameter choices, such as `max_features`. You can set `n_estimators` to as high a number as you have time to train -- the more trees, the more accurate they will be. `max_samples` can often be left at its default, unless you have over 200,000 data points, in which case setting it to 200,000 will make it train faster, with little impact on accuracy. `max_features=0.5`, and `min_samples_leaf=4` both tend to work well, although sklearn's defaults work well too.\n",
"\n",
"The sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of different `max_features` choices, with increasing numbers of trees. In the plot, the blue plot line uses the fewest features and the green line uses the most, since it uses all the features. As you can see, the models with the lowest error result from using a subset of features but with a larger number of trees:"
"The sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of different `max_features` choices, with increasing numbers of trees. In the plot, the blue plot line uses the fewest features and the green line uses the most, since it uses all the features. As you can see in <<max_features>>, the models with the lowest error result from using a subset of features but with a larger number of trees."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"hide_input": true
},
"source": [
"<img alt=\"sklearn max_features chart\" width=\"500\" caption=\"Error based on max features and # trees\" src=\"images/sklearn_features.png\" id=\"max_features\"/>"
]
@ -7514,6 +7553,13 @@
"plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we've shown, we can't tell. However, random forests have a very clever trick called *out-of-bag (OOB) error* which can handle this (and more!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -7525,8 +7571,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we've shown, we can't tell. However, random forests have a very clever trick called *out-of-bag (OOB) error* which can handle this (and more!)\n",
"\n",
"The idea is to calculate error on the training set, but only include the trees in the calculation of a row's error where that row was *not* included in training that tree. This allows us to see whether the model is over-fitting, without needing a separate validation set.\n",
"\n",
"This also has the benefit of allowing us to see whether our model generalizes, even if we have such a small amount of data that we want to avoid removing items to create a validation set. The OOB predictions are available in the `oob_prediction_` attribute. Note that we compare to *training* labels, since this is being calculated on the OOB trees on the training set.\n",
@ -7568,6 +7612,13 @@
"> question: Make a list of reasons why this model's validation set error on this dataset might be worse than the OOB error. How could you test your hypotheses?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is one way to interpret our model predictions, let's focus on more of those now."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -7878,7 +7929,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It seems likely that we could use just a subset of the columns and still get good results. Let's try just keeping those with a feature importance greater than 0.005."
"It seems likely that we could use just a subset of the columns by removing the variables of low-importance and still get good results. Let's try just keeping those with a feature importance greater than 0.005."
]
},
{
@ -8013,6 +8064,13 @@
"plot_fi(rf_feat_importance(m, xs_imp));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings. Let's try to remove redundent features. "
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -8024,7 +8082,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings. Let's try to remove redundent features. We can do this using \"*hierarchical cluster analysis*\", which find pairs of columns that are the most similar, and replaces them with the average of those columns. It does this recursively, until there's just one column. It plots a \"*dendrogram*\", which shows which columns were combined in which order, and how far away they are from each other.\n",
"We can do this using \"*hierarchical cluster analysis*\", which find pairs of columns that are the most similar, and replaces them with the average of those columns. It does this recursively, until there's just one column. It plots a \"*dendrogram*\", which shows which columns were combined in which order, and how far away they are from each other.\n",
"\n",
"Here's how it looks:"
]
@ -8245,6 +8303,13 @@
"m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tk add transition"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -8410,6 +8475,13 @@
"Thinking back to our bear detector, this mirrors the advice that we also provided there — it is often a good idea to build a model first, and then do your data cleaning, rather than vice versa. The model can help you identify potentially problematic data issues."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK Add transition"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -8443,7 +8515,7 @@
"- Which columns are effectively redundant with each other, for purposes of prediction?\n",
"- How do predictions vary, as we vary these columns?\n",
"\n",
"We've handled three of these already--so just one to go, which is: \"For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\" To answer this question, we need to use the `treeinterpreter` library. We'll also use the `waterfallcharts` library to draw the chart of the results.\n",
"We've handled four of these already--so just one to go, which is: \"For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?\" To answer this question, we need to use the `treeinterpreter` library. We'll also use the `waterfallcharts` library to draw the chart of the results.\n",
"\n",
" !pip install treeinterpreter\n",
" !pip install waterfallcharts"
@ -8549,6 +8621,13 @@
"This kind of information is most useful in production, rather than during model development. You can use it to provide useful information to users of your data product about the underlying reasoning behind the predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK add a transition"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -8556,6 +8635,13 @@
"## Extrapolation and neural networks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK add an introduction here before stacking header"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -8713,7 +8799,7 @@
"\n",
"Remember, a random forest is just the average of the predictions of a number of trees. And a tree simply predicts the average value of the rows in a leaf. Therefore, a tree and a random forest can never predict values outside of the range of the training data. This is particularly problematic for data where there is a trend over time, such as inflation, and you wish to make predictions for a future time.. Your predictions will be systematically to low.\n",
"\n",
"But the problem is actually more general than just time variables. Random forests are not able to extrapolate outside of the types of data you have seen, in a more general sense. It can only ever average previously seen observations."
"But the problem is actually more general than just time variables. Random forests are not able to extrapolate outside of the types of data you have seen, in a more general sense. That's why we need to make sure our validation set does not contain out of domain data."
]
},
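{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick standalone illustration of this point (a toy example, unrelated to our dataset), consider fitting a random forest to a simple upward trend and then predicting beyond the training range:\n",
"\n",
"    import numpy as np\n",
"    from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"    x = np.linspace(0, 10, 100).reshape(-1, 1)\n",
"    y_lin = 2 * x.ravel()                        # a simple upward trend\n",
"    m = RandomForestRegressor().fit(x[:80], y_lin[:80])\n",
"    m.predict(x[80:])[:5]                        # plateaus near the largest target seen in training\n",
"\n",
"No matter how far past the training range we ask it to predict, the forest's output stays close to the largest value it saw during training."
]
},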
{
@ -8960,7 +9046,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's a tiny bit better, which shows that you shouldn't always just use your entire dataset; sometimes a subset can be better."
"It's a tiny bit better, which shows that you shouldn't always just use your entire dataset; sometimes a subset can be better.\n",
"\n",
"Let's see if using a neural network helps."
]
},
{
@ -9384,6 +9472,13 @@
"learn.save('nn')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK add transition of make this an aside"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -9402,6 +9497,13 @@
"That means that really all the work is happening in `TabularModel`, so take a look at the source for that now. With the exception of the `BatchNorm1d` and `Dropout` layers (which we'll be learning about shortly) you now have the knowledge required to understand this whole class. Take a look at the discussion of `EmbeddingNN` at the end of the last chapter. Recall that it passed `n_cont=0` to `TabularModel`. We now can see why that was: because there are zero continuous variables (in fastai the `n_` prefix means \"number of\", and `cont` is an abbreviation for \"continuous\")."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tk add transition"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -9466,6 +9568,13 @@
"In fact, this result is better than any score shown on the Kaggle leaderboard. This is not directly comparable, however, because the Kaggle leaderboard uses a separate dataset that we do not have access to. Kaggle does not allow us to submit to this old competition, to find out how we would have gone, so we have no way to directly compare. But our results certainly look very encouraging!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is another important approach to ensembling, called *boosting*, where we add models, instead of averaging them. "
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -9479,7 +9588,7 @@
"source": [
"So far our approach to ensembling has been to use *bagging*, which involves combining many models together by averaging them, where each model is trained on a different data subset. When this is applied to decision trees, this is called a *random forest*.\n",
"\n",
"There is another important approach to ensembling, called *boosting*, where we add models, instead of averaging them. Here is how it works:\n",
"Here is how boosting works:\n",
"\n",
"- Train a small model which under fits your dataset\n",
"- Calculate the predictions in the training set for this model\n",
@ -9496,6 +9605,13 @@
"We are not going to go into details as to how to train a gradient boosted tree ensemble here, because the field is moving rapidly, and any guidance we give will almost certainly be outdated by the time you read this! As we write this, sklearn has just added a `HistGradientBoostingRegressor` class, which provides excellent performance. There are many hyperparameters to tweak for this class, and for all gradient boosted tree methods we have seen. Unlike random forests, gradient boosted trees are extremely sensitive to the choices of these hyperparameters. So in practice, most people will use a loop which tries a range of different hyperparameters, to find which works best."
]
},
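{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a flavor of the API (just a sketch with arbitrary hyperparameters, not a tuned model; depending on your sklearn version you may first need `from sklearn.experimental import enable_hist_gradient_boosting`), fitting such a model on our prepared data might look like:\n",
"\n",
"    from sklearn.ensemble import HistGradientBoostingRegressor\n",
"\n",
"    gbm = HistGradientBoostingRegressor(max_iter=100, learning_rate=0.1)\n",
"    gbm.fit(xs_final, y)\n",
"    m_rmse(gbm, valid_xs_final, valid_y)\n",
"\n",
"In practice you would search over those hyperparameters, since boosted trees are far more sensitive to them than random forests are."
]
},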
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK add transition. Or maybe make this an aside?"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -9507,7 +9623,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The abstract of the entity embedding paper states: \"*the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead*\". It includes this very interesting table:"
"The abstract of the entity embedding paper we mentioned at the start of this chapter states: \"*the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead*\". It includes this very interesting table:"
]
},
{
@ -9532,14 +9648,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Our advice for tabular modeling"
"## Conclusion: our advice for tabular modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have two approaches to tabular modelling: decision tree ensembles, and neural networks. And we have mentioned two different decision tree ensembles: random forests, and gradient boosting machines. Each is very effective, but each also has compromises:\n",
"We have dicussed two approaches to tabular modelling: decision tree ensembles, and neural networks. And we have mentioned two different decision tree ensembles: random forests, and gradient boosting machines. Each is very effective, but each also has compromises:\n",
"\n",
"**Random forests** are the easiest to train, because they are extremely resilient to hyperparameter choices, and require very little preprocessing. They are very fast to train, and should not overfit, if you have enough trees. But, they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods\n",
"\n",

View File

@ -53,7 +53,7 @@
"\n",
"One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine tune the (sequence-based) language model prior to fine tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. So that is 100,000 movie reviews altogether (since there are also 25,000 labelled reviews in the training set, and 25,000 in the validation set). We can use all 100,000 of these reviews to fine tune the pretrained language model — this will result in a language model that is particularly good at predicting the next word of a movie review. In contrast, the pretrained model was trained only on Wikipedia articles.\n",
"\n",
"The [ULMFiT paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of language model fine tuning, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarised in this figure:"
"The [ULMFiT paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of language model fine tuning, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarised in <<ulmfit_process>>."
]
},
{
@ -67,15 +67,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Language model fine tuning"
"Now have a think about how you would turn this language modelling problem into a neural network, given what you have learned so far. We'll be able to use concepts that we've seen in the last two chapters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A *language model* is a model that learns to predict the next word of a sentence. Have a think about how you would turn this language modelling problem into a neural network, given what you have learned so far. We'll be able to use concepts that we've seen in the last two chapters.\n",
"\n",
"It's not at all obvious how we're going to use what we've learned so far to build a language model. Sentences can be different lengths, and documents can be very long. So, how can we predict the next word of a sentence using a neural network? Let's find out!\n",
"\n",
"We've already seen how categorical variables can be used as independent variables for a neural network. The approach we took for a single categorical variable was to:\n",
@ -96,21 +101,14 @@
"source": [
"Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:\n",
"\n",
"- **Tokenization**: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
"- **Numericalization**: make a list of all of the unique words which appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
"- **Language model data loader** creation: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable which is offset from the independent variable buy one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
"- **Language model** creation: we need a special kind of model which does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network*. We will get to the details of this in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
"- **Tokenization**:: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
"- **Numericalization**:: make a list of all of the unique words which appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
"- **Language model data loader** creation:: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable which is offset from the independent variable buy one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
"- **Language model** creation:: we need a special kind of model which does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network*. We will get to the details of this in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
"\n",
"Let's take a look at how each step works in detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing text with fastai"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -126,9 +124,9 @@
"\n",
"Because there is no one correct answer to these questions, there is no one approach to tokenization. Each element of the list created by the tokenisation process is called a *token*. There are three main approaches:\n",
"\n",
"- **Word-based**: split a sentence on spaces, as well as applying language specific rules to try to separate parts of meaning, even when there are no spaces, such as turning \"don't\" into \"do n't\". Generally, punctuation marks are also split into separate tokens\n",
"- **Subword based**: split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokeniser as \"o c ca sion\"\n",
"- **Character-based**: split a sentence into its individual characters.\n",
"- **Word-based**:: split a sentence on spaces, as well as applying language specific rules to try to separate parts of meaning, even when there are no spaces, such as turning \"don't\" into \"do n't\". Generally, punctuation marks are also split into separate tokens\n",
"- **Subword based**:: split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokeniser as \"o c ca sion\"\n",
"- **Character-based**:: split a sentence into its individual characters.\n",
"\n",
"We'll be looking at word and subword tokenization here, and we'll leave character-based tokenization for you to implement in the questionnaire at the end of this chapter."
]
@ -349,14 +347,14 @@
"\n",
"Here is a brief summary of what each does:\n",
"\n",
"- `fix_html`: replace special HTML characters by a readable version (IMDb reviwes have quite a few of them for instance) ;\n",
"- `replace_rep`: replace any character repeated three times or more by a special token for repetition (xxrep), the number of times it's repeated, then the character ;\n",
"- `replace_wrep`: replace any word repeated three times or more by a special token for word repetition (xxwrep), the number of times it's repeated, then the word ;\n",
"- `spec_add_spaces`: add spaces around / and # ;\n",
"- `rm_useless_spaces`: remove all repetitions of the space character ;\n",
"- `replace_all_caps`: lowercase a word written in all caps and adds a special token for all caps (xxcap) in front of it ;\n",
"- `replace_maj`: lowercase a capitalized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
"- `lowercase`: lowercase all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)."
"- `fix_html`:: replace special HTML characters by a readable version (IMDb reviwes have quite a few of them for instance) ;\n",
"- `replace_rep`:: replace any character repeated three times or more by a special token for repetition (xxrep), the number of times it's repeated, then the character ;\n",
"- `replace_wrep`:: replace any word repeated three times or more by a special token for word repetition (xxwrep), the number of times it's repeated, then the word ;\n",
"- `spec_add_spaces`:: add spaces around / and # ;\n",
"- `rm_useless_spaces`:: remove all repetitions of the space character ;\n",
"- `replace_all_caps`:: lowercase a word written in all caps and adds a special token for all caps (xxcap) in front of it ;\n",
"- `replace_maj`:: lowercase a capitalized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
"- `lowercase`:: lowercase all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)."
]
},
{
@ -386,6 +384,13 @@
"coll_repr(tkn('&copy; Fast.ai www.fast.ai/INDEX'), 31)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's have a look at how subword tokenization would work."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -557,6 +562,13 @@
"Overall, subword tokenization provides a way to easily scale between character tokenization (i.e. use a small subword vocab) and word tokenization (i.e. use a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other \"languages\" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once our texts have been split into tokens, we need to convert them to numbers."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -709,6 +721,13 @@
"' '.join(num.vocab[o] for o in nums)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have numbers, we need to put them in batches for our model."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1239,6 +1258,13 @@
"' '.join(num.vocab[o] for o in y[0][:20])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes all the preprocessing steps we need to apply to our data. We are now ready to train out text classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1246,6 +1272,15 @@
"## Training a text classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen at the beginning of this chapter to train a state-of-the-art text classifier using transfer learning will take two steps: first we need to fine-tune our langauge model pretrained on Wikipedia to the corpus of IMDb reviews, then we can use that model to train a classifier.\n",
"\n",
"As usual, let's start with assemblng our data."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1332,6 +1367,13 @@
"dls_lm.show_batch(max_n=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that our data is ready, we can fine-tune the pretrained language model."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1423,6 +1465,13 @@
"learn.fit_one_cycle(1, 2e-2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model takes a while to train, so it's a good opportunity to talk about saving intermediary results. "
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1434,7 +1483,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This model takes a while to train, so it's a good opportunity to talk about saving intermediary results. You can easily save the state of your model like so:"
"You can easily save the state of your model like so:"
]
},
{
@ -2074,7 +2123,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We reach 94.3% accuracy, which was state-of-the-art just three years ago. By training a model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, fine-tuning a much bigger model and using expensive data augmentation (translating sentences in another language and back, using another model for translation)."
"We reach 94.3% accuracy, which was state-of-the-art just three years ago. By training a model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, fine-tuning a much bigger model and using expensive data augmentation (translating sentences in another language and back, using another model for translation).\n",
"\n",
"Using a pretrained model let us build a fine-tuned language model that was pretty powerful, to either generate fake reviews or help classify them. It is good to remember that this technology can also be used for malign purposes."
]
},
{
@ -2088,14 +2139,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analysed the comments that were sent to the FCC in the USA regarding a 2017 proposal to repeal net neutrality. In his article [More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6)\", he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Madlibs-style mail merge. Below, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature:"
"Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analysed the comments that were sent to the FCC in the USA regarding a 2017 proposal to repeal net neutrality. In his article [More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6)\", he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Madlibs-style mail merge. In <<disinformation>>, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image16.png\" width=\"700\">"
"<img src=\"images/ethics/image16.png\" width=\"700\" id=\"disinformation\" caption=\"Comments received during the neutral neutrality debate\">"
]
},
{
@ -2104,7 +2155,7 @@
"source": [
"Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
"\n",
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create and compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create and compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
]
},
{
@ -2120,14 +2171,14 @@
"source": [
"In this case, the use of the algorithm is being done explicitly. But imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithms to gradually develop followings and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine it getting to a point where the vast majority of discourse online was from bots, and nobody would have any idea that it was happening.\n",
"\n",
"We are already starting to see examples of machine learning being used to generate identities. For example, here is the LinkedIn profile for Katie Jones:"
"We are already starting to see examples of machine learning being used to generate identities. For example, <<katie_jones>> shows us a LinkedIn profile for Katie Jones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image15.jpeg\" width=\"400\">"
"<img src=\"images/ethics/image15.jpeg\" width=\"400\" id=\"katie_jones\" caption=\"Katie Jones' LinkedIn profile\">"
]
},
{
@ -2139,6 +2190,13 @@
"Many people assume or hope that algorithms will come to our defence here. The hope is that we will develop classification algorithms which can automatically recognise auto generated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TK make transition if it stays in this chapter or conclusion of NLP chapter"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -2169,7 +2227,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -2187,7 +2245,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -2230,7 +2288,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -2247,7 +2305,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -2256,7 +2314,7 @@
"(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...]"
]
},
"execution_count": 5,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -2270,7 +2328,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -2279,7 +2337,7 @@
"tensor([ 2, 8, 20, 27, 11, 88, 18, 53, 3286, 45])"
]
},
"execution_count": 6,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -2300,7 +2358,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -2309,7 +2367,7 @@
"tensor([ 2, 8, 20, 27, 11, 88, 18, 53, 3286, 45])"
]
},
"execution_count": 8,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -2330,7 +2388,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -2339,7 +2397,7 @@
"(#10) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at']"
]
},
"execution_count": 9,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -2357,7 +2415,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -2366,7 +2424,7 @@
"'xxbos xxmaj this movie , which i just discovered at'"
]
},
"execution_count": 10,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -2434,7 +2492,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"metadata": {},
"outputs": [
{
@ -2443,7 +2501,7 @@
"3"
]
},
"execution_count": 11,
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
@ -2463,7 +2521,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@ -3204,31 +3262,6 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,