Copyedits 10-12

This commit is contained in:
Sylvain Gugger 2020-05-15 15:04:52 -07:00
parent a482265f72
commit a3599602ce
6 changed files with 396 additions and 393 deletions

@ -29,9 +29,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify those reviews. One thing is a bit different from the transfer learning we have in computer vision: the pretrained model was not trained on the same task as the model we used to classify reviews.\n",
"In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify reviews. One thing is a bit different from the transfer learning we have in computer vision: the pretrained model was not trained on the same task as the model we used to classify reviews.\n",
"\n",
"What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to get an understanding of the English-- or other--language. Self-supervised learning can also be used in other domains; for instance, see [Self-supervised learning and computer vision](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pre-training a model used for transfer learning."
"What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to develop an understanding of the English (or other) language. Self-supervised learning can also be used in other domains; for instance, see [\"Self-Supervised Learning and Computer Vision\"](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pretraining a model used for transfer learning."
]
},
{
@ -45,15 +45,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better: the Wikipedia English is slightly different from the IMDb English. So instead of jumping directly to the classifier, we could finetune our pretrained language model to the IMDb corpus and *then* use that as the base for our classifier.\n",
"The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better. The Wikipedia English is slightly different from the IMDb English, so instead of jumping directly to the classifier, we could fine-tune our pretrained language model to the IMDb corpus and then use *that* as the base for our classifier.\n",
"\n",
"Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targetting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of IMDb, there will be lots of names of movie directors and actors, and often a less formal style of language that seen in Wikipedia.\n",
"Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targeting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of the IMDb datset, there will be lots of names of movie directors and actors, and often a less formal style of language than that seen in Wikipedia.\n",
"\n",
"We saw that with fastai, we can download a pre-trained language model for English, and use it to get state-of-the-art results for NLP classification. (We expect pre-trained models in many more languages to be available soon they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?\n",
"We already saw that with fastai, we can download a pretrained English language model and use it to get state-of-the-art results for NLP classification. (We expect pretrained models in many more languages to be available soon—they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?\n",
"\n",
"One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine tune the (sequence-based) language model prior to fine tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. So that is 100,000 movie reviews altogether (since there are also 25,000 labelled reviews in the training set, and 25,000 in the validation set). We can use all 100,000 of these reviews to fine tune the pretrained language model — this will result in a language model that is particularly good at predicting the next word of a movie review. In contrast, the pretrained model was trained only on Wikipedia articles.\n",
"One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine-tune the (sequence-based) language model prior to fine-tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. Since there are 25,000 labeled reviews in the training set and 25,000 in the validation set, that makes 100,000 movie reviews altogether. We can use all of these reviews to fine-tune the pretrained language model, which was trained only on Wikipedia articles; this will result in a language model that is particularly good at predicting the next word of a movie review.\n",
"\n",
"The [ULMFiT paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of language model fine tuning, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarised in <<ulmfit_process>>."
"This is known as the Universal Language Model Fine-tuning (ULMFit) approqch. The [paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of fine-tuning of the language model, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarized in <<ulmfit_process>>."
]
},
{
@ -67,7 +67,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now have a think about how you would turn this language modelling problem into a neural network, given what you have learned so far. We'll be able to use concepts that we've seen in the last two chapters."
"We'll now explore how to apply a neural network to this language modeling problem, using the concepts introduced in the last two chapters. But before reading further, pause and think about how *you* would approach this."
]
},
{
@ -85,14 +85,14 @@
"\n",
"We've already seen how categorical variables can be used as independent variables for a neural network. The approach we took for a single categorical variable was to:\n",
"\n",
"1. Make a list of all possible levels of that categorical variable (let us call this list the *vocab*)\n",
"1. Replace each level with its index in the vocab\n",
"1. Create an embedding matrix for this containing a row for each level (i..e, for each item of the vocab)\n",
"1. Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step two; this is equivalent to, but faster and more efficient, than a matrix which takes as input one-hot encoded vectors representing the indexes)\n",
"1. Make a list of all possible levels of that categorical variable (we'll call this list the *vocab*).\n",
"1. Replace each level with its index in the vocab.\n",
"1. Create an embedding matrix for this containing a row for each level (i.e., for each item of the vocab).\n",
"1. Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step 2; this is equivalent to but faster and more efficient than a matrix that takes as input one-hot-encoded vectors representing the indexes.)\n",
"\n",
"We can do nearly the same thing with text! What is new is the idea of a sequence. First we concatenate all of the documents in our dataset into one big long string and split it into words, giving us a very long list of words. Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second last, and our dependent variable would be the sequence of words starting with the second word and ending with the last word. \n",
"We can do nearly the same thing with text! What is new is the idea of a sequence. First we concatenate all of the documents in our dataset into one big long string and split it into words, giving us a very long list of words (or \"tokens\"). Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second to last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word. \n",
"\n",
"When creating our vocab, we will have very common words that will probably be in the vocabulary of our pretrained model, but we will also have new words specific to our corpus (cinematographic terms, or actor names for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of this pretrained model; but for new words, we won't have anything, so we will just initialize the corresponding row with a random vector."
"Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus (cinematographic terms or actors names, for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model; but for new words we won't have anything, so we will just initialize the corresponding row with a random vector."
]
},
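As a rough illustration of how the independent and dependent variables line up, here is a small sketch with a made-up token list (this is not the fastai pipeline, just the idea of the one-token offset):

```python
# Toy token stream standing in for our concatenated corpus
tokens = ["xxbos", "this", "movie", "was", "great", "!"]

xs = tokens[:-1]   # first word through the second to last
ys = tokens[1:]    # second word through the last

for x, y in zip(xs, ys):
    print(f"{x:>6} -> {y}")
```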
{
@ -101,10 +101,10 @@
"source": [
"Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:\n",
"\n",
"- **Tokenization**:: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
"- **Numericalization**:: make a list of all of the unique words which appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
"- **Language model data loader** creation:: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable which is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
"- **Language model** creation:: we need a special kind of model which does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network*. We will get to the details of this in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
"- Tokenization:: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
"- Numericalization:: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
"- Language model data loader creation:: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
"- Language model creation:: We need a special kind of model that does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network* (RNN). We will get to the details of these RNNs in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
"\n",
"Let's take a look at how each step works in detail."
]
@ -120,13 +120,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When we said, *convert the text into a list of words*, we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like \"don't\"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Poland where we can create really long words from many, many pieces? What about languages like Japanese and Chinese which don't use bases at all, and don't really have a well-defined idea of *word*?\n",
"When we said \"convert the text into a list of words,\" we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like \"don't\"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Polish where we can create really long words from many, many pieces? What about languages like Japanese and Chinese that don't use bases at all, and don't really have a well-defined idea of *word*?\n",
"\n",
"Because there is no one correct answer to these questions, there is no one approach to tokenization. Each element of the list created by the tokenisation process is called a *token*. There are three main approaches:\n",
"Because there is no one correct answer to these questions, there is no one approach to tokenization. There are three main approaches:\n",
"\n",
"- **Word-based**:: split a sentence on spaces, as well as applying language specific rules to try to separate parts of meaning, even when there are no spaces, such as turning \"don't\" into \"do n't\". Generally, punctuation marks are also split into separate tokens\n",
"- **Subword based**:: split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokeniser as \"o c ca sion\"\n",
"- **Character-based**:: split a sentence into its individual characters.\n",
"- Word-based:: Split a sentence on spaces, as well as applying language-specific rules to try to separate parts of meaning even when there are no spaces (such as turning \"don't\" into \"do n't\"). Generally, punctuation marks are also split into separate tokens.\n",
"- Subword based:: Split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokenized as \"o c ca sion.\"\n",
"- Character-based:: Split a sentence into its individual characters.\n",
"\n",
"We'll be looking at word and subword tokenization here, and we'll leave character-based tokenization for you to implement in the questionnaire at the end of this chapter."
]
@ -135,7 +135,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: token: one element of a list created by the tokenisation process. It could be a word, part of a word (a _subword_), or a single character."
"> jargon: token: One element of a list created by the tokenization process. It could be a word, part of a word (a _subword_), or a single character."
]
},
{
@ -149,7 +149,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than providing its own tokenizers, fastai instead provides a consistent interface to a range of tokenisers in external libraries. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time, so the defaults that fastai uses change too. However, the API and options shouldn't change too much, since fastai tries to maintain a consistent API even as the underlying technology changes.\n",
"Rather than providing its own tokenizers, fastai instead provides a consistent interface to a range of tokenizers in external libraries. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time, so the defaults that fastai uses change too. However, the API and options shouldn't change too much, since fastai tries to maintain a consistent API even as the underlying technology changes.\n",
"\n",
"Let's try it out with the IMDb dataset that we used in <<chapter_intro>>:"
]
@ -211,9 +211,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we write this book, the default *English word tokenizer* for fastai uses a library called *spaCy*. This uses a sophisticated rules engine that has special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not always be Spacy, depending when you're reading this).\n",
"As we write this book, the default English word tokenizer for fastai uses a library called *spaCy*. It has a sophisticated rules engine with special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not necessarily be spaCy, depending when you're reading this).\n",
"\n",
"Let's try it out. We'll use fastai's `coll_repr(collection,n)` function to display the results; this displays the first `n` items of `collection`, along with the full size--it's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
"Let's try it out. We'll use fastai's `coll_repr(collection, n)` function to display the results. This displays the first *`n`* items of *`collection`*, along with the full size--it's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
]
},
{
@ -239,7 +239,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense--these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. spaCy handles these for us, for instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
"As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense; these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. Fortunately, spaCy handles these pretty well for us--for instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
]
},
{
@ -291,19 +291,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are now some tokens added that start with the characters \"xx\", which is not a common word prefix in English. These are *special tokens*.\n",
"Notice that there are now some tokens that start with the characters \"xx\", which is not a common word prefix in English. These are *special tokens*.\n",
"\n",
"For example, the first item in the list, \"xxbos\", is a special token that indicates the start of a new text (\"BOS\" is a standard NLP acronym which means \"beginning of stream\"). By recognizing this start token, the model will be able to learn it needs to \"forget\" what was said previously and focus on upcoming words.\n",
"For example, the first item in the list, `xxbos`, is a special token that indicates the start of a new text (\"BOS\" is a standard NLP acronym that means \"beginning of stream\"). By recognizing this start token, the model will be able to learn it needs to \"forget\" what was said previously and focus on upcoming words.\n",
"\n",
"These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognise the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenised language, a language which is designed to be easy for a model to learn.\n",
"These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized language--a language that is designed to be easy for a model to learn.\n",
"\n",
"For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalised word will be replaced with a special capitalisation token, followed by the lower case version of the word. This way, the embedding matrix only needs the lower case version of the words, saving compute and memory, but can still learn the concept of capitalisation.\n",
"For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
"\n",
"Here are some of the main special tokens you'll see:\n",
"\n",
"- xxbos:: indicates the beginning of a text (here a review)\n",
"- xxmaj:: indicates the next word begins with a capital (since we lower-cased everything)\n",
"- xxunk:: indicates the next word is unknown\n",
"- `xxbos`:: Indicates the beginning of a text (here, a review)\n",
"- `xxmaj`:: Indicates the next word begins with a capital (since we lowercased everything)\n",
"- `xxunk`:: Indicates the next word is unknown\n",
"\n",
"To see the rules that were used, you can check the default rules:"
]
@ -339,7 +339,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As always, you can look at the source code of each of them in a notebook by typing\n",
"As always, you can look at the source code of each of them in a notebook by typing:\n",
"\n",
"```\n",
"??replace_rep\n",
@ -347,14 +347,14 @@
"\n",
"Here is a brief summary of what each does:\n",
"\n",
"- `fix_html`:: replace special HTML characters by a readable version (IMDb reviews have quite a few of them for instance) ;\n",
"- `replace_rep`:: replace any character repeated three times or more by a special token for repetition (xxrep), the number of times it's repeated, then the character ;\n",
"- `replace_wrep`:: replace any word repeated three times or more by a special token for word repetition (xxwrep), the number of times it's repeated, then the word ;\n",
"- `spec_add_spaces`:: add spaces around / and # ;\n",
"- `rm_useless_spaces`:: remove all repetitions of the space character ;\n",
"- `replace_all_caps`:: lowercase a word written in all caps and adds a special token for all caps (xxcap) in front of it ;\n",
"- `replace_maj`:: lowercase a capitalized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
"- `lowercase`:: lowercase all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)."
"- `fix_html`:: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)\n",
"- `replace_rep`:: Replaces any character repeated three times or more with a special token for repetition (`xxrep`), the number of times it's repeated, then the character\n",
"- `replace_wrep`:: Replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word\n",
"- `spec_add_spaces`:: Adds spaces around / and #\n",
"- `rm_useless_spaces`:: Removes all repetitions of the space character\n",
"- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxcap`) in front of it\n",
"- `replace_maj`:: Lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it\n",
"- `lowercase`:: Lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)"
]
},
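As a rough illustration of what a rule like `replace_rep` does, here is a simplified sketch written from scratch with a regular expression; it is not fastai's actual implementation, just the idea:

```python
import re

def replace_rep_sketch(text):
    "Replace any character repeated three times or more with 'xxrep', the count, then the character."
    def _sub(m):
        char, reps = m.group(1), len(m.group(0))
        return f" xxrep {reps} {char} "
    return re.sub(r"(\S)\1\1+", _sub, text)

print(replace_rep_sketch("This movie was great!!!!"))
# 'This movie was great xxrep 4 ! '
```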
{
@ -388,7 +388,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's have a look at how subword tokenization would work."
"Now let's take a look at how subword tokenization would work."
]
},
{
@ -402,14 +402,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to the *word tokenization* approach seen in the last section, another popular tokenization method is *subword tokenization*. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 (which means \"My name is Jeremy Howard\" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a \"word\". There are also languages, like Turkish and Hungarian, which can add many bits together without spaces, to create very long words which include a lot of separate pieces of information.\n",
"In addition to the *word tokenization* approach seen in the last section, another popular tokenization method is *subword tokenization*. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 (\"My name is Jeremy Howard\" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a \"word.\" There are also languages, like Turkish and Hungarian, that can add many subwords together without spaces, creating very long words that include a lot of separate pieces of information.\n",
"\n",
"To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:\n",
"\n",
"1. Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.\n",
"2. Tokenize the corpus using this vocab of *subword units*.\n",
"\n",
"Let's look at an example. For our corpus, we'll use the first 2000 movie reviews:"
"Let's look at an example. For our corpus, we'll use the first 2,000 movie reviews:"
]
},
{
@ -425,7 +425,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to \"train\" it. That is, we need to have it read our documents, and find the common sequences of characters, to create the vocab. This is done with `setup`. As we'll see shortly, `setup` is a special fastai method that is called automatically in our usual data processing pipelines. Since we're doing everything manually at the moment, however, we have to call it ourselves. Here's a function that does these steps for a given vocab size, and shows an example output:"
"We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to \"train\" it. That is, we need to have it read our documents and find the common sequences of characters to create the vocab. This is done with `setup`. As we'll see shortly, `setup` is a special fastai method that is called automatically in our usual data processing pipelines. Since we're doing everything manually at the moment, however, we have to call it ourselves. Here's a function that does these steps for a given vocab size, and shows an example output:"
]
},
{
@ -557,16 +557,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Picking a subword vocab size represents a compromise: a larger vocab means more fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.\n",
"Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.\n",
"\n",
"Overall, subword tokenization provides a way to easily scale between character tokenization (i.e. use a small subword vocab) and word tokenization (i.e. use a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other \"languages\" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)"
"Overall, subword tokenization provides a way to easily scale between character tokenization (i.e., using a small subword vocab) and word tokenization (i.e., using a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other \"languages\" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once our texts have been split into tokens, we need to convert them to numbers."
"Once our texts have been split into tokens, we need to convert them to numbers. We'll look at that next."
]
},
{
@ -580,12 +580,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Numericalization is the process of mapping tokens to integers. It's basically identical to the steps necessary to create a `Category` variable, such as the dependent variable of digits in MNIST:\n",
"*Numericalization* is the process of mapping tokens to integers. The steps are basically identical to those necessary to create a `Category` variable, such as the dependent variable of digits in MNIST:\n",
"\n",
"1. Make a list of all possible levels of that categorical variable (the *vocab*)\n",
"1. Replace each level with its index in the vocab\n",
"1. Make a list of all possible levels of that categorical variable (the vocab).\n",
"1. Replace each level with its index in the vocab.\n",
"\n",
"We'll take a look at this in action on the word-tokenized text we saw earlier:"
"Let's take a look at this in action on the word-tokenized text we saw earlier:"
]
},
{
@ -610,7 +610,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the `vocab`. That means we'll need our tokenized corpus first. Since tokenization takes a while, it's done in parallel by fastai; but for this manual walk-thru, we'll use a small subset:"
"Just like with `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the vocab. That means we'll need our tokenized corpus first. Since tokenization takes a while, it's done in parallel by fastai; but for this manual walkthrough, we'll use a small subset:"
]
},
{
@ -638,7 +638,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can pass this to `setup` to create our `vocab`:"
"We can pass this to `setup` to create our vocab:"
]
},
{
@ -667,11 +667,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults to `Numericalize` are `min_freq=3,max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the most common 60000 with a special *unknown word* token `xxunk`. This is useful to avoid having an overly large embedding matrix, since that can slow down training, use up too much memory, and can also mean that there isn't enough data to train useful representations for rare words. However, this last issue is better handled by setting `min_freq`; the default `min_freq=3` means that any word appearing less than three times is replaced with `xxunk`.\n",
"Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults to `Numericalize` are `min_freq=3,max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the most common 60,000 with a special *unknown word* token, `xxunk`. This is useful to avoid having an overly large embedding matrix, since that can slow down training and use up too much memory, and can also mean that there isn't enough data to train useful representations for rare words. However, this last issue is better handled by setting `min_freq`; the default `min_freq=3` means that any word appearing less than three times is replaced with `xxunk`.\n",
"\n",
"Fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the `vocab` parameter.\n",
"fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the `vocab` parameter.\n",
"\n",
"Once we've created our `Numericalize` object, we can use it as if it's a function:"
"Once we've created our `Numericalize` object, we can use it as if it were a function:"
]
},
{
@ -732,16 +732,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Putting Our Texts Into Batches for a Language Model"
"### Putting Our Texts into Batches for a Language Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When dealing with images, we needed to resize them all to the same height and width before grouping them together in a mini-batch so they could stack together efficiently in a single tensor. Here it's going to be a little different, because one cannot simply resize text to a desired length. Also, we want our language model to read text in order, so that it can efficiently predict what the next word is. All the difficulty of a language model loader is that each new batch should begin precisely where the previous left off.\n",
"When dealing with images, we needed to resize them all to the same height and width before grouping them together in a mini-batch so they could stack together efficiently in a single tensor. Here it's going to be a little different, because one cannot simply resize text to a desired length. Also, we want our language model to read text in order, so that it can efficiently predict what the next word is. this means that each new batch should begin precisely where the previous one left off.\n",
"\n",
"Let's start with an example and imagine our text is the following:\n",
"Suppose we have the following text:\n",
"\n",
"> : In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\n",
"\n",
@ -749,14 +749,14 @@
"\n",
"> : xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \\n xxmaj then we will study how we build a language model and train it for a while .\n",
"\n",
"We have separated the 90 tokens by spaces. Let's say we want a batch size of 6, then we need to break this text in 6 contiguous parts of length 15:"
"We now have 90 tokens, separated by spaces. Let's say we want a batch size of 6. We need to break this text into 6 contiguous parts of length 15:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
"hide_input": false
},
"outputs": [
{
@ -878,7 +878,7 @@
}
],
"source": [
"#hide\n",
"#hide_input\n",
"stream = \"In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\"\n",
"tokens = tkn(stream)\n",
"bs,seq_len = 6,15\n",
@ -887,24 +887,15 @@
"display(HTML(df.to_html(index=False,header=None)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"hide_input": false
},
"source": [
"<img alt=\"TK: add title\" width=\"800\" caption=\"TK: add title\" id=\"TK: add it\" src=\"images/att_00071.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a perfect world, we could then give this one batch to our model. But that doesn't work, because this would very likely not fit in our GPU memory (here we have 90 tokens, but all the IMDb reviews together give several millions of tokens).\n",
"In a perfect world, we could then give this one batch to our model. But that approach doesn't scale, because outside of this toy example it's unlikely that a signle batch containing all the texts would fit in our GPU memory (here we have 90 tokens, but all the IMDb reviews together give several million).\n",
"\n",
"So in fact we will need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains state in order so that it remembers what it read previously when predicting what comes next. \n",
"So, we need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains a state so that it remembers what it read previously when predicting what comes next. \n",
"\n",
"Going back to our previous example with 6 batches of length 15, if we chose sequence length of 5, that would mean we first feed the following array:"
"Going back to our previous example with 6 batches of length 15, if we chose a sequence length of 5, that would mean we first feed the following array:"
]
},
{
@ -984,7 +975,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then"
"Then this one:"
]
},
{
@ -1064,7 +1055,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally"
"And finally:"
]
},
{
@ -1144,13 +1135,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Going back to our dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order in which the inputs come, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside, otherwise the text would not make sense anymore).\n",
"Going back to our movie reviews dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order of the inputs, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside them, or the texts would not make sense anymore!).\n",
"\n",
"We will then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...) because we want the model to read continuous rows of text (as in our example above). This is why each text has been added a `xxbos` token during preprocessing, so that the model knows when it reads the stream we are beginning a new entry.\n",
"We then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...), because we want the model to read continuous rows of text (as in the preceding example). An `xxbos` token is added at the start of each during preprocessing, so that the model knows when it reads the stream when a new entry is beginning.\n",
"\n",
"So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
"So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length we picked.\n",
"\n",
"This is all done behind the scenes by the fastai library when we create a `LMDataLoader`. We can create one by first applying our `Numericalize` object to the tokenized texts:"
"This is all done behind the scenes by the fastai library when we create an `LMDataLoader`. We do this by first applying our `Numericalize` object to the tokenized texts:"
]
},
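Here is a rough sketch of that recap in plain PyTorch, with toy numbers standing in for a real numericalized stream (`LMDataLoader` handles the shuffling and all the edge cases for us):

```python
import torch

stream = torch.arange(91)            # toy stand-in for a numericalized token stream
bs, seq_len = 6, 5
n = (len(stream) - 1) // bs          # length of each mini-stream (15 here)

xs = stream[:bs*n].view(bs, n)       # 6 contiguous mini-streams
ys = stream[1:bs*n+1].view(bs, n)    # the same mini-streams, offset by one token

# Read the mini-streams in order, seq_len tokens at a time
for i in range(0, n, seq_len):
    x, y = xs[:, i:i+seq_len], ys[:, i:i+seq_len]
    print(x.shape, y.shape)          # torch.Size([6, 5]) torch.Size([6, 5])
```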
{
@ -1166,7 +1157,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and then passing that to `LMDataLoader`:"
"and then passing that to `LMDataLoader`:"
]
},
{
@ -1210,7 +1201,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and then looking at the first row of the independent variable, which should be the start of the first text:"
"and then looking at the first row of the independent variable, which should be the start of the first text:"
]
},
{
@ -1237,7 +1228,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and the first row of the dependent variable, which is the same thing offset by one token:"
"The dependent variable is the same thing offset by one token:"
]
},
{
@ -1264,7 +1255,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes all the preprocessing steps we need to apply to our data. We are now ready to train out text classifier."
"This concludes all the preprocessing steps we need to apply to our data. We are now ready to train our text classifier."
]
},
{
@ -1278,7 +1269,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen at the beginning of this chapter to train a state-of-the-art text classifier using transfer learning will take two steps: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, then we can use that model to train a classifier.\n",
"As we saw at the beginning of this chapter, thare are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.\n",
"\n",
"As usual, let's start with assembling our data."
]
@ -1317,12 +1308,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing that's different to previous types used in `DataBlock` is that we're not just using the class directly (i.e. `TextBlock(...)`, but instead are calling a *class method*. A class method is a Python method which, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read every document and tokenize it to get the vocab); to be as efficient as possible fastai does things such as: \n",
"One thing that's different to previous types we've used in `DataBlock` is that we're not just using the class directly (i.e., `TextBlock(...)`, but instead are calling a *class method*. A class method is a Python method that, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible preforms a few optimizations: \n",
"\n",
"- Save the tokenized documents in a temporary folder, so fastai doesn't have to tokenize more than once\n",
"- Runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs.\n",
"- It save the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once\n",
"- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs\n",
"\n",
"Therefore we need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing--that's what `from_folder` does.\n",
"We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing--that's what `from_folder` does.\n",
"\n",
"`show_batch` then works in the usual way:"
]
@ -1380,14 +1371,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine Tuning the Language Model"
"### Fine-Tuning the Language Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For converting the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modelling. Then those embeddings are fed in a *Recurrent Neural Network* (RNN), using an architecture called *AWD_LSTM* (we will show how to write such a model from scratch in <<chapter_nlp_dive>>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:"
"To convert the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modeling. Then we'll feed those embeddings into a *recurrent neural network* (RNN), using an architecture called *AWD-LSTM* (we will show you how to write such a model from scratch in <<chapter_nlp_dive>>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:"
]
},
{
@ -1405,9 +1396,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The loss function used by default is cross entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). A metric often used in NLP for language models is called *perplexity*. It is the exponential of the loss (i.e. `torch.exp(cross_entropy)`). We will also add accuracy, to see how many times our model is right when trying to predict the next word, since cross entropy (as we've seen) is both hard to interpret, and also tells you more about the model's confidence, rather than just its accuracy\n",
"The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The 8perplexity* metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.\n",
"\n",
"The grey first arrow in our overall picture has been done for us and made available as a pretrained model in fastai; we've now built the `DataLoaders` and `Learner` for the second stage, and we're ready to fine-tune it!"
"Going back to our ULMFit process diagramThe grey first arrow in our overall picture has been done for us and made available as a pretrained model in fastai; we've now built the `DataLoaders` and `Learner` for the second stage, and we're ready to fine-tune it!"
]
},
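To make the relationship concrete, here is a tiny sketch (with a made-up loss value) of how perplexity is derived from the cross-entropy loss:

```python
import torch

cross_entropy = torch.tensor(4.0)      # hypothetical validation loss
perplexity = torch.exp(cross_entropy)  # perplexity is just exp(loss)
print(perplexity)                      # tensor(54.5982)
```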
{
@ -1421,7 +1412,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll just use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (which is the only part of the model that contains randomly initialized weights--i.e. embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
"It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll use `fit_one_cycle`. Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights--i.e., embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
]
},
{
@ -1501,7 +1492,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It will create a file in `learn.path/models/` named \"1epoch.pth\". If you want to load your model in another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:"
"This will create a file in `learn.path/models/` named *1epoch.pth*. If you want to load your model in another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:"
]
},
{
@ -1517,7 +1508,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then finetune the model after unfreezing:"
"We can then train the model for more epochs after unfreezing:"
]
},
{
@ -1656,14 +1647,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> jargon: Encoder: The model not including the task-specific final layer(s). It means much the same thing as *body* when applied to vision CNNs, but tends to be more used for NLP and generative models."
"> jargon: Encoder: The model not including the task-specific final layer(s). This term means much the same thing as _body_ when applied to vision CNNs, but \"encoder\" tends to be more used for NLP and generative models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This completes the second stage of the text classification process: fine-tuning the language model. We can now fine tune this language model using the IMDb sentiment labels."
"This completes the second stage of the text classification process: fine-tuning the language model. We can now use it to fine-tune a clasifier using the IMDb sentiment labels."
]
},
{
@ -1677,7 +1668,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Before using this to fine-tune a classifier on the review, we can use our model to generate random reviews: since it's trained to guess what the next word of the sentence is, we can use it to write new reviews:"
"Before we move on to fine-tuning the classifier, let's quickly try something different: using our model to generate random reviews. Since it's trained to guess what the next word of the sentence is, we can use the model to write new reviews:"
]
},
{
@ -1736,9 +1727,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so you don't get exactly the same review twice. Our model doesn't have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalized properly (I is just transformed to i with our rules -- they require two characters or more to consider a word is capitalized -- so it's normal to see it lowercased), and is using consistent tense. The general review make sense at first glance, and it's only if you read carefully you can notice something is a bit off. Not bad for a model trained in a couple of hours! \n",
"As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so we don't get exactly the same review twice. Our model doesn't have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalizes properly (*I* is just transformed to *i* because our rules require two characters or more to consider a word as capitalized, so it's normal to see it lowercased) and is using consistent tense. The general review makes sense at first glance, and it's only if you read carefully that you can notice something is a bit off. Not bad for a model trained in a couple of hours! \n",
"\n",
"Our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that."
"But our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that."
]
},
{
@ -1752,9 +1743,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now moving from language model fine tuning, to classifier fine tuning. To re-cap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label--in the case of IMDb, it's the sentiment of a document.\n",
"We're now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label--in the case of IMDb, it's the sentiment of a document.\n",
"\n",
"This means that the structure of our `DataBlock` for NLP classification will look very familiar; it's actually nearly the same as we've seen for the many image classification datasets we've worked with:"
"This means that the structure of our `DataBlock` for NLP classification will look very familiar. It's actually nearly the same as we've seen for the many image classification datasets we've worked with:"
]
},
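A sketch of such a `DataBlock`, assuming `path` points at the IMDb data and `dls_lm` is the language model `DataLoaders` built earlier in the chapter:

```python
from fastai.text.all import *

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, bs=128, seq_len=72)
```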
{
@ -1829,14 +1820,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the `DataBlock` definition above, every piece is familiar from previous data blocks we've built, with two important exceptions:\n",
"Looking at the `DataBlock` definition, every piece is familiar from previous data blocks we've built, with two important exceptions:\n",
"\n",
"- `TextBlock.from_folder` no longer has the `is_lm=True` parameter, and\n",
"- `TextBlock.from_folder` no longer has the `is_lm=True` parameter.\n",
"- We pass the `vocab` we created for the language model fine-tuning.\n",
"\n",
"The reason that we pass the vocab of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.\n",
"The reason that we pass the `vocab` of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.\n",
"\n",
"By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a minibatch. Let's see with an example, by trying to create a minibatch containing the first 10 documents. First we'll numericalize them:"
"By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a mini-batch. Let's see with an example, by trying to create a mini-batch containing the first 10 documents. First we'll numericalize them:"
]
},
{
@ -1879,11 +1870,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and that a single tensor has a fixed shape (i.e. it has some particular length on every axis, and all items must be consistent). This should look a bit familiar: we had the same issue with images. In that case, we use cropping, padding, and/or squishing to make everything the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!) You can't really \"squish\" a document. So that leaves padding!\n",
"Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape (i.e., it has some particular length on every axis, and all items must be consistent). This should sound familiar: we had the same issue with images. In that case, we used cropping, padding, and/or squishing to make all the inputs the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!). You can't really \"squish\" a document. So that leaves padding!\n",
"\n",
"We will expand the shortest texts to make them all the same size. To do this, we use a special token that will be ignored by our model. This is called *padding* (just like in vision). Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend of be of similar lengths. We won't make every batch, therefore, the same size, but will instead use the size of the largest document in each batch. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, although as we write these words, no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon however, so have a look on the book website, where we'll add information about this if and when it's working well.)\n",
"We will expand the shortest texts to make them all the same size. To do this, we use a special padding token that will be ignored by our model. Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend of be of similar lengths. We won't pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, but at the time of writing no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon, however, so keep an eye on the book's website; we'll add information about this as soon as we have it working well.)\n",
"\n",
"The padding and sorting is automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)\n",
"The sorting and padding are automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)\n",
"\n",
"We can now create a model to classify our texts:"
]
@ -1902,7 +1893,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded."
"The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded:"
]
},
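A sketch of those two steps; `dls_clas` is the classifier `DataLoaders` from above, and the encoder filename has to match whatever was passed to `save_encoder` earlier:

```python
learn = text_classifier_learner(dls_clas, AWD_LSTM,
                                drop_mult=0.5, metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')
```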
{
@ -1918,14 +1909,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine Tuning the Classifier"
"### Fine-Tuning the Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference."
"The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference:"
]
},
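The whole schedule looks roughly like this; the learning rates and the `2.6**4` divisor follow the ULMFiT recipe and should be treated as reasonable defaults rather than requirements:

```python
# Train only the randomly initialized head first
learn.fit_one_cycle(1, 2e-2)

# Unfreeze the last two parameter groups and train a little more
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

# Then the last three groups...
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

# ...and finally the whole model
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
```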
{
@ -1973,7 +1964,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In just one epoch we get the same result as our training in <<chapter_intro>>, not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:"
"In just one epoch we get the same result as our training in <<chapter_intro>>: not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:"
]
},
{
@ -2127,9 +2118,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We reach 94.3% accuracy, which was state-of-the-art just three years ago. By training a model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, fine-tuning a much bigger model and using expensive data augmentation (translating sentences in another language and back, using another model for translation).\n",
"We reached 94.3% accuracy, which was state-of-the-art performance just three years ago. By training another model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, by fine-tuning a much bigger model and using expensive data augmentation techniques (translating sentences in another language and back, using another model for translation).\n",
"\n",
"Using a pretrained model let us build a fine-tuned language model that was pretty powerful, to either generate fake reviews or help classify them. It is good to remember that this technology can also be used for malign purposes."
"Using a pretrained model let us build a fine-tuned language model that was pretty powerful, to either generate fake reviews or help classify them. This is exciting stuff, but it's good to remember that this technology can also be used for malign purposes."
]
},
{
@ -2143,14 +2134,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analysed the comments that were sent to the FCC in the USA regarding a 2017 proposal to repeal net neutrality. In his article [More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6)\", he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Madlibs-style mail merge. In <<disinformation>>, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature."
"Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analyzed the comments that were sent to the US Federal Communications Commission (FCC) regarding a 2017 proposal to repeal net neutrality. In his article [\"More than a Million Pro-Repeal Net Neutrality Comments Were Likely Faked\"](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6), he reports how he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Mad Libs-style mail merge. In <<disinformation>>, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image16.png\" width=\"700\" id=\"disinformation\" caption=\"Comments received during the neutral neutrality debate\">"
"<img src=\"images/ethics/image16.png\" width=\"700\" id=\"disinformation\" caption=\"Comments received by the FCC during the net neutrality debate\">"
]
},
{
@ -2159,7 +2150,7 @@
"source": [
"Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
"\n",
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the tools at your disposal necessary to create a compelling language model. That is, something that can generate context appropriate believable text. It won't necessarily be perfectly accurate or correct, but it will be believable. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about. Take a look at this conversation on Reddit shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending:"
"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the necessary tools at your disposal to create a compelling language model--that is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dilaogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending."
]
},
{
@ -2173,25 +2164,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, the use of the algorithm is being done explicitly. But imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithms to gradually develop followings and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine it getting to a point where the vast majority of discourse online was from bots, and nobody would have any idea that it was happening.\n",
"In this case, it was explicitly said that an algorithm was used, but imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithm to gradually develop followers and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine getting to a point where the vast majority of discourse online was from bots, and nobody would have any idea that it was happening.\n",
"\n",
"We are already starting to see examples of machine learning being used to generate identities. For example, <<katie_jones>> shows us a LinkedIn profile for Katie Jones."
"We are already starting to see examples of machine learning being used to generate identities. For example, <<katie_jones>> shows a LinkedIn profile for Katie Jones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ethics/image15.jpeg\" width=\"400\" id=\"katie_jones\" caption=\"Katie Jones' LinkedIn profile\">"
"<img src=\"images/ethics/image15.jpeg\" width=\"400\" id=\"katie_jones\" caption=\"Katie Jones's LinkedIn profile\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see is auto generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Centre for Strategic and International Studies.\n",
"Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see was auto-generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Center for Strategic and International Studies.\n",
"\n",
"Many people assume or hope that algorithms will come to our defence here. The hope is that we will develop classification algorithms which can automatically recognise auto generated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
"Many people assume or hope that algorithms will come to our defense here--that we will develop classification algorithms that can automatically recognise autogenerated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
]
},
{
@ -2205,9 +2196,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have looked at the last application covered out of the box by the fastai library: text. We have seen two types of models: language models that can generate texts, and a classifier that determines if a review is positive or negative. To build a state-of-the art classifier, we used a pretrained language model, fine-tuned it to the corpus of our task, then used its body with a new head to do the classification.\n",
"In this chapter we exlored the last application covered out of the box by the fastai library: text. We saw two types of models: language models that can generate texts, and a classifier that determines if a review is positive or negative. To build a state-of-the art classifier, we used a pretrained language model, fine-tuned it to the corpus of our task, then used its body (the encoder) with a new head to do the classification.\n",
"\n",
"Before we end this section, we will know look at how the fastai library can help you assemble your data on your specific problems."
"Before we end this section, we'll take a look at how the fastai library can help you assemble your data for your specific problems."
]
},
{
@ -2221,28 +2212,28 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is self-supervised learning?\n",
"1. What is a language model?\n",
"1. Why is a language model considered self-supervised learning?\n",
"1. What is \"self-supervised learning\"?\n",
"1. What is a \"language model\"?\n",
"1. Why is a language model considered self-supervised?\n",
"1. What are self-supervised models usually used for?\n",
"1. Why do we fine-tune language models?\n",
"1. What are the three steps to create a state-of-the-art text classifier?\n",
"1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?\n",
"1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?\n",
"1. What are the three steps to prepare your data for a language model?\n",
"1. What is tokenization? Why do we need it?\n",
"1. What is \"tokenization\"? Why do we need it?\n",
"1. Name three different approaches to tokenization.\n",
"1. What is 'xxbos'?\n",
"1. List 4 rules that fastai applies to text during tokenization.\n",
"1. Why are repeated characters replaced with a token showing the number of repetitions, and the character that's repeated?\n",
"1. What is numericalization?\n",
"1. What is `xxbos`?\n",
"1. List four rules that fastai applies to text during tokenization.\n",
"1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?\n",
"1. What is \"numericalization\"?\n",
"1. Why might there be words that are replaced with the \"unknown word\" token?\n",
"1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)\n",
"1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)\n",
"1. Why do we need padding for text classification? Why don't we need it for language modeling?\n",
"1. What does an embedding matrix for NLP contain? What is its shape?\n",
"1. What is perplexity?\n",
"1. What is \"perplexity\"?\n",
"1. Why do we have to pass the vocabulary of the language model to the classifier data block?\n",
"1. What is gradual unfreezing?\n",
"1. Why is text generation always likely to be ahead of automatic identification of machine generated texts?"
"1. What is \"gradual unfreezing\"?\n",
"1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?"
]
},
{
@ -2256,9 +2247,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. See what you can learn about language models and disinformation. What are the best language models today? Have a look at some of their outputs. Do you find them convincing? How could a bad actor best use this to create conflict and uncertainty?\n",
"1. Given the limitation that models are unlikely to be able to consistently recognise machine generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leveraged deep learning?"
"1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?\n",
"1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@ -22,14 +22,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Munging With fastai's mid-Level API"
"# Data Munging with fastai's Mid-Level API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have seen what `Tokenizer` or a `Numericalize` do to a collection of texts, and how they're used inside the data block API, which handles those transforms for us directly using the `TextBlock`. But what if we want to only apply one of those transforms, either to see intermediate results or because we have already tokenized texts. More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case? For this, we need to use fastai's *mid-level API* for processing data. The data block API is built on top of that layer, so it will allow you to do everything the data block API does, and much much more."
"We have seen what `Tokenizer` and `Numericalize` do to a collection of texts, and how they're used inside the data block API, which handles those transforms for us directly using the `TextBlock`. But what if we want to only apply one of those transforms, either to see intermediate results or because we have already tokenized texts? More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case? For this, we need to use fastai's *mid-level API* for processing data. The data block API is built on top of that layer, so it will allow you to do everything the data block API does, and much much more."
]
},
{
@ -43,7 +43,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The fastai library is built on a *layered API*. At the very top layer, there are *applications* that allow us to train a model in five lines of codes, as we saw in <<chapter_intro>>. In the case of creating `DataLoaders` for a text classifier, for instance, we used the line:"
"The fastai library is built on a *layered API*. In the very top layer there are *applications* that allow us to train a model in five lines of codes, as we saw in <<chapter_intro>>. In the case of creating `DataLoaders` for a text classifier, for instance, we used the line:"
]
},
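As a reminder, that application-level call looks something like the following, downloading and extracting the IMDb data if needed:

```python
from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
```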
{
@ -83,14 +83,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"But it's sometimes not flexible enough. For debugging purposes for instance, we might need to apply just parts of the transforms that come with this data block. Or, we might want to create `DataLoaders` for some application that isn't directly supported by fastai. In this section, we'll dig into the pieces that are used inside fastai to implement the data block API. By understanding these pieces, you'll be able to leverage the power and flexibility of this mid-tier API."
"But it's sometimes not flexible enough. For debugging purposes, for instance, we might need to apply just parts of the transforms that come with this data block. Or we might want to create a `DataLoaders` for some application that isn't directly supported by fastai. In this section, we'll dig into the pieces that are used inside fastai to implement the data block API. Understanding these will enable you to leverage the power and flexibility of this mid-tier API."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> note: The mid-level API in general does not only contain functionality for creating `DataLoaders`. It also has the *callback* system , which allows us to customize the training loop any way we like, and the *general optimizer*. Both will be covered in <<chapter_accel_sgd>>."
"> note: Mid-Level API: The mid-level API does not only contain functionality for creating `DataLoaders`. It also has the _callback_ system, which allows us to customize the training loop any way we like, and the _general optimizer_. Both will be covered in <<chapter_accel_sgd>>."
]
},
{
@ -151,7 +151,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And how to numericalize, including automatically creating the vocab for our corpus:"
"and how to numericalize, including automatically creating the vocab for our corpus:"
]
},
{
@ -181,7 +181,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The classes also have a *decode* method. For instance, `Numericalize.decode` gives us back the string tokens:"
"The classes also have a `decode` method. For instance, `Numericalize.decode` gives us back the string tokens:"
]
},
{
@ -208,7 +208,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...and `Tokenizer.decode` turns this back into a single string (it may not, however, be exactly the same as the original string; this depends on whether the tokenizer is *reversible*, which the default word tokenizer is not at the time we're writing this book):"
"and `Tokenizer.decode` turns this back into a single string (it may not, however, be exactly the same as the original string; this depends on whether the tokenizer is *reversible*, which the default word tokenizer is not at the time we're writing this book):"
]
},
{
@ -237,13 +237,13 @@
"source": [
"`decode` is used by fastai's `show_batch` and `show_results`, as well as some other inference methods, to convert predictions and mini-batches into a human-understandable representation.\n",
"\n",
"For each of `tok` or `num` above, we created an object, called the setup method (which trains the tokenizer if needed for `tok` and creates the vocab for `num`), applied it to our raw texts (by calling the object as a function), and then finally decoded it back to an understandable representation. These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them. This is the `Transform` class. Both `Tokenize` and `Numericalize` are `Transform`s.\n",
"For each of `tok` or `num` in the preceding example, we created an object, called the `setup` method (which trains the tokenizer if needed for `tok` and creates the vocab for `num`), applied it to our raw texts (by calling the object as a function), and then finally decoded the result back to an understandable representation. These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them. This is the `Transform` class. Both `Tokenize` and `Numericalize` are `Transform`s.\n",
"\n",
"In general, a `Transform` is an object that behaves like a function, has an optional *setup* that will initialize some inner state (like the vocab inside `num` for instance), and has an optional *decode* that will reverse the function (this reversal may not be perfect, as we saw above for `tok`).\n",
"In general, a `Transform` is an object that behaves like a function and has an optional `setup` nethod that will initialize some inner state (like the vocab inside `num`) and an optional `decode` that will reverse the function (this reversal may not be perfect, as we saw with `tok`).\n",
"\n",
"A good example of `decode` is found in the `Normalize` transform that we saw in <<chapter_sizing_and_tta>>: to be able to plot the images its `decode` method undoes the normalization (i.e. it multiplies by the std and adds back the mean). On the other hand, data augmentation transforms do not have a `decode` method, since we want to show the effects on images, to make sure the data augmentation is working as we want.\n",
"A good example of `decode` is found in the `Normalize` transform that we saw in <<chapter_sizing_and_tta>>: to be able to plot the images its `decode` method undoes the normalization (i.e., it multiplies by the standard deviation and adds back the mean). On the other hand, data augmentation transforms do not have a `decode` method, since we want to show the effects on images to make sure the data augmentation is working as we want.\n",
"\n",
"A special behavior of `Transform`s is that they always get applied over tuples: in general, our data is always a tuple `(input,target)` (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as `Resize`, we don't want to resize the tuple, but resize the input (if applicable) and the target (if applicable). It's the same for the batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.\n",
"A special behavior of `Transform`s is that they always get applied over tuples. In general, our data is always a tuple `(input,target)` (sometimes with more than one input or more than one target). When applying a transform on an item like this, such as `Resize`, we don't want to resize the tuple as a whole; instead, we want to resize the input (if applicable) and the target (if applicable) separately. It's the same for batch transforms that do data augmentation: when the input is an image and the target is a segmentation mask, the transform needs to be applied (the same way) to the input and the target.\n",
"\n",
"We can see this behavior if we pass a tuple of texts to `tok`:"
]
@ -280,7 +280,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to write a custom transform to apply to your data, the easiest way is to write a function. As you can see in this example, a `Transform` will only be applied to a matching type, if a type is provided (otherwise it will always be applied). In the following code, the `:int` in the function signature means that the `f` only gets applied to ints. That's why `tfm(2.0)` returns `2.0`, but `tfm(2)` returns `3` here:"
"If you want to write a custom transform to apply to your data, the easiest way is to write a function. As you can see in this example, a `Transform` will only be applied to a matching type, if a type is provided (otherwise it will always be applied). In the following code, the `:int` in the function signature means that `f` only gets applied to `int`s. That's why `tfm(2.0)` returns `2.0`, but `tfm(2)` returns `3` here:"
]
},
{
@ -309,9 +309,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here `f` is converted to a `Transform` with no `setup` and no `decode` method.\n",
"Here, `f` is converted to a `Transform` with no `setup` and no `decode` method.\n",
"\n",
"Python has a special syntax for passing a function (like `f`) to another function (or something that behaves like a function, known as a `callable` in Python), which is a *decorator*. A decorator is used by prepending a callable with `@`, and placing it before a function definition (there's lots of good online tutorials for Python decorators, so take a look if this is a new concept for you). The following is identical to the previous code:"
"Python has a special syntax for passing a function (like `f`) to another function (or something that behaves like a function, known as a *callable* in Python), called a *decorator*. A decorator is used by prepending a callable with `@` and placing it before a function definition (there are lots of good online tutorials about Python decorators, so take a look at one if this is a new concept for you). The following is identical to the previous code:"
]
},
{
@ -359,7 +359,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here `NormalizeMean` will initialize some state during the setup (the mean of all elements passed), then the transformation is to subtract that mean. For decoding purposes, we implement the reverse of that transformation by adding the mean. Here is an example of `NormalizeMean` in action:"
"Here, `NormalizeMean` will initialize some state during the setup (the mean of all elements passed), then the transformation is to subtract that mean. For decoding purposes, we implement the reverse of that transformation by adding the mean. Here is an example of `NormalizeMean` in action:"
]
},
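For reference, such a transform only takes a few lines; this sketch follows the pattern just described (the `setups`/`encodes`/`decodes` naming is explained in the table that follows):

```python
class NormalizeMean(Transform):
    def setups(self, items): self.mean = sum(items)/len(items)
    def encodes(self, x): return x - self.mean
    def decodes(self, x): return x + self.mean

tfm = NormalizeMean()
tfm.setup([1, 2, 3, 4, 5])   # stores mean = 3.0
y = tfm(2)                   # 2 - 3.0 = -1.0
x = tfm.decode(y)            # back to 2.0
```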
{
@ -397,14 +397,14 @@
"[options=\"header\"]\n",
"|======\n",
"| Class | To call | To implement\n",
"| `nn.Module` (PyTorch) | `()` (i.e. call as function) | `forward`\n",
"| `nn.Module` (PyTorch) | `()` (i.e., call as function) | `forward`\n",
"| `Transform` | `()` | `encodes`\n",
"| `Transform` | `decode()` | `decodes`\n",
"| `Transform` | `setup()` | `setups`\n",
"|======\n",
"```\n",
"\n",
"So, for instance, you would never call `setups` directly, but instead would call `setup`. The reason for this is that `setup` does some work before and after calling `setups` for you. To learn more about `Transform`s and how you can use them to have different behavior depending on the type of the input, be sure to check the tutorials in the fastai docs."
"So, for instance, you would never call `setups` directly, but instead would call `setup`. The reason for this is that `setup` does some work before and after calling `setups` for you. To learn more about `Transform`s and how you can use them to implement different behavior depending on the type of the input, be sure to check the tutorials in the fastai docs."
]
},
{
@ -418,7 +418,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To compose several transforms together, fastai provides `Pipeline`. We define a `Pipeline` by passing it a list of `Transform`s; it will then compose the transforms inside it. When you call a `Pipeline` on an object, it will automatically call the transforms inside, in order:"
"To compose several transforms together, fastai provides the `Pipeline` class. We define a `Pipeline` by passing it a list of `Transform`s; it will then compose the transforms inside it. When you call `Pipeline` on an object, it will automatically call the transforms inside, in order:"
]
},
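A sketch of composing the two text transforms from earlier, assuming `tok`, `num`, and the raw `txts` are the objects defined above:

```python
tfms = Pipeline([tok, num])
t = tfms(txts[0])        # tokenize, then numericalize
t[:20]

tfms.decode(t)[:100]     # and back to (roughly) readable text
```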
{
@ -446,7 +446,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And you can call decode on the result of your encoding, to get back something you can display and analyze:"
"And you can call `decode` on the result of your encoding, to get back something you can display and analyze:"
]
},
{
@ -473,7 +473,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The only part that doesn't work the same way as in `Transform` is the setup. To properly setup a `Pipeline` of `Transform`s on some data, you need to use a `TfmdLists`."
"The only part that doesn't work the same way as in `Transform` is the setup. To properly set up a `Pipeline` of `Transform`s on some data, you need to use a `TfmdLists`."
]
},
{
@ -487,7 +487,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Your data is usually a set of raw items (like filenames, or rows in a dataframe) to which you want to apply a succession of transformations. We just saw that the succession of transformations was represented by a `Pipeline` in fastai. The class that groups together this pipeline with your raw items is called `TfmdLists`."
"Your data is usually a set of raw items (like filenames, or rows in a DataFrame) to which you want to apply a succession of transformations. We just saw that a succession of transformations is represented by a `Pipeline` in fastai. The class that groups together this `Pipeline` with your raw items is called `TfmdLists`."
]
},
{
@ -517,7 +517,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"At initialization, the `TfmdLists` will automatically call the setup method of each transform in order, providing them not with the raw items but the items transformed by all the previous `Transform`s in order. We can get the result of our pipeline on any raw element just by indexing into the `TfmdLists`:"
"At initialization, the `TfmdLists` will automatically call the `setup` method of each `Transform` in order, providing them not with the raw items but the items transformed by all the previous `Transform`s in order. We can get the result of our `Pipeline` on any raw element just by indexing into the `TfmdLists`:"
]
},
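A sketch of building and querying such a `TfmdLists`, with `files` and `path` as set up earlier in the chapter:

```python
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])

t = tls[0]               # the first file, tokenized and numericalized
tls.decode(t)[:100]      # decoded back for inspection
```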
{
@ -544,7 +544,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And the `TfmdLists` knows how to decode for showing purposing:"
"And the `TfmdLists` knows how to decode for show purposes:"
]
},
{
@ -599,7 +599,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `TfmdLists` is named with an \"s\" because it can handle a training and validation set with a splits argument. You just need to pass the indices of which elemets are in the training set, and which are in the validation set:"
"The `TfmdLists` is named with an \"s\" because it can handle a training and a validation set with a `splits` argument. You just need to pass the indices of which elements are in the training set, and which are in the validation set:"
]
},
{
@ -618,7 +618,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can then access them through the `train` and `valid` attribute:"
"You can then access them through the `train` and `valid` attributes:"
]
},
{
@ -645,11 +645,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have manually written a `Transform` that returns your whole data (input and target) from the raw items you had, then `TfmdLists` is the class you need. You can directly convert it to a `DataLoaders` object with the `dataloaders` method. This is what we will do in our Siamese example further in this chapter.\n",
"If you have manually written a `Transform` that performs all of your preprocessing at once, turning raw items into a tuple with inputs and targets, then `TfmdLists` is the class you need. You can directly convert it to a `DataLoaders` object with the `dataloaders` method. This is what we will do in our Siamese example later in this chapter.\n",
"\n",
"In general though, you have two (or more) parallel pipelines of transforms: one for processing your raw items into inputs and one to process your raw items into targets. For instance, here, the pipeline we defined only processes the raw text into inputs. If we want to do text classification, we have to process the labels as well, into targets. \n",
"In general, though, you will have two (or more) parallel pipelines of transforms: one for processing your raw items into inputs and one to process your raw items into targets. For instance, here, the pipeline we defined only processes the raw text into inputs. If we want to do text classification, we also have to process the labels into targets. \n",
"\n",
"Here we need to do two things: first take the label name from the parent folder. There is a function `parent_label` for this:"
"For this we need to do two things. First we take the label name from the parent folder. There is a function, `parent_label`, for this:"
]
},
{
@ -677,7 +677,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we need a `Transform` that will grab the unique items and build a vocab with it during setup, then will transform the string labels into integers when called. fastai provides this transform, it's called `Categorize`:"
"Then we need a `Transform` that will grab the unique items and build a vocab with them during setup, then transform the string labels into integers when called. fastai provides this for us; it's called `Categorize`:"
]
},
{
@ -768,7 +768,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Like a `TfmdLists`, we can pass along `splits` to a `Datasets` to split our data between training and validation:"
"Like a `TfmdLists`, we can pass along `splits` to a `Datasets` to split our data between training and validation sets:"
]
},
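Putting those pieces together, a `Datasets` with an 80/20 split might be built roughly like this, with `files` and `path` as before:

```python
cut = int(len(files) * 0.8)
splits = [list(range(cut)), list(range(cut, len(files)))]

x_tfms = [Tokenizer.from_folder(path), Numericalize]   # builds the input
y_tfms = [parent_label, Categorize]                     # builds the target

dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)
x, y = dsets.valid[0]    # a (numericalized text, category index) tuple
```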
{
@ -829,7 +829,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to convert your `Datasets` object to a `DataLoaders`, which can be done with the `dataloaders` method. Here we need to pass along special arguments to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to `before_batch`: "
"The last step is to convert our `Datasets` object to a `DataLoaders`, which can be done with the `dataloaders` method. Here we need to pass along a special argument to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to `before_batch`: "
]
},
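A sketch of that call; `pad_input` is the fastai function that pads the texts in each batch to the length of the longest one:

```python
dls = dsets.dataloaders(bs=64, before_batch=pad_input)
```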
{
@ -845,11 +845,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`dataloaders` directly calls `DataLoader` on each subset of our `Datasets`. fastai's `DataLoader` expands the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has a lot of points of customization but the most important you should know are:\n",
"`dataloaders` directly calls `DataLoader` on each subset of our `Datasets`. fastai's `DataLoader` expands the PyTorch class of the same name and is responsible for collating the items from our datasets into batches. It has a lot of points of customization, but the most important ones that you should know are:\n",
"\n",
"- `after_item`: applied on each item after grabbing it inside the dataset. This is the equivalent of the `item_tfms` in `DataBlock`.\n",
"- `before_batch`: applied on the list of items before they are collated. This is the ideal place to pad items to the same size.\n",
"- `after_batch`: applied on the batch as a whole after its construction. This is the equivalent of the `batch_tfms` in `DataBlock`."
"- `after_item`:: Applied on each item after grabbing it inside the dataset. This is the equivalent of `item_tfms` in `DataBlock`.\n",
"- `before_batch`:: Applied on the list of items before they are collated. This is the ideal place to pad items to the same size.\n",
"- `after_batch`:: Applied on the batch as a whole after its construction. This is the equivalent of `batch_tfms` in `DataBlock`."
]
},
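Putting all of this together for the IMDb classifier, the full mid-level pipeline looks roughly like this (a sketch of the code this section builds up to, with `path` as before):

```python
tfms = [[Tokenizer.from_folder(path), Numericalize],
        [parent_label, Categorize]]
files = get_text_files(path, folders=['train', 'test'])
splits = GrandparentSplitter(valid_name='test')(files)

dsets = Datasets(files, tfms, splits=splits)
dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)
```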
{
@ -876,9 +876,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The two differences with what we had above is the use of `GrandParentSplitter` to split our training and validation data, and the `dl_type` argument. This is to tell `dataloaders` to use the `SortedDL` class of `DataLoader`, and not the usual one. `SortedDL` constructs batches by putting samples of roughly the same lengths into batches.\n",
"The two differences from the previous code are the use of `GrandparentSplitter` to split our training and validation data, and the `dl_type` argument. This is to tell `dataloaders` to use the `SortedDL` class of `DataLoader`, and not the usual one. `SortedDL` constructs batches by putting samples of roughly the same lengths into batches.\n",
"\n",
"This does the exact same thing as our `DataBlock` from above:"
"This does the exact same thing as our previous `DataBlock`:"
]
},
{
@ -900,25 +900,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"...except that now, you know how to customize every single piece of it!\n",
"But now, you know how to customize every single piece of it!\n",
"\n",
"Let's practice what we just learned on this mid-level API for data preprocessing on a computer vision example now, with a *Siamese Model* input pipeline."
"Let's practice what we just learned on this mid-level API for data preprocessing about using a computer vision example now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying the mid-Tier Data API: SiamesePair"
"## Applying the Mid-Level Data API: SiamesePair"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A *Siamese model* takes two images and has to determine if they are of the same class or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. We will explain here how to prepare the data for such a model, then we will train that model in <<chapter_arch_details>>.\n",
"A *Siamese model* takes two images and has to determine if they are of the same class or not. For this example, we will use the Pet dataset again and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. We will explain here how to prepare the data for such a model, then we will train that model in <<chapter_arch_details>>.\n",
"\n",
"First things first, let's get the images in our dataset."
"First things first, let's get the images in our dataset:"
]
},
{
@ -936,9 +936,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If we didn't care about showing our objects at all, we could directly create one transform to completely preprocess that list of files. We will want to look at those images though, so we need to create a custom type. When you call the `show` method on a `TfmdLists` or a `Datasets` object, it will decode items until it reaches a type that contains a `show` method and use it to show the object. That `show` method gets passed a `ctx`, which could be a matplotlib axes for images, or the row of a dataframe for texts.\n",
"If we didn't care about showing our objects at all, we could directly create one transform to completely preprocess that list of files. We will want to look at those images though, so we need to create a custom type. When you call the `show` method on a `TfmdLists` or a `Datasets` object, it will decode items until it reaches a type that contains a `show` method and use it to show the object. That `show` method gets passed a `ctx`, which could be a `matplotlib` axis for images, or a row of a DataFrame for texts.\n",
"\n",
"Here we create a `SiameseImage` object that subclasses `Tuple` and is intended to contain three things: two images, and a boolean that's `True` if they are the same breed. We also implement the special `show` method, such that it concatenates the two images, with a black line in the middle. Don't worry too much about the part that is in the `if` test (which is to show the `SiameseImage` when the images are Pillow images, and not tensors), the important part is in the last three lines."
"Here we create a `SiameseImage` object that subclasses `Tuple` and is intended to contain three things: two images, and a Boolean that's `True` if the images are of the same breed. We also implement the special `show` method, such that it concatenates the two images with a black line in the middle. Don't worry too much about the part that is in the `if` test (which is to show the `SiameseImage` when the images are Python images, not tensors); the important part is in the last three lines:"
]
},
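The class in question looks roughly like the following sketch; the `if` branch converts PIL images to channel-first tensors so they can be concatenated, and `Tuple` here is fastcore's tuple subclass:

```python
class SiameseImage(Tuple):
    def show(self, ctx=None, **kwargs):
        img1, img2, same_breed = self
        if not isinstance(img1, Tensor):
            # PIL images: make the sizes match, then convert to CHW tensors
            if img2.size != img1.size: img2 = img2.resize(img1.size)
            t1, t2 = tensor(img1), tensor(img2)
            t1, t2 = t1.permute(2, 0, 1), t2.permute(2, 0, 1)
        else:
            t1, t2 = img1, img2
        # Draw the two images side by side with a black separator line
        line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)
        return show_image(torch.cat([t1, line, t2], dim=2),
                          title=same_breed, ctx=ctx)
```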
{
@ -964,7 +964,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's then create a first `SiameseImage` and check our `show` method works:"
"Let's create a first `SiameseImage` and check our `show` method works:"
]
},
{
@ -1026,7 +1026,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The important thing with `Transform`s we saw before is that they dispatch over tuples or their subclasses. That's precisely why we chose to subclass tuple in this instance: this way we can apply any transform that work on images to our `SiameseImage` and it will be applied on each image in the tuple:"
"The important thing with transforms that we saw before is that they dispatch over tuples or their subclasses. That's precisely why we chose to subclass `Tuple` in this instance--this way we can apply any transform that works on images to our `SiameseImage` and it will be applied on each image in the tuple:"
]
},
{
@ -1056,9 +1056,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the resize transform is applied to each of the two images, but not the boolean flag. Even if we have a custom type, we can thus benefit from all the data augmentation transforms inside the library.\n",
"Here the `Resize` transform is applied to each of the two images, but not the Boolean flag. Even if we have a custom type, we can thus benefit from all the data augmentation transforms inside the library.\n",
"\n",
"We are now ready to build the `Transform` that we will use to get our data ready for a Siamese model. First, we will need a function to determine the class of all our images:"
"We are now ready to build the `Transform` that we will use to get our data ready for a Siamese model. First, we will need a function to determine the classes of all our images:"
]
},
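In the Pet dataset the breed is encoded in the filename, so a small regular expression does the job (a sketch; `fname` is expected to be a `Path`):

```python
def label_func(fname):
    # e.g. 'great_pyrenees_173.jpg' -> 'great_pyrenees'
    return re.match(r'^(.*)_\d+.jpg$', fname.name).groups()[0]
```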
{
@ -1075,7 +1075,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then here is our main transform. For each image, il will, with a probability of 0.5, draw an image from the same class and return a `SiameseImage` with a true label, or draw an image from another class and return a `SiameseImage` with a false label. This is all done in the private `_draw` function. There is one difference between the training and validation set, which is why the transform needs to be initialized with the splits: on the training set, we will make that random pick each time we read an image, whereas on the validation set, we make this random pick once and for all at initialization. This way, we get more varied samples during training, but always the same validation set."
"For each image our tranform will, with a probability of 0.5, draw an image from the same class and return a `SiameseImage` with a true label, or draw an image from another class and return a `SiameseImage` with a false label. This is all done in the private `_draw` function. There is one difference between the training and validation sets, which is why the transform needs to be initialized with the splits: on the training set we will make that random pick each time we read an image, whereas on the validation set we make this random pick once and for all at initialization. This way, we get more varied samples during training, but always the same validation set:"
]
},
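A sketch of that transform, close to the code used in this chapter: `_draw` does the random pairing, and the validation pairs are fixed once in `__init__`:

```python
class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l)
                          for l in self.labels}
        self.label_func = label_func
        # Fix the validation pairs once, so the validation set never changes
        self.valid = {f: self._draw(f) for f in files[splits[1]]}

    def encodes(self, f):
        f2, same = self.valid.get(f, self._draw(f))
        img1, img2 = PILImage.create(f), PILImage.create(f2)
        return SiameseImage(img1, img2, same)

    def _draw(self, f):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same:
            cls = random.choice(L(l for l in self.labels if l != cls))
        return random.choice(self.lbl2files[cls]), same
```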
{
@ -1140,7 +1140,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the middle level API for data collection, we have two objects that can help us apply transforms on a set of items, `TfmdLists` and `Datasets`. If you remember what we have just seen, one applies a `Pipeline` of transforms and the other applies several `Pipeline` of transforms in parallel, to build tuples. Here, our main transform already builds the tuples, so we use `TfmdLists`:"
"In the mid-level API for data collection we have two objects that can help us apply transforms on a set of items, `TfmdLists` and `Datasets`. If you remember what we have just seen, one applies a `Pipeline` of transforms and the other applies several `Pipeline` of transforms in parallel, to build tuples. Here, our main transform already builds the tuples, so we use `TfmdLists`:"
]
},
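Applying it with a random split might look like this, reusing `files` and `label_func` from above:

```python
splits = RandomSplitter()(files)
tfm = SiameseTransform(files, label_func, splits)
tls = TfmdLists(files, tfm, splits=splits)

show_at(tls.valid, 0)    # decodes and shows the first validation item
```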
{
@ -1170,7 +1170,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can finally get our data in `DataLoaders` by calling the `dataloaders` method. One thing to be careful here is that this method does not take `item_tfms` and `batch_tfms` like a `DataBlock`. The fastai `DataLoader` has several hooks that are named after events: here what we apply on the items after they are grabbed is called `after_item`, and what we apply on the batch once it's built is called `after_batch`."
"And we can finally get our data in `DataLoaders` by calling the `dataloaders` method. One thing to be careful of here is that this method does not take `item_tfms` and `batch_tfms` like a `DataBlock`. The fastai `DataLoader` has several hooks that are named after events; here what we apply on the items after they are grabbed is called `after_item`, and what we apply on the batch once it's built is called `after_batch`:"
]
},
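A sketch of that call; these are essentially the transforms the data block API would otherwise have added for us:

```python
dls = tls.dataloaders(
    after_item=[Resize(224), ToTensor],
    after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])
```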
{
@ -1187,17 +1187,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we need to pass more transforms than usual: that's because the data block API usually adds them automatically:\n",
"Note that we need to pass more transforms than usual--that's because the data block API usually adds them automatically:\n",
"\n",
"- `ToTensor` is the one that converts images to tensors (again, it's applied on every part of the tuple)\n",
"- `IntToFloatTensor` convert the tensor of images that have integers from 0 to 255 to a tensor of floats, and divides by 255 to make the values between 0 and 1."
"- `ToTensor` is the one that converts images to tensors (again, it's applied on every part of the tuple).\n",
"- `IntToFloatTensor` converts the tensor of images containing integers from 0 to 255 to a tensor of floats, and divides by 255 to make the values between 0 and 1."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now train a model using those `DataLoaders`. It needs a bit more customization than the usual model provided by `cnn_learner` since it has to take two images instead of one. We will see how to create such a model and train it in <<chapter_arch_dtails>>."
"We can now train a model using this `DataLoaders`. It will need a bit more customization than the usual model provided by `cnn_learner` since it has to take two images instead of one, but we will see how to create such a model and train it in <<chapter_arch_dtails>>."
]
},
{
@ -1211,7 +1211,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"fastai presents a layered API: it takes one line of code to grab the data when it's in one of the usual settings. It's then easy for beginners to focus on training a model without spending to much time assembling the data. Then the data block API gives you more flexibility by mixing and matching some building blocks. Undeneath it, the mid-level API gives you entire flexibility to apply any transformations on your items. In your real-wrold problems, this is probably what you will need to use, and we hope it makes the step of data-munging as easy as poassible."
"fastai provides a layered API. It takes one line of code to grab the data when it's in one of the usual settings, making it easy for beginners to focus on training a model without spending too much time assembling the data. Then, the high-level data block API gives you more flexibility by allowing you to mix and match some building blocks. Underneath it, the mid-level API gives you greater flexibility to apply any transformations on your items. In your real-world problems, this is probably what you will need to use, and we hope it makes the step of data-munging as easy as possible."
]
},
{
@ -1225,17 +1225,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Why do we say that fastai has a layered API? What does it mean?\n",
"1. Why does a `Transform` have a decode method? What does it do?\n",
"1. Why does a `Transform` have a setup method? What does it do?\n",
"1. Why do we say that fastai has a \"layered\" API? What does it mean?\n",
"1. Why does a `Transform` have a `decode` method? What does it do?\n",
"1. Why does a `Transform` have a `setup` method? What does it do?\n",
"1. How does a `Transform` work when called on a tuple?\n",
"1. Which methods do you need to implement when writing your own `Transform`?\n",
"1. Write a `Normalize` transform that fully normalizes items (substract the mean and divide by the standard deviation of the dataset), and that can decode that behavior. Try not to peak!\n",
"1. Write a `Transform` that does the numericalization of tokenized texts (it should set its vocab automatically from the dataset seen and have a decode method). Look at the source code of fastai if you need help.\n",
"1. Write a `Normalize` transform that fully normalizes items (subtract the mean and divide by the standard deviation of the dataset), and that can decode that behavior. Try not to peek!\n",
"1. Write a `Transform` that does the numericalization of tokenized texts (it should set its vocab automatically from the dataset seen and have a `decode` method). Look at the source code of fastai if you need help.\n",
"1. What is a `Pipeline`?\n",
"1. What is a `TfmdLists`? \n",
"1. What is a `Datasets`? How is it different from `TfmdLists`?\n",
"1. Why are `TfmdLists` and `Datasets` named with an s?\n",
"1. What is a `Datasets`? How is it different from a `TfmdLists`?\n",
"1. Why are `TfmdLists` and `Datasets` named with an \"s\"?\n",
"1. How can you build a `DataLoaders` from a `TfmdLists` or a `Datasets`?\n",
"1. How do you pass `item_tfms` and `batch_tfms` when building a `DataLoaders` from a `TfmdLists` or a `Datasets`?\n",
"1. What do you need to do when you want to have your custom items work with methods like `show_batch` or `show_results`?\n",
@ -1253,8 +1253,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Use the mid-level API to grab the data on the pets dataset. On the adult dataset (used in chapter 1).\n",
"1. Look at the siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new type of items. Implement it on your own project."
"1. Use the mid-level API to prepare the data in `DataLoaders` on the pets dataset. On the adult dataset (used in chapter 1).\n",
"1. Look at the Siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new type of items. Implement it in your own project."
]
},
{
@ -1268,11 +1268,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations you've completed all of the chapters in this book which cover the key practical parts of training and using deep learning! You know how to use all of fastai's built in applications, and how to customise them using the data blocks API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to help make sure your creations help improve society too.)\n",
"Congratulations—you've completed all of the chapters in this book that cover the key practical parts of training models and using deep learning! You know how to use all of fastai's built-in applications, and how to customize them using the data block API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to make sure your creations help improve society too.)\n",
"\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network application. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system which best handles these capabilities and limitations.\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network application. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system that's well adapted to them.\n",
"\n",
"In the rest of this book we will be pulling apart these applications, piece by piece, to understand all of the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is the knowledge which allows you to inspect and debug models that you build, and to create new applications which are customised for your particular projects."
"In the rest of this book we will be pulling apart those applications, piece by piece, to understand the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is what allows you to inspect and debug models that you build and create new applications that are customized for your particular projects."
]
},
{

View File

@ -28,9 +28,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We're now ready to go deep... deep into deep learning! You already learned how to train a basic neural network, but how do you go from there to creating state of the art models? In this part of the book we're going to uncover all of the mysteries, starting with language models.\n",
"We're now ready to go deep... deep into deep learning! You already learned how to train a basic neural network, but how do you go from there to creating state-of-the-art models? In this part of the book we're going to uncover all of the mysteries, starting with language models.\n",
"\n",
"We saw in <<chapter_nlp>> how to finetune a pretrained language model to build a text classifier, in this chapter, we will explain to you what exactly is inside that model, and what an RNN is. First, let's gather some data that will allow us to quickly prototype our various models. "
"You saw in <<chapter_nlp>> how to fine-tune a pretrained language model to build a text classifier. In this chapter, we will explain to you what exactly is inside that model, and what an RNN is. First, let's gather some data that will allow us to quickly prototype our various models. "
]
},
{
@ -44,14 +44,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 numbers written out in English."
"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can that will allow us to try out methods quickly and easily, and interpret the results. When we started working on language modeling a few years ago we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *Human Numbers*, and it simply contains the first 10,000 numbers written out in English."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> j: One of the most common practical mistakes I see even amongst highly experienced practitioners is failing to use appropriate datasets at appropriate times during the analysis process. In particular, most people tend to start with datasets which are too big and too complicated."
"> j: One of the most common practical mistakes I see even amongst highly experienced practitioners is failing to use appropriate datasets at appropriate times during the analysis process. In particular, most people tend to start with datasets that are too big and too complicated."
]
},
{
@ -105,7 +105,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's open those two files and see what's inside. At first we'll join all of those texts together and ignore the split train/valid given by the dataset, we will come back to it later on:"
"Let's open those two files and see what's inside. At first we'll join all of the texts together and ignore the train/valid split given by the dataset (we'll come back to that later):"
]
},
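For instance, a minimal sketch of that step (assuming `from fastai.text.all import *` has been run, and that the dataset ships as `train.txt` and `valid.txt` as in the chapter):

```python
path = untar_data(URLs.HUMAN_NUMBERS)

lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
```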
{
@ -135,7 +135,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We take all those lines and concatenate them in one big stream. To mark when we go from one number to the next, we use a '.' as separation:"
"We take all those lines and concatenate them in one big stream. To mark when we go from one number to the next, we use a `.` as a separator:"
]
},
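A sketch of that concatenation (assuming `lines` from the previous step):

```python
# join every line into one stream, with ' . ' marking the boundary between numbers
text = ' . '.join([l.strip() for l in lines])
text[:75]
```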
{
@ -163,7 +163,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use word tokenization for this dataset, by splitting on spaces:"
"We can tokenize this dataset by splitting on spaces:"
]
},
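For example (assuming `text` from the previous step):

```python
# word tokenization by splitting on spaces
tokens = text.split(' ')
tokens[:10]
```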
{
@ -248,7 +248,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have some small dataset on which language modelling should be an easy task, we can build our first model."
"Now that we have a small dataset on which language modeling should be an easy task, we can build our first model."
]
},
{
@ -262,9 +262,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"One simple way to turn this into a neural network would be to specify that we are going to predict each word based on the previous three words. Therefore, we could create a list of every sequence of three words as independent variables, and the next word after each sequence as the dependent variable. \n",
"One simple way to turn this into a neural network would be to specify that we are going to predict each word based on the previous three words. We could create a list of every sequence of three words as our independent variables, and the next word after each sequence as the dependent variable. \n",
"\n",
"We can do that with plain Python. Let us do it first with tokens just to confirm what it looks like:"
"We can do that with plain Python. Let's do it first with tokens just to confirm what it looks like:"
]
},
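A sketch of that pairing (assuming `tokens` from above and a numericalized version `nums`, with fastai's `L` and `tensor` helpers in scope):

```python
# independent variable: three tokens; dependent variable: the token that follows
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4, 3))

# the same pairing on the numericalized ids, ready to feed to a model
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))
```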
{
@ -319,7 +319,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can batch those easily using the `DataLoader` class. For now we will split randomly the sequences."
"We can batch those easily using the `DataLoader` class. For now we will split the sequences randomly:"
]
},
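One possible sketch of that step (assuming `seqs` holds the (input, target) pairs just built; here we simply hold out the last 20% of the sequences for validation and let `DataLoaders.from_dsets` handle the batching):

```python
bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)
```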
{
@ -341,9 +341,9 @@
"\n",
"The first tweak is that the first linear layer will use only the first word's embedding as activations, the second layer will use the second word's embedding plus the first layer's output activations, and the third layer will use the third word's embedding plus the second layer's output activations. The key effect of this is that every word is interpreted in the information context of any words preceding it. \n",
"\n",
"The second tweak is that each of these three layers will use the same weight matrix. The way that one word impacts the activations from previous words should not change depending on the position of a word. In other words, activation values will change as data moves through the layers, but the layer weights themselves will not change from layer to layer. So a layer does not learn one sequence position; it must learn to handle all positions.\n",
"The second tweak is that each of these three layers will use the same weight matrix. The way that one word impacts the activations from previous words should not change depending on the position of a word. In other words, activation values will change as data moves through the layers, but the layer weights themselves will not change from layer to layer. So, a layer does not learn one sequence position; it must learn to handle all positions.\n",
"\n",
"Since layer weights do not change, you might think of the sequential layers as the \"same layer\" repeated. In fact PyTorch makes this concrete; we can just create one layer, and use it multiple times."
"Since layer weights do not change, you might think of the sequential layers as \"the same layer\" repeated. In fact, PyTorch makes this concrete; we can just create one layer, and use it multiple times."
]
},
{
@ -387,25 +387,25 @@
"source": [
"As you see, we have created three layers:\n",
"\n",
"- The embedding layer (`i_h` for *input* to *hidden*)\n",
"- The linear layer to create the activations for the next word (`h_h` for *hidden* to *hidden*)\n",
"- A final linear layer to predict the fourth word (`h_o` for *hidden* to *output*)\n",
"- The embedding layer (`i_h`, for *input* to *hidden*)\n",
"- The linear layer to create the activations for the next word (`h_h`, for *hidden* to *hidden*)\n",
"- A final linear layer to predict the fourth word (`h_o`, for *hidden* to *output*)\n",
"\n",
"This might be easier to represent in pictorial form. Let's define a simple pictorial representation of basic neural networks. <<img_simple_nn>> shows how we're going to represent a neural net with one hidden layer."
"This might be easier to represent in pictorial form, so let's define a simple pictorial representation of basic neural networks. <<img_simple_nn>> shows how we're going to represent a neural net with one hidden layer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Pictorial representation of simple neural network\" width=\"400\" src=\"images/att_00020.png\" caption=\"Pictorial representation of simple neural network\" id=\"img_simple_nn\">"
"<img alt=\"Pictorial representation of simple neural network\" width=\"400\" src=\"images/att_00020.png\" caption=\"Pictorial representation of a simple neural network\" id=\"img_simple_nn\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each shape represents activations: rectangle for input, circle for hidden (inner) layer activations, and triangle for output activations. We will use those shapes (summarized in <<img_shapes>>) in all the diagrams of this chapter."
"Each shape represents activations: rectangle for input, circle for hidden (inner) layer activations, and triangle for output activations. We will use those shapes (summarized in <<img_shapes>>) in all the diagrams in this chapter."
]
},
{
@ -419,7 +419,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"An arrow represents the actual layer computation—i.e. the linear layer followed by the activation layers. Using this notation, <<lm_rep>> shows what our simple language model looks like."
"An arrow represents the actual layer computation—i.e., the linear layer followed by the activation layers. Using this notation, <<lm_rep>> shows what our simple language model looks like."
]
},
{
@ -506,7 +506,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To see if this is any good, let's check what would a very simple model give us. In this case we could always predict the most common token, so let's find out which token is the most often the target in our validation set:"
"To see if this is any good, let's check what a very simple model would give us. In this case we could always predict the most common token, so let's find out which token is most often the target in our validation set:"
]
},
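A sketch of that check (assuming the `dls` and `vocab` built earlier in the chapter):

```python
import torch

# count how often each vocab index appears as a target in the validation set
n, counts = 0, torch.zeros(len(vocab))
for x, y in dls.valid:
    n += y.shape[0]
    for i in range(len(vocab)): counts[i] += (y == i).long().sum()
idx = torch.argmax(counts)
idx.item(), vocab[idx.item()], (counts[idx] / n).item()
```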
{
@ -538,21 +538,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common token has the index 29, which corresponds to the token 'thousand'. Always predicting this token would give us an accuracy of roughly 15\\%, so we are faring way better!"
"The most common token has the index 29, which corresponds to the token `thousand`. Always predicting this token would give us an accuracy of roughly 15\\%, so we are faring way better!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at `tokens` reminded me that large numbers are written with many words, so on the way to 10,000 you write \"thousand\" a lot: five thousand, five thousand and one, five thousand and two, etc.. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones."
"> A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at `tokens` reminded me that large numbers are written with many words, so on the way to 10,000 you write \"thousand\" a lot: five thousand, five thousand and one, five thousand and two, etc. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a nice first baseline. Let's see how we can refactor this with a loop."
"This is a nice first baseline. Let's see how we can refactor it with a loop."
]
},
{
@ -566,7 +566,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the code for our module, we could simplify it by replacing the duplicated code that calls the layers with a for loop. As well as making our code simpler, this will also have the benefit that we could apply our module equally well to token sequences of different lengths; we would not be restricted to token lists of length three."
"Looking at the code for our module, we could simplify it by replacing the duplicated code that calls the layers with a `for` loop. As well as making our code simpler, this will also have the benefit that we will be able to apply our module equally well to token sequences of different lengths--we won't be restricted to token lists of length three:"
]
},
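A self-contained PyTorch sketch of the refactored module (the class and argument names here are illustrative):

```python
import torch.nn.functional as F
from torch import nn

class LMModel2(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input  to hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output

    def forward(self, x):
        h = 0
        for i in range(x.shape[1]):          # works for any sequence length
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
```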
{
@ -664,7 +664,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also refactor our pictorial representation in exactly the same way, see <<basic_rnn>> (we're also removing the details of activation sizes here, and using the same arrow colors as in <<lm_rep>>)."
"We can also refactor our pictorial representation in exactly the same way, as shown in <<basic_rnn>> (we're also removing the details of activation sizes here, and using the same arrow colors as in <<lm_rep>>)."
]
},
{
@ -678,30 +678,30 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You will see that there is a set of activations which are being updated each time through the loop, and are stored in the variable `h` — this is called the *hidden state*."
"You will see that there is a set of activations that are being updated each time through the loop, stored in the variable `h`—this is called the *hidden state*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Jargon: hidden state: the activations that are updated at each step of a recurrent neural network"
"> Jargon: hidden state: The activations that are updated at each step of a recurrent neural network."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A neural network which is defined using a loop like this is called a *recurrent neural network*, also known as an RNN. It is important to realise that an RNN is not a complicated new architecture, but is simply a refactoring of a multilayer neural network using a for loop.\n",
"A neural network that is defined using a loop like this is called a *recurrent neural network* (RNN). It is important to realize that an RNN is not a complicated new architecture, but simply a refactoring of a multilayer neural network using a `for` loop.\n",
"\n",
"> A: My true opinion: if they were called \"looping neural networks\", or LNNs, they would seem 50% less daunting!"
"> A: My true opinion: if they were called \"looping neural networks,\" or LNNs, they would seem 50% less daunting!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we know what an RNN is, let's try to make it a little bit beter."
"Now that we know what an RNN is, let's try to make it a little bit better."
]
},
{
@ -715,11 +715,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the code for our RNN, one thing that seems problematic is that we are initialising our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order those samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. \n",
"Looking at the code for our RNN, one thing that seems problematic is that we are initializing our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order the samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. \n",
"\n",
"Another thing we can look at is having more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
"\n",
"We'll see how we can implement those changes, starting with adding some state."
"Let's see how we can implement those changes, starting with adding some state."
]
},
{
@ -733,15 +733,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we initialize the model's hidden state to zero for each new sample, we are throwing away all the information we have about the sentences we have seen so far, which means that our model doesn't actually know where we are up to in the overall counting sequence. This is easily fixed; we can simply move the initialisation of the hidden state to `__init__`.\n",
"Because we initialize the model's hidden state to zero for each new sample, we are throwing away all the information we have about the sentences we have seen so far, which means that our model doesn't actually know where we are up to in the overall counting sequence. This is easily fixed; we can simply move the initialization of the hidden state to `__init__`.\n",
"\n",
"But this fix will create its own subtle, but important, problem. It effectively makes our neural network as deep as the entire number of tokens in our document. For instance, if there were 10,000 tokens in our dataset, we would be creating a 10,000 layer neural network.\n",
"But this fix will create its own subtle, but important, problem. It effectively makes our neural network as deep as the entire number of tokens in our document. For instance, if there were 10,000 tokens in our dataset, we would be creating a 10,000-layer neural network.\n",
"\n",
"To see this, consider the original pictorial representation of our recurrent neural network in <<lm_rep>>, before refactoring it with a for loop. You can see each layer corresponds with one token input. When we talk about the representation of a recurrent neural network before refactoring with the for loop, we call this the *unrolled representation*. It is often helpful to consider the unrolled representation when trying to understand an RNN.\n",
"To see why this is the case, consider the original pictorial representation of our recurrent neural network in <<lm_rep>>, before refactoring it with a `for` loop. You can see each layer corresponds with one token input. When we talk about the representation of a recurrent neural network before refactoring with the `for` loop, we call this the *unrolled representation*. It is often helpful to consider the unrolled representation when trying to understand an RNN.\n",
"\n",
"The problem with a 10,000 layer neural network is that if and when you get to the 10,000th word of the dataset, you will still need to calculate the derivatives all the way back to the first layer. This is going to be very slow indeed, and very memory intensive. It is unlikely that you could store even one mini batch on your GPU.\n",
"The problem with a 10,000-layer neural network is that if and when you get to the 10,000th word of the dataset, you will still need to calculate the derivatives all the way back to the first layer. This is going to be very slow indeed, and very memory-intensive. It is unlikely that you'll be able to store even one mini-batch on your GPU.\n",
"\n",
"The solution to this is to tell PyTorch that we do not want to back propagate the derivatives through the entire implicit neural network. Instead, we will just keep the last three layers of gradients. To remove all of the gradient history in PyTorch, we use the `detach` method.\n",
"The solution to this problem is to tell PyTorch that we do not want to back propagate the derivatives through the entire implicit neural network. Instead, we will just keep the last three layers of gradients. To remove all of the gradient history in PyTorch, we use the `detach` method.\n",
"\n",
"Here is the new version of our RNN. It is now stateful, because it remembers its activations between different calls to `forward`, which represent its use for different samples in the batch:"
]
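A sketch of such a stateful version, using the same illustrative names as the earlier sketch:

```python
import torch.nn.functional as F
from torch import nn

class LMModel3(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0                            # hidden state, kept between calls

    def forward(self, x):
        for i in range(x.shape[1]):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()              # keep the values, drop the gradient history
        return out

    def reset(self): self.h = 0
```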
@ -774,14 +774,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you think about it, this model will have the same activations whatever the sequence length we pick, because the hidden state will remember the last activation from the previous batch. The only thing that will be different are the gradients computed at each step: they will only be calculated on sequence length tokens in the past, instead of the whole stream. That is why this sequence length is often called *bptt* for back-propagation through time."
"This model will have the same activations whatever sequence length we pick, because the hidden state will remember the last activation from the previous batch. The only thing that will be different is the gradients computed at each step: they will only be calculated on sequence length tokens in the past, instead of the whole stream. This approach is called *backpropagation through time* (BPTT)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* jargon: Back propagation through time (BPTT): Treating a neural net with effectively one layer per time step (usually refactored using a loop) as one big model, and calculating gradients on it in the usual way. To avoid running out of memory and time, we usually use _truncated_ BPTT, which \"detaches\" the history of computation steps in the hidden state every few time steps."
"> jargon: Back propagation through time (BPTT): Treating a neural net with effectively one layer per time step (usually refactored using a loop) as one big model, and calculating gradients on it in the usual way. To avoid running out of memory and time, we usually use _truncated_ BPTT, which \"detaches\" the history of computation steps in the hidden state every few time steps."
]
},
{
@ -792,7 +792,7 @@
"\n",
"`LMDataLoader` was doing this for us in <<chapter_nlp>>. This time we're going to do it ourselves.\n",
"\n",
"To do this, we are going to rearrange our dataset. First we divide the samples into `m = len(dset) // bs` groups (this is the equivalent of splitting the whole concatenated dataset into, for instance, 64 equally sized pieces, since we're using `bs=64` here). `m` is the length of each of these pieces. For instance, if we're using our whole dataset (although we'll actually split it into train vs valid in a moment), that will be:"
"To do this, we are going to rearrange our dataset. First we divide the samples into `m = len(dset) // bs` groups (this is the equivalent of splitting the whole concatenated dataset into, for example, 64 equally sized pieces, since we're using `bs=64` here). `m` is the length of each of these pieces. For instance, if we're using our whole dataset (although we'll actually split it into train versus valid in a moment), that will be:"
]
},
{
@ -824,7 +824,7 @@
"\n",
" (0, m, 2*m, ..., (bs-1)*m)\n",
"\n",
"then the second batch of the samples: \n",
"the second batch of the samples: \n",
"\n",
" (1, m+1, 2*m+1, ..., (bs-1)*m+1)\n",
"\n",
@ -850,7 +850,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we just pass `drop_last=True` when building our `DataLoaders` to drop the last batch that has not a shape of `bs`, we also pass `shuffle=False` to make sure the texts are read in order."
"Then we just pass `drop_last=True` when building our `DataLoaders` to drop the last batch that does not have a shape of `bs`. We also pass `shuffle=False` to make sure the texts are read in order:"
]
},
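A sketch of that reindexing and of the `DataLoaders` call (assuming `seqs` and `bs` from earlier; the `group_chunks` helper name is illustrative, and it implements the reordering just described):

```python
from fastcore.foundation import L
from fastai.data.core import DataLoaders

def group_chunks(ds, bs):
    m = len(ds) // bs                 # length of each of the bs contiguous pieces
    new_ds = L()
    # row i*bs + j of the new dataset is item i of piece j, so batch i reads
    # (i, m+i, 2*m+i, ..., (bs-1)*m+i)
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds

cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False)
```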
{
@ -870,7 +870,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The last thing we add is a little tweak of the training loop via a `Callback`. We will talk more about callbacks in <<chapter_accel_sgd>>; this one will call the `reset` method of our model at the beginning of each epoch and before each validation phase. Since we implemented that method to zero the hidden state of the model, this will make sure we start we a clean state before reading those continuous chunks of text. We can also start training a bit longer:"
"The last thing we add is a little tweak of the training loop via a `Callback`. We will talk more about callbacks in <<chapter_accel_sgd>>; this one will call the `reset` method of our model at the beginning of each epoch and before each validation phase. Since we implemented that method to zero the hidden state of the model, this will make sure we start with a clean state before reading those continuous chunks of text. We can also start training a bit longer:"
]
},
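For instance, a sketch of wiring this up with fastai's built-in `ModelResetter` callback, which calls `reset` at exactly those points (assuming `dls`, `vocab`, and the `LMModel3` sketch above):

```python
from fastai.text.all import *

learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)
```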
{
@ -1011,7 +1011,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is easy enough to add. We need to first change our data so that the dependent variable has each of the three next words after each of our three input words. Instead of 3, we use an attribute, `sl` (for sequence length) and make it a bit bigger:"
"This is easy enough to add. We need to first change our data so that the dependent variable has each of the three next words after each of our three input words. Instead of `3`, we use an attribute, `sl` (for sequence length), and make it a bit bigger:"
]
},
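A sketch of the new pairs, in which each target sequence is simply the input sequence shifted by one token (assuming `nums` from earlier, with fastai's `L` and `tensor` helpers in scope):

```python
sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0, len(nums)-sl-1, sl))
```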
{
@ -1061,7 +1061,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to modify our model so that it outputs a prediction after every word, rather than just at the end of a three word sequence:"
"Now we need to modify our model so that it outputs a prediction after every word, rather than just at the end of a three-word sequence:"
]
},
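A sketch of that change, keeping the same illustrative names; the model now stacks one output per input position:

```python
import torch
import torch.nn.functional as F
from torch import nn

class LMModel4(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        for i in range(x.shape[1]):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))     # one prediction per position
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)

    def reset(self): self.h = 0
```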
{
@ -1258,9 +1258,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to train for longer, since the task has changed a bit and is more complicated now. But we end up with a good result... At least, sometimes. If you run it a few times, you'll see that you can get quite different results on different runs. That's because effectively we have a very deep network here, which can result in very large or very small gradients. We'll see in the next part of to deal with this.\n",
"We need to train for longer, since the task has changed a bit and is more complicated now. But we end up with a good result... At least, sometimes. If you run it a few times, you'll see that you can get quite different results on different runs. That's because effectively we have a very deep network here, which can result in very large or very small gradients. We'll see in the next part of this chapter how to deal with this.\n",
"\n",
"Now, the obvious way to get a better model is to go deeper: we only have one linear layer between the hidden state and the output activations in our basic RNN, so maybe we would get better results with more."
"Now, the obvious way to get a better model is to go deeper: we only have one linear layer between the hidden state and the output activations in our basic RNN, so maybe we'll get better results with more."
]
},
{
@ -1288,14 +1288,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"…or in an unrolled representation in <<unrolled_stack_rep>> (the same way as in <<lm_rep>> last section)."
"The unrolled representation is shown in <<unrolled_stack_rep>> (similar to <<lm_rep>>)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"2-layer unrolled RNN\" width=\"500\" caption=\"2-layer unrolled RNN\" id=\"unrolled_stack_rep\" src=\"images/att_00026.png\">"
"<img alt=\"2-layer unrolled RNN\" width=\"500\" caption=\"Two-layer unrolled RNN\" id=\"unrolled_stack_rep\" src=\"images/att_00026.png\">"
]
},
{
@ -1309,14 +1309,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Model"
"### The Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's save some time by using PyTorch's RNN class, which implements exactly what we have created above, but also gives us the option to stack multiple RNNs, as we have discussed:"
"We can save some time by using PyTorch's `RNN` class, which implements exactly what we created earlier, but also gives us the option to stack multiple RNNs, as we have discussed:"
]
},
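A sketch of the stacked version built on `nn.RNN` (illustrative names; here the batch size is passed in explicitly so the hidden state can be allocated up front):

```python
import torch
from torch import nn

class LMModel5(nn.Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        self.h = self.h.to(x.device)              # keep the state on the same device as the input
        res, h = self.rnn(self.i_h(x), self.h)    # the loop now lives inside nn.RNN
        self.h = h.detach()
        return self.h_o(res)

    def reset(self): self.h.zero_()
```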
{
@ -1486,7 +1486,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that's disappointing... we are doing more poorly than the single-layer RNN from the end of last section. The reason is that we have a deeper model, leading to exploding or disappearing activations."
"Now that's disappointing... our previous single-layer RNN performed better. Why? The reason is that we have a deeper model, leading to exploding or vanishing activations."
]
},
{
@ -1500,31 +1500,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, creating accurate models from this kind of RNN is difficult. We will get better results if we call `detach` less often, and have more layers this gives our RNN a longer time horizon to learn from, and richer features to create. But it also means we have a deeper model to train. The key challenge in the development of deep learning has been figuring out how to train these kinds of models.\n",
"In practice, creating accurate models from this kind of RNN is difficult. We will get better results if we call `detach` less often, and have more layers—this gives our RNN a longer time horizon to learn from, and richer features to create. But it also means we have a deeper model to train. The key challenge in the development of deep learning has been figuring out how to train these kinds of models.\n",
"\n",
"The reason this is challenging is because of what happens when you multiply by a matrix many times. Think about what happens when you multiply by a number many times. For example, if you multiply by two, starting at one, you get the sequence 1, 2, 4, 8,… after 32 steps you are already at 4,294,967,296. A similar issue happens if we multiply by 0.5: we get 0.5, 0.25, 0.125… and after 32 steps it's 0.00000000023. As you can see, a number even slightly higher or lower than one results in an explosion or disappearance of our number, after just a few repeated multiplications.\n",
"The reason this is challenging is because of what happens when you multiply by a matrix many times. Think about what happens when you multiply by a number many times. For example, if you multiply by 2, starting at 1, you get the sequence 1, 2, 4, 8,... after 32 steps you are already at 4,294,967,296. A similar issue happens if you multiply by 0.5: you get 0.5, 0.25, 0.125… and after 32 steps it's 0.00000000023. As you can see, multiplying by a number even slightly higher or lower than 1 results in an explosion or disappearance of our starting number, after just a few repeated multiplications.\n",
"\n",
"Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And a deep neural network is just repeated matrix multiplications--each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large, or extremely small numbers.\n",
"Because matrix multiplication is just multiplying numbers and adding them up, exactly the same thing happens with repeated matrix multiplications. And that's all a deep neural network is --each extra layer is another matrix multiplication. This means that it is very easy for a deep neural network to end up with extremely large or extremely small numbers.\n",
"\n",
"This is a problem, because the way computers store numbers (known as \"floating point\") means that they become less and less accurate the further away the numbers get from zero. The diagram in <<float_prec>>, from the excellent article [What you never wanted to know about floating point but will be forced to find out](http://www.volkerschatz.com/science/float.html), shows how the precision of floating point numbers varies over the number line:"
"This is a problem, because the way computers store numbers (known as \"floating point\") means that they become less and less accurate the further away the numbers get from zero. The diagram in <<float_prec>>, from the excellent article [\"What You Never Wanted to Know About Floating Point but Will Be Forced to Find Out\"](http://www.volkerschatz.com/science/float.html), shows how the precision of floating-point numbers varies over the number line."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img alt=\"Precision of floating point numbers\" width=\"1000\" caption=\"Precision of floating point numbers\" id=\"float_prec\" src=\"images/fltscale.svg\">"
"<img alt=\"Precision of floating point numbers\" width=\"1000\" caption=\"Precision of floating-point numbers\" id=\"float_prec\" src=\"images/fltscale.svg\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This inaccuracy means that often the gradients calculated for updating the weights end up as zero or infinity for deep networks. This is commonly refered to as *vanishing gradients* or *exploding gradients*. That means that in SGD, the weights are updated either not at all, or jump to infinity. Either way, they won't improve with training.\n",
"This inaccuracy means that often the gradients calculated for updating the weights end up as zero or infinity for deep networks. This is commonly refered to as the *vanishing gradients* or *exploding gradients* problem. It means that in SGD, the weights are either not updated at all or jump to infinity. Either way, they won't improve with training.\n",
"\n",
"Researchers have developed a number of ways to tackle this problem, which we will be discussing later in the book. One way to tackle the problem is to change the definition of a layer in a way that makes it less likely to have exploding activations. We'll look at the details of how this is done in <<chapter_convolutions>>, when we discuss *batch normalization*, and <<chapter_resnet>>, when we discuss *ResNets*, although these details don't generally matter in practice (unless you are a researcher that is creating new approaches to solving this problem). Another way to deal with this is by being careful about *initialization*, which is a topic we'll investigate in <<chapter_foundations>>.\n",
"Researchers have developed a number of ways to tackle this problem, which we will be discussing later in the book. One option is to change the definition of a layer in a way that makes it less likely to have exploding activations. We'll look at the details of how this is done in <<chapter_convolutions>>, when we discuss batch normalization, and <<chapter_resnet>>, when we discuss ResNets, although these details don't generally matter in practice (unless you are a researcher that is creating new approaches to solving this problem). Another strategy for dealing with this is by being careful about initialization, which is a topic we'll investigate in <<chapter_foundations>>.\n",
"\n",
"For RNNs, there are two types of layers frequently used to avoid exploding activations, and they are: *gated recurrent units* (GRU), and *Long Short-Term Memory* (LSTM). Both of these are available in PyTorch, and are drop-in replacements for the RNN layer. We will only cover LSTMs in this book, there are plenty of good tutorials online explaining GRUs, which are a minor variant on the LSTM design."
"For RNNs, there are two types of layers that are frequently used to avoid exploding activations: *gated recurrent units* (GRUs) and *long short-term memory* (LSTM) layers. Both of these are available in PyTorch, and are drop-in replacements for the RNN layer. We will only cover LSTMs in this book; there are plenty of good tutorials online explaining GRUs, which are a minor variant on the LSTM design."
]
},
{
@ -1538,14 +1538,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"LSTM (for long short-term memory) is an architecture that was introduced back in 1997 by Jurgen Schmidhuber and Sepp Hochreiter. In this architecture, there are not one but two hidden states. In our base RNN, the hidden state is the output of the RNN at the previous time step. That hidden state is then responsible for doing two things at a time:\n",
"LSTM is an architecture that was introduced back in 1997 by Jürgen Schmidhuber and Sepp Hochreiter. In this architecture, there are not one but two hidden states. In our base RNN, the hidden state is the output of the RNN at the previous time step. That hidden state is then responsible for two things:\n",
"\n",
"- having the right information for the output layer to predict the correct next token\n",
"- retaining memory of everything that happened in the sentence\n",
"- Having the right information for the output layer to predict the correct next token\n",
"- Retaining memory of everything that happened in the sentence\n",
"\n",
"Consider, for example, the sentences \"Henry has a dog and he likes his dog very much\" and \"Sophie has a dog and she likes her dog very much\". It's very clear that the RNN needs to remember the name at the beginning of the sentence to be able to predict *he/she* or *his/her*. \n",
"Consider, for example, the sentences \"Henry has a dog and he likes his dog very much\" and \"Sophie has a dog and she likes her dog very much.\" It's very clear that the RNN needs to remember the name at the beginning of the sentence to be able to predict *he/she* or *his/her*. \n",
"\n",
"In practice, RNNs are really bad at retaining memory of what happened much earlier in the sentence, which is the motivation to have another hidden state (called cell state) in the LSTM. The cell state will be responsible for keeping *long short-term memory*, while the hidden state will focus on the next token to predict. Let's have a closer look and how this is achieved and build one LSTM from scratch."
"In practice, RNNs are really bad at retaining memory of what happened much earlier in the sentence, which is the motivation to have another hidden state (called *cell state*) in the LSTM. The cell state will be responsible for keeping *long short-term memory*, while the hidden state will focus on the next token to predict. Let's take a closer look and how this is achieved and build an LSTM from scratch."
]
},
{
@ -1559,7 +1559,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to build an LSTM, we first have to understand its architecture. <<lstm>> shows us its inner structure.\n",
"In order to build an LSTM, we first have to understand its architecture. <<lstm>> shows its inner structure.\n",
" \n",
"<img src=\"images/LSTM.png\" id=\"lstm\" caption=\"Architecture of an LSTM\" alt=\"A graph showing the inner architecture of an LSTM\" width=\"700\">"
]
@ -1568,22 +1568,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this picture, our input $x_{t}$ enters on the bottom with the previous hidden state ($h_{t-1}$) and cell state ($c_{t-1}$). The four orange boxes represent four layers with the activation being either sigmoid (for $\\sigma$) or tanh. tanh is just a sigmoid rescaled to the range -1 to 1. Its mathematical expression can be written like this:\n",
"In this picture, our input $x_{t}$ enters on the left with the previous hidden state ($h_{t-1}$) and cell state ($c_{t-1}$). The four orange boxes represent four layers (our neural nets) with the activation being either sigmoid ($\\sigma$) or tanh. tanh is just a sigmoid function rescaled to the range -1 to 1. Its mathematical expression can be written like this:\n",
"\n",
"$$\\tanh(x) = \\frac{e^{x} + e^{-x}}{e^{x}-e^{-x}} = 2 \\sigma(2x) - 1$$\n",
"\n",
"where $\\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the right, ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.\n",
"where $\\sigma$ is the sigmoid function. The green circles are elementwise operations. What goes out on the right is the new hidden state ($h_{t}$) and new cell state ($c_{t}$), ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.\n",
"\n",
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram, but before this, notice how very little the cell state (on the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram--but before this, notice how very little the cell state (at the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
"\n",
"First, the arrows for input and old hidden state are joined together. In the RNN we wrote before in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`, thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.\n",
"First, the arrows for input and old hidden state are joined together. In the RNN we wrote earlier in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`; thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.\n",
"\n",
"The first gate (looking from the left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will have scalars between 0 and 1. We multiply this result by the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the ability to the LSTM to forget things about its longterm state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
"The first gate (looking from left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
"\n",
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it just decides which element of the cell state to update (values close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
"\n",
"The last gate is the *output gate*. It will decides which information take in the cell state to generate the output. The cell state goes through a tanh before this and the output gate combined with the sigmoid decides which values to take inside it.\n",
"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance, we may see a new gender pronoun, in which case we'll need to replace the information about gender that the forget gate removed. Like the forget gate, the input gate ends up on a product, so it just decides which element of the cell state to update (values close to 1) or not (values close to 0). The third gate determines the updated values, with elements between -1 and 1 (thanks to the tanh function). The result is then added to the cell state.\n",
"\n",
"The last gate is the *output gate*. Similar to the forget gate, it decides which element of the cell state to keep for our prediction at this timestep (values close to 1) and which to discard (values close to 0). The cell state is applied the tanh function before being used.\n",
"\n",
"In terms of code, we can write the same steps like this:"
]
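A gate-by-gate sketch of those steps (illustrative names; the refactored version discussed next merges the four linear layers into fewer, larger ones):

```python
import torch
from torch import nn

class LSTMCell(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, input, state):
        h, c = state
        h = torch.cat([h, input], dim=1)               # stack hidden state and input
        forget = torch.sigmoid(self.forget_gate(h))
        c = c * forget                                 # forget part of the cell state
        inp  = torch.sigmoid(self.input_gate(h))
        cell = torch.tanh(self.cell_gate(h))
        c = c + inp * cell                             # write new information in
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)                        # produce the new hidden state
        return h, (h, c)
```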
@ -1618,7 +1617,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, we can then refactor the code. Also, in terms of performance, it's better to do one big matrix multiplication than four smaller ones (that's because we only launch the special fast kernel on GPU once, and it gives the GPU more work to do in parallel). The stacking takes a bit of time (since we have to move one of the tensors around on the GPU to have it all in a contiguous array), so we use two separate layers for the input and the hidden state. The optimized and refactored code then looks like that:"
"In practice, we can then refactor the code. Also, in terms of performance, it's better to do one big matrix multiplication than four smaller ones (that's because we only launch the special fast kernel on the GPU once, and it gives the GPU more work to do in parallel). The stacking takes a bit of time (since we have to move one of the tensors around on the GPU to have it all in a contiguous array), so we use two separate layers for the input and the hidden state. The optimized and refactored code then looks like this:"
]
},
{
@ -1634,7 +1633,7 @@
"\n",
" def forward(self, input, state):\n",
" h,c = state\n",
" #One big multiplication for all the gates is better than 4 smaller ones\n",
" # One big multiplication for all the gates is better than 4 smaller ones\n",
" gates = (self.ih(input) + self.hh(h)).chunk(4, 1)\n",
" ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])\n",
" cellgate = gates[3].tanh()\n",
@ -1648,7 +1647,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we use the PyTorch `chunk` method to split our tensor into 4 pieces, e.g.:"
"Here we use the PyTorch `chunk` method to split our tensor into four pieces. It works like this:"
]
},
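For example, on a toy tensor:

```python
import torch

t = torch.arange(0, 10)
t.chunk(2)   # -> (tensor([0, 1, 2, 3, 4]), tensor([5, 6, 7, 8, 9]))
```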
{
@ -1880,7 +1879,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that's better than a multilayer RNN! We can still see there is a bit of overfitting, which is a sign that a bit of regularization might help."
"Now that's better than a multilayer RNN! We can still see there is a bit of overfitting, however, which is a sign that a bit of regularization might help."
]
},
{
@ -1894,9 +1893,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTMs (or GRUs) cells make training easier than vanilla RNNs, but there are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases, it requires another model to generate random augmentation (by translating in another language and back to the language used for instance). Overall, data augmentation for text data is currently not a well explored space.\n",
"Recurrent neural networks, in general, are hard to train, because of the problem of vanishing activations and gradients we saw before. Using LSTM (or GRU) cells makes training easier than with vanilla RNNs, but they are still very prone to overfitting. Data augmentation, while a possibility, is less often used for text data than for images because in most cases it requires another model to generate random augmentations (e.g., by translating the text into another language and then back into the original language). Overall, data augmentation for text data is currently not a well-explored space.\n",
"\n",
"However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182). This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state of the art results that previously required much more complicated models. They called an LSTM using these techniques an *AWD LSTM*. We'll look at each of these techniques in turn."
"However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper [\"Regularizing and Optimizing LSTM Language Models\"](https://arxiv.org/abs/1708.02182) by Stephen Merity, Nitish Shirish Keskar, and Richard Socher. This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state-of-the-art results that previously required much more complicated models. The authors called an LSTM using these techniques an *AWD-LSTM*. We'll look at each of these techniques in turn."
]
},
{
@ -1910,33 +1909,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Dropout is a regularization technique that was introduce by Geoffrey Hinton et al. in [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). The basic idea is to randomly change some activations to zero at training time. This makes sure all neurons actively work toward the output as seen in <<img_dropout>> which is a screenshot from the original paper.\n",
"Dropout is a regularization technique that was introduced by Geoffrey Hinton et al. in [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). The basic idea is to randomly change some activations to zero at training time. This makes sure all neurons actively work toward the output, as seen in <<img_dropout>> (from \"Dropout: A Simple Way to Prevent Neural Networks from Overfitting\" by Nitish Srivastava et al.).\n",
"\n",
"<img src=\"images/Dropout1.png\" alt=\"A figure from the article showing how neurons go off with dropout\" width=\"800\" id=\"img_dropout\" caption=\"A screenshot from the dropout paper\">\n",
"<img src=\"images/Dropout1.png\" alt=\"A figure from the article showing how neurons go off with dropout\" width=\"800\" id=\"img_dropout\" caption=\"Applying dropout in a neural network (courtesy of Nitish Srivastava et al.)\">\n",
"\n",
"Hinton used a nice metaphor when he explained, in an interview, the inspiration for dropout:\n",
"\n",
"> : \"I went to my bank. The tellers kept changing and I asked one of them why. He said he didnt know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting\"\n",
"> : I went to my bank. The tellers kept changing and I asked one of them why. He said he didnt know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.\n",
"\n",
"In the same interview, he also explained that neuroscience provided additional inspiration:\n",
"\n",
"> : \"We don't really know why neurons spike. One theory is that they want to be noisy so as to regularize, because we have many more parameters than we have data points. The idea of dropout is that if you have noisy activations, you can afford to use a much bigger model.\""
"> : We don't really know why neurons spike. One theory is that they want to be noisy so as to regularize, because we have many more parameters than we have data points. The idea of dropout is that if you have noisy activations, you can afford to use a much bigger model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This explains the idea behind why dropout helps to generalize: first, it helps the neurons to cooperate better together, then it makes the activations more noisy, thus making the model more robust."
"This explains the idea behind why dropout helps to generalize: first it helps the neurons to cooperate better together, then it makes the activations more noisy, thus making the model more robust."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see there that if we just zero those activations without doing anything else, our model will have problems to train: if we go from the sum of 5 activations (that are all positive numbers since we apply a ReLU) to just 2, this won't have the same scale. Therefore if we dropout with a probability `p`, we rescale all activation by dividing them by `1-p` (on average `p` will be zeroed, so it leaves `1-p`), as shown in <<img_dropout1>> which is a diagram from the original paper.\n",
"We can see, however, that if we were to just zero those activations without doing anything else, our model would have problems training: if we go from the sum of five activations (that are all positive numbers since we apply a ReLU) to just two, this won't have the same scale. Therefore, if we apply dropout with a probability `p`, we rescale all activations by dividing them by `1-p` (on average `p` will be zeroed, so it leaves `1-p`), as shown in <<img_dropout1>>.\n",
"\n",
"<img src=\"images/Dropout.png\" alt=\"A figure from the article introducing dropout showing how a neuron is on/off\" width=\"600\" id=\"img_dropout1\" cpation=\"Why scale the activations when applying dropout\">\n",
"<img src=\"images/Dropout.png\" alt=\"A figure from the article introducing dropout showing how a neuron is on/off\" width=\"600\" id=\"img_dropout1\" caption=\"Why scale the activations when applying dropout (courtesy of Nitish Srivastava et al.)\">\n",
"\n",
"This is a full implementation of the dropout layer in PyTorch (although PyTorch's native layer is actually written in C, not Python):"
]
@ -1959,32 +1958,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `bernoulli_` method is creating a tensor with random zeros (with probability p) and ones (with probability 1-p), which is then multiplied with our input before dividing by `1-p`. Note the use of the `training` attribute, which is available in any PyTorch `nn.Module`, and tells us if we are doing training or inference.\n",
"The `bernoulli_` method is creating a tensor of random zeros (with probability `p`) and ones (with probability `1-p`), which is then multiplied with our input before dividing by `1-p`. Note the use of the `training` attribute, which is available in any PyTorch `nn.Module`, and tells us if we are doing training or inference.\n",
"\n",
"> note: In previous chapters of the book we'd be adding a code example for `bernoulli_` here, so you can see exactly how it works. But now that you know enough to do this yourself, we're going to be doing fewer and fewer examples for you, and instead expecting you to do your own experiments to see how things work. In this case, you'll see in the end-of-chapter questionnaire that we're asking you to experiment with `bernoulli_`--but don't wait for us to ask you to experiment to develop your understanding of the code we're studying, go ahead and do it anyway!\n",
"> note: Do Your Own Experiments: In previous chapters of the book we'd be adding a code example for `bernoulli_` here, so you can see exactly how it works. But now that you know enough to do this yourself, we're going to be doing fewer and fewer examples for you, and instead expecting you to do your own experiments to see how things work. In this case, you'll see in the end-of-chapter questionnaire that we're asking you to experiment with `bernoulli_`--but don't wait for us to ask you to experiment to develop your understanding of the code we're studying; go ahead and do it anyway!\n",
"\n",
"Using dropout before passing the output of our LSTM to the final layer will help reduce overfitting. Dropout is also used in many other models, including the default CNN head used in `fastai.vision`, and is also available in `fastai.tabular` by passing the `ps` parameter (where each \"p\" is passed to each added `Dropout` layer), as we'll see in <<chapter_arch_details>>."
"Using dropout before passing the output of our LSTM to the final layer will help reduce overfitting. Dropout is also used in many other models, including the default CNN head used in `fastai.vision`, and is available in `fastai.tabular` by passing the `ps` parameter (where each \"p\" is passed to each added `Dropout` layer), as we'll see in <<chapter_arch_details>>."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dropout has a different behavior in training and validation mode, which we achieved using the `training` attribute in `Dropout` above. Calling the `train()` method on a `Module` sets `training` to `True` (both for the module you call the method on, and for every module it recursively contains), and `eval()` sets it to `False`. This is done automatically when calling the methods of `Learner`, but if you are not using that class, remember to switch from one to the other as needed."
"Dropout has different behavior in training and validation mode, which we specified using the `training` attribute in `Dropout`. Calling the `train` method on a `Module` sets `training` to `True` (both for the module you call the method on and for every module it recursively contains), and `eval` sets it to `False`. This is done automatically when calling the methods of `Learner`, but if you are not using that class, remember to switch from one to the other as needed."
]
},
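{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are not using `Learner` and write your own loop, a minimal sketch of switching between the two modes (here `model`, `xb`, `yb`, and `loss_func` are just placeholders for your own module, batch, and loss function) could look like this:\n",
"\n",
"``` python\n",
"model.train()               # sets training=True on the model and every submodule\n",
"loss = loss_func(model(xb), yb)\n",
"loss.backward()\n",
"\n",
"model.eval()                # sets training=False, so Dropout just returns its input\n",
"with torch.no_grad():       # no gradients needed for validation or inference\n",
"    preds = model(xb)\n",
"```"
]
},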
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### AR and TAR Regularization"
"### Activation Regularization and Temporal Activation Regularization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"AR (for *activation regularization*) and TAR (for *temporal activation regularization*) are two regularization methods very similar to weight decay. When applying weight decay, we add a small penalty to the loss that aims at making the weights as small as possible. For the activation regularization, it's the final activations produced by the LSTM that we will try to make as small as possible, instead of the weights.\n",
"*Activation regularization* (AR) and *temporal activation regularization* (TAR) are two regularization methods very similar to weight decay, discussed in <<chapter_collab>>. When applying weight decay, we add a small penalty to the loss that aims at making the weights as small as possible. For activation regularization, it's the final activations produced by the LSTM that we will try to make as small as possible, instead of the weights.\n",
"\n",
"To regularize the final activations, we have to store those somewhere, then add the means of the squares of them to the loss (along with a multiplier `alpha`, which is just like `wd` for weight decay):\n",
"\n",
@ -1997,13 +1996,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Temporal activation regularization is linked to the fact we are predicting tokens in a sentence. That means it's likely that the outputs of our LSTMs should somewhat make sense when we read them in order. TAR is there to encourage that behavior by adding a penalty to the loss to make the difference between two consecutive activations as small as possible: our activations tensor has a shape `bs x sl x n_hid`, and we read consecutive activation on the sequence length axis (so the dimension in the middle). With this, TAR can be expressed as:\n",
"Temporal activation regularization is linked to the fact we are predicting tokens in a sentence. That means it's likely that the outputs of our LSTMs should somewhat make sense when we read them in order. TAR is there to encourage that behavior by adding a penalty to the loss to make the difference between two consecutive activations as small as possible: our activations tensor has a shape `bs x sl x n_hid`, and we read consecutive activations on the sequence length axis (the dimension in the middle). With this, TAR can be expressed as:\n",
"\n",
"``` python\n",
"loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()\n",
"```\n",
"\n",
"`alpha` and `beta` are then two hyper-parameters to tune. To make this work, we need our model with dropout to return three things: the proper output, the activations of the LSTM pre-dropout and the activations of the LSTM post-dropout. AR is often applied on the dropped out activations (to not penalize the activations we turned in 0s afterward) while TAR is applied on the non-dropped out activations (because those 0s create big differences between two consecutive timesteps). There is then a callback called `RNNRegularizer` that will apply this regularization for us."
"`alpha` and `beta` are then two hyperparameters to tune. To make this work, we need our model with dropout to return three things: the proper output, the activations of the LSTM pre-dropout, and the activations of the LSTM post-dropout. AR is often applied on the dropped-out activations (to not penalize the activations we turned in zeros afterward) while TAR is applied on the non-dropped-out activations (because those zeros create big differences between two consecutive time steps). There is then a callback called `RNNRegularizer` that will apply this regularization for us."
]
},
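{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two penalties concrete, here is a sketch of the extra loss terms, just an illustration of what `RNNRegularizer` computes for us rather than its actual source; it assumes the model returns the predictions `out` along with `raw_h` and `drop_h` (the LSTM activations before and after dropout, each of shape `bs x sl x n_hid`), and that `targets` holds the target token IDs:\n",
"\n",
"``` python\n",
"loss = F.cross_entropy(out.view(-1, out.shape[-1]), targets.view(-1))\n",
"loss += alpha * drop_h.pow(2).mean()                        # AR: keep activations small\n",
"loss += beta * (raw_h[:,1:] - raw_h[:,:-1]).pow(2).mean()   # TAR: keep consecutive time steps close\n",
"```"
]
},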
{
@ -2017,9 +2016,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can combine dropout (applied before we go in our output layer) with the AR and TAR regularization to train our previous LSTM. We just need to return three things instead of one: the normal output of our LSTM, the dropped-out activations and the activations from our LSTMs. Those last two will be picked up by the callback `RNNRegularization` for the contributions it has to make to the loss.\n",
"We can combine dropout (applied before we go into our output layer) with AR and TAR to train our previous LSTM. We just need to return three things instead of one: the normal output of our LSTM, the dropped-out activations, and the activations from our LSTMs. The last two will be picked up by the callback `RNNRegularization` for the contributions it has to make to the loss.\n",
"\n",
"Another useful trick we can add from the AWD LSTM paper is *weight tying*. In a language model, the input embeddings represent a mapping from English words to activations, and the output hidden layer represents a mapping from activations to English words. We might expect, intuitively, that these mappings could be the same. We can represent this in PyTorch by assigning the same weight matrix to each of these layers:\n",
"Another useful trick we can add from [the AWD LSTM paper](https://arxiv.org/abs/1708.02182) is *weight tying*. In a language model, the input embeddings represent a mapping from English words to activations, and the output hidden layer represents a mapping from activations to English words. We might expect, intuitively, that these mappings could be the same. We can represent this in PyTorch by assigning the same weight matrix to each of these layers:\n",
"\n",
" self.h_o.weight = self.i_h.weight\n",
"\n",
@ -2073,7 +2072,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A `TextLearner` automatically adds those two callbacks for us (with default for `alpha` and `beta` as above) so we can simplify the line above to:"
"A `TextLearner` automatically adds those two callbacks for us (with those values for `alpha` and `beta` as defaults), so we can simplify the preceding line to:"
]
},
{
@ -2250,16 +2249,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You have now seen everything that is inside the AWD-LSTM architecture we used in text classification in <<chapter_nlp>>. It uses dropouts in a lot more places:\n",
"You have now seen everything that is inside the AWD-LSTM architecture we used in text classification in <<chapter_nlp>>. It uses dropout in a lot more places:\n",
"\n",
"- embedding dropout (just after the embedding layer)\n",
"- input dropout (after the embedding layer)\n",
"- weight dropout (applied to the weights of the LSTM at each training step)\n",
"- hidden dropout (applied to the hidden state between two layers)\n",
"- Embedding dropout (just after the embedding layer)\n",
"- Input dropout (after the embedding layer)\n",
"- Weight dropout (applied to the weights of the LSTM at each training step)\n",
"- Hidden dropout (applied to the hidden state between two layers)\n",
"\n",
"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
"This makes it even more regularized. Since fine-tuning those five dropout values (including the dropout before the output layer) is complicated, we have determined good defaults and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw in that chapter (which is multiplied by each dropout).\n",
"\n",
"Another architecture that is very powerful, especially in \"sequence to sequence\" problems (that is, problems where the dependent variable is itself a variable length sequence, such as language translation), is the Transformers architecture. You can find it in an online bonus chapter on the book website."
"Another architecture that is very powerful, especially in \"sequence-to-sequence\" problems (that is, problems where the dependent variable is itself a variable-length sequence, such as language translation), is the Transformers architecture. You can find it in a bonus chapter on the [book's website](https://book.fast.ai/)."
]
},
{
@ -2275,16 +2274,16 @@
"source": [
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make to ou model?\n",
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
"1. Write a module that predicts the third word given the previous two words of a sentence, without peeking.\n",
"1. What is a recurrent neural network?\n",
"1. What is hidden state?\n",
"1. What is \"hidden state\"?\n",
"1. What is the equivalent of hidden state in ` LMModel1`?\n",
"1. To maintain the state in an RNN why is it important to pass the text to the model in order?\n",
"1. What is an unrolled representation of an RNN?\n",
"1. To maintain the state in an RNN, why is it important to pass the text to the model in order?\n",
"1. What is an \"unrolled\" representation of an RNN?\n",
"1. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?\n",
"1. What is BPTT?\n",
"1. What is \"BPTT\"?\n",
"1. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <<chapter_nlp>>.\n",
"1. What does the `ModelReseter` callback do? Why do we need it?\n",
"1. What are the downsides of predicting just one output word for each three input words?\n",
@ -2294,23 +2293,23 @@
"1. Draw a representation of a stacked (multilayer) RNN.\n",
"1. Why should we get better results in an RNN if we call `detach` less often? Why might this not happen in practice with a simple RNN?\n",
"1. Why can a deep network result in very large or very small activations? Why does this matter?\n",
"1. In a computer's floating point representation of numbers, which numbers are the most precise?\n",
"1. In a computer's floating-point representation of numbers, which numbers are the most precise?\n",
"1. Why do vanishing gradients prevent training?\n",
"1. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?\n",
"1. What are these two states called in an LSTM?\n",
"1. What is tanh, and how is it related to sigmoid?\n",
"1. What is the purpose of this code in `LSTMCell`?: `h = torch.stack([h, input], dim=1)`\n",
"1. What does `chunk` to in PyTorch?\n",
"1. What is the purpose of this code in `LSTMCell`: `h = torch.stack([h, input], dim=1)`\n",
"1. What does `chunk` do in PyTorch?\n",
"1. Study the refactored version of `LSTMCell` carefully to ensure you understand how and why it does the same thing as the non-refactored version.\n",
"1. Why can we use a higher learning rate for `LMModel6`?\n",
"1. What are the three regularisation techniques used in an AWD-LSTM model?\n",
"1. What is dropout?\n",
"1. What are the three regularization techniques used in an AWD-LSTM model?\n",
"1. What is \"dropout\"?\n",
"1. Why do we scale the weights with dropout? Is this applied during training, inference, or both?\n",
"1. What is the purpose of this line from `Dropout`?: `if not self.training: return x`\n",
"1. What is the purpose of this line from `Dropout`: `if not self.training: return x`\n",
"1. Experiment with `bernoulli_` to understand how it works.\n",
"1. How do you set your model in training mode in PyTorch? In evaluation mode?\n",
"1. Write the equation for activation regularization (in maths or code, as you prefer). How is it different to weight decay?\n",
"1. Write the equation for temporal activation regularization (in maths or code, as you prefer). Why wouldn't we use this for computer vision problems?\n",
"1. Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?\n",
"1. Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?\n",
"1. What is \"weight tying\" in a language model?"
]
},
@ -2325,10 +2324,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. In ` LMModel2` why can `forward` start with `h=0`? Why don't we need to say `h=torch.zeros()`?\n",
"1. Write the code for an LSTM from scratch (but you may refer to <<lstm>>).\n",
"1. Search on the Internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get the similar results as we saw in this chapter. Compare it to the results of PyTorch's built in GRU module.\n",
"1. Have a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter."
"1. In ` LMModel2`, why can `forward` start with `h=0`? Why don't we need to say `h=torch.zeros(...)`?\n",
"1. Write the code for an LSTM from scratch (you may refer to <<lstm>>).\n",
"1. Search the internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get results similar to those we saw in this chapter. Compare you results to the results of PyTorch's built in `GRU` module.\n",
"1. Take a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter."
]
},
{

View File

@ -412,14 +412,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Putting Our Texts Into Batches for a Language Model"
"### Putting Our Texts into Batches for a Language Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
"hide_input": false
},
"outputs": [
{
@ -541,7 +541,6 @@
}
],
"source": [
"#hide\n",
"stream = \"In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\"\n",
"tokens = tkn(stream)\n",
"bs,seq_len = 6,15\n",
@ -919,7 +918,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine Tuning the Language Model"
"### Fine-Tuning the Language Model"
]
},
{
@ -1305,7 +1304,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine Tuning the Classifier"
"### Fine-Tuning the Classifier"
]
},
{
@ -1507,28 +1506,28 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is self-supervised learning?\n",
"1. What is a language model?\n",
"1. Why is a language model considered self-supervised learning?\n",
"1. What is \"self-supervised learning\"?\n",
"1. What is a \"language model\"?\n",
"1. Why is a language model considered self-supervised?\n",
"1. What are self-supervised models usually used for?\n",
"1. Why do we fine-tune language models?\n",
"1. What are the three steps to create a state-of-the-art text classifier?\n",
"1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?\n",
"1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?\n",
"1. What are the three steps to prepare your data for a language model?\n",
"1. What is tokenization? Why do we need it?\n",
"1. What is \"tokenization\"? Why do we need it?\n",
"1. Name three different approaches to tokenization.\n",
"1. What is 'xxbos'?\n",
"1. List 4 rules that fastai applies to text during tokenization.\n",
"1. Why are repeated characters replaced with a token showing the number of repetitions, and the character that's repeated?\n",
"1. What is numericalization?\n",
"1. What is `xxbos`?\n",
"1. List four rules that fastai applies to text during tokenization.\n",
"1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?\n",
"1. What is \"numericalization\"?\n",
"1. Why might there be words that are replaced with the \"unknown word\" token?\n",
"1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)\n",
"1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)\n",
"1. Why do we need padding for text classification? Why don't we need it for language modeling?\n",
"1. What does an embedding matrix for NLP contain? What is its shape?\n",
"1. What is perplexity?\n",
"1. What is \"perplexity\"?\n",
"1. Why do we have to pass the vocabulary of the language model to the classifier data block?\n",
"1. What is gradual unfreezing?\n",
"1. Why is text generation always likely to be ahead of automatic identification of machine generated texts?"
"1. What is \"gradual unfreezing\"?\n",
"1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?"
]
},
{
@ -1542,9 +1541,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. See what you can learn about language models and disinformation. What are the best language models today? Have a look at some of their outputs. Do you find them convincing? How could a bad actor best use this to create conflict and uncertainty?\n",
"1. Given the limitation that models are unlikely to be able to consistently recognise machine generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leveraged deep learning?"
"1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?\n",
"1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Munging With fastai's mid-Level API"
"# Data Munging with fastai's Mid-Level API"
]
},
{
@ -599,7 +599,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying the mid-Tier Data API: SiamesePair"
"## Applying the Mid-Level Data API: SiamesePair"
]
},
{
@ -815,17 +815,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Why do we say that fastai has a layered API? What does it mean?\n",
"1. Why does a `Transform` have a decode method? What does it do?\n",
"1. Why does a `Transform` have a setup method? What does it do?\n",
"1. Why do we say that fastai has a \"layered\" API? What does it mean?\n",
"1. Why does a `Transform` have a `decode` method? What does it do?\n",
"1. Why does a `Transform` have a `setup` method? What does it do?\n",
"1. How does a `Transform` work when called on a tuple?\n",
"1. Which methods do you need to implement when writing your own `Transform`?\n",
"1. Write a `Normalize` transform that fully normalizes items (substract the mean and divide by the standard deviation of the dataset), and that can decode that behavior. Try not to peak!\n",
"1. Write a `Transform` that does the numericalization of tokenized texts (it should set its vocab automatically from the dataset seen and have a decode method). Look at the source code of fastai if you need help.\n",
"1. Write a `Normalize` transform that fully normalizes items (subtract the mean and divide by the standard deviation of the dataset), and that can decode that behavior. Try not to peek!\n",
"1. Write a `Transform` that does the numericalization of tokenized texts (it should set its vocab automatically from the dataset seen and have a `decode` method). Look at the source code of fastai if you need help.\n",
"1. What is a `Pipeline`?\n",
"1. What is a `TfmdLists`? \n",
"1. What is a `Datasets`? How is it different from `TfmdLists`?\n",
"1. Why are `TfmdLists` and `Datasets` named with an s?\n",
"1. What is a `Datasets`? How is it different from a `TfmdLists`?\n",
"1. Why are `TfmdLists` and `Datasets` named with an \"s\"?\n",
"1. How can you build a `DataLoaders` from a `TfmdLists` or a `Datasets`?\n",
"1. How do you pass `item_tfms` and `batch_tfms` when building a `DataLoaders` from a `TfmdLists` or a `Datasets`?\n",
"1. What do you need to do when you want to have your custom items work with methods like `show_batch` or `show_results`?\n",
@ -843,8 +843,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Use the mid-level API to grab the data on the pets dataset. On the adult dataset (used in chapter 1).\n",
"1. Look at the siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new type of items. Implement it on your own project."
"1. Use the mid-level API to prepare the data in `DataLoaders` on the pets dataset. On the adult dataset (used in chapter 1).\n",
"1. Look at the Siamese tutorial in the fastai documentation to learn how to customize the behavior of `show_batch` and `show_results` for new type of items. Implement it in your own project."
]
},
{
@ -858,11 +858,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations you've completed all of the chapters in this book which cover the key practical parts of training and using deep learning! You know how to use all of fastai's built in applications, and how to customise them using the data blocks API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to help make sure your creations help improve society too.)\n",
"Congratulations—you've completed all of the chapters in this book that cover the key practical parts of training models and using deep learning! You know how to use all of fastai's built-in applications, and how to customize them using the data block API and loss functions. You even know how to create a neural network from scratch, and train it! (And hopefully you now know some of the questions to ask to make sure your creations help improve society too.)\n",
"\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network application. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system which best handles these capabilities and limitations.\n",
"The knowledge you already have is enough to create full working prototypes of many types of neural network application. More importantly, it will help you understand the capabilities and limitations of deep learning models, and how to design a system that's well adapted to them.\n",
"\n",
"In the rest of this book we will be pulling apart these applications, piece by piece, to understand all of the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is the knowledge which allows you to inspect and debug models that you build, and to create new applications which are customised for your particular projects."
"In the rest of this book we will be pulling apart those applications, piece by piece, to understand the foundations they are built on. This is important knowledge for a deep learning practitioner, because it is what allows you to inspect and debug models that you build and create new applications that are customized for your particular projects."
]
},
{

View File

@ -860,7 +860,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Model"
"### The Model"
]
},
{
@ -1086,7 +1086,7 @@
"\n",
" def forward(self, input, state):\n",
" h,c = state\n",
" #One big multiplication for all the gates is better than 4 smaller ones\n",
" # One big multiplication for all the gates is better than 4 smaller ones\n",
" gates = (self.ih(input) + self.hh(h)).chunk(4, 1)\n",
" ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])\n",
" cellgate = gates[3].tanh()\n",
@ -1339,7 +1339,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### AR and TAR Regularization"
"### Activation Regularization and Temporal Activation Regularization"
]
},
{
@ -1554,16 +1554,16 @@
"source": [
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make to ou model?\n",
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
"1. Write a module that predicts the third word given the previous two words of a sentence, without peeking.\n",
"1. What is a recurrent neural network?\n",
"1. What is hidden state?\n",
"1. What is \"hidden state\"?\n",
"1. What is the equivalent of hidden state in ` LMModel1`?\n",
"1. To maintain the state in an RNN why is it important to pass the text to the model in order?\n",
"1. What is an unrolled representation of an RNN?\n",
"1. To maintain the state in an RNN, why is it important to pass the text to the model in order?\n",
"1. What is an \"unrolled\" representation of an RNN?\n",
"1. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?\n",
"1. What is BPTT?\n",
"1. What is \"BPTT\"?\n",
"1. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <<chapter_nlp>>.\n",
"1. What does the `ModelReseter` callback do? Why do we need it?\n",
"1. What are the downsides of predicting just one output word for each three input words?\n",
@ -1573,23 +1573,23 @@
"1. Draw a representation of a stacked (multilayer) RNN.\n",
"1. Why should we get better results in an RNN if we call `detach` less often? Why might this not happen in practice with a simple RNN?\n",
"1. Why can a deep network result in very large or very small activations? Why does this matter?\n",
"1. In a computer's floating point representation of numbers, which numbers are the most precise?\n",
"1. In a computer's floating-point representation of numbers, which numbers are the most precise?\n",
"1. Why do vanishing gradients prevent training?\n",
"1. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?\n",
"1. What are these two states called in an LSTM?\n",
"1. What is tanh, and how is it related to sigmoid?\n",
"1. What is the purpose of this code in `LSTMCell`?: `h = torch.stack([h, input], dim=1)`\n",
"1. What does `chunk` to in PyTorch?\n",
"1. What is the purpose of this code in `LSTMCell`: `h = torch.stack([h, input], dim=1)`\n",
"1. What does `chunk` do in PyTorch?\n",
"1. Study the refactored version of `LSTMCell` carefully to ensure you understand how and why it does the same thing as the non-refactored version.\n",
"1. Why can we use a higher learning rate for `LMModel6`?\n",
"1. What are the three regularisation techniques used in an AWD-LSTM model?\n",
"1. What is dropout?\n",
"1. What are the three regularization techniques used in an AWD-LSTM model?\n",
"1. What is \"dropout\"?\n",
"1. Why do we scale the weights with dropout? Is this applied during training, inference, or both?\n",
"1. What is the purpose of this line from `Dropout`?: `if not self.training: return x`\n",
"1. What is the purpose of this line from `Dropout`: `if not self.training: return x`\n",
"1. Experiment with `bernoulli_` to understand how it works.\n",
"1. How do you set your model in training mode in PyTorch? In evaluation mode?\n",
"1. Write the equation for activation regularization (in maths or code, as you prefer). How is it different to weight decay?\n",
"1. Write the equation for temporal activation regularization (in maths or code, as you prefer). Why wouldn't we use this for computer vision problems?\n",
"1. Write the equation for activation regularization (in math or code, as you prefer). How is it different from weight decay?\n",
"1. Write the equation for temporal activation regularization (in math or code, as you prefer). Why wouldn't we use this for computer vision problems?\n",
"1. What is \"weight tying\" in a language model?"
]
},
@ -1604,10 +1604,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. In ` LMModel2` why can `forward` start with `h=0`? Why don't we need to say `h=torch.zeros()`?\n",
"1. Write the code for an LSTM from scratch (but you may refer to <<lstm>>).\n",
"1. Search on the Internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get the similar results as we saw in this chapter. Compare it to the results of PyTorch's built in GRU module.\n",
"1. Have a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter."
"1. In ` LMModel2`, why can `forward` start with `h=0`? Why don't we need to say `h=torch.zeros(...)`?\n",
"1. Write the code for an LSTM from scratch (you may refer to <<lstm>>).\n",
"1. Search the internet for the GRU architecture and implement it from scratch, and try training a model. See if you can get results similar to those we saw in this chapter. Compare you results to the results of PyTorch's built in `GRU` module.\n",
"1. Take a look at the source code for AWD-LSTM in fastai, and try to map each of the lines of code to the concepts shown in this chapter."
]
},
{