Updated some minor typos in '10_nlp.ipynb'
parent 62ac21d085
commit f3a0ea536d
10_nlp.ipynb (16 changed lines)
@@ -47,13 +47,13 @@
 "source": [
 "The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better. The Wikipedia English is slightly different from the IMDb English, so instead of jumping directly to the classifier, we could fine-tune our pretrained language model to the IMDb corpus and then use *that* as the base for our classifier.\n",
 "\n",
-"Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targeting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of the IMDb datset, there will be lots of names of movie directors and actors, and often a less formal style of language than that seen in Wikipedia.\n",
+"Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targeting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of the IMDb dataset, there will be lots of names of movie directors and actors, and often a less formal style of language than that seen in Wikipedia.\n",
 "\n",
 "We already saw that with fastai, we can download a pretrained English language model and use it to get state-of-the-art results for NLP classification. (We expect pretrained models in many more languages to be available soon—they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?\n",
 "\n",
 "One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine-tune the (sequence-based) language model prior to fine-tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. Since there are 25,000 labeled reviews in the training set and 25,000 in the validation set, that makes 100,000 movie reviews altogether. We can use all of these reviews to fine-tune the pretrained language model, which was trained only on Wikipedia articles; this will result in a language model that is particularly good at predicting the next word of a movie review.\n",
 "\n",
-"This is known as the Universal Language Model Fine-tuning (ULMFit) approqch. The [paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of fine-tuning of the language model, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarized in <<ulmfit_process>>."
+"This is known as the Universal Language Model Fine-tuning (ULMFit) approach. The [paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of fine-tuning of the language model, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarized in <<ulmfit_process>>."
 ]
 },
 {
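The three ULMFiT stages described in the hunk above map onto a short fastai recipe. A minimal sketch, assuming fastai v2's text API and the standard IMDb download; the epoch counts and learning rates are illustrative placeholders, not the chapter's exact schedule:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Stage 1 is already done for us: AWD_LSTM ships pretrained on Wikipedia.
# Stage 2: fine-tune that language model on the IMDb corpus itself.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1, 2e-2)
learn_lm.save_encoder('finetuned')  # keep the model body for stage 3

# Stage 3: reuse the fine-tuned encoder as the base of a sentiment classifier.
dls_clas = TextDataLoaders.from_folder(path, valid='test',
                                       text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned')
learn_clas.fine_tune(1, 2e-2)
```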
@@ -1269,7 +1269,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As we saw at the beginning of this chapter, thare are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.\n",
+"As we saw at the beginning of this chapter, there are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.\n",
 "\n",
 "As usual, let's start with assembling our data."
 ]
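The "assembling our data" step that the hunk above leads into is done with a `DataBlock`. A sketch of its shape, following the chapter's approach and assuming the same IMDb `path` as in the earlier sketch; `bs` and `seq_len` are typical values, adjust for your GPU:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Language-model data: treat every review (train, test, and the 50,000
# unlabeled ones) as one big unlabeled corpus, with a random 10% held out.
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
```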
@@ -1310,7 +1310,7 @@
 "source": [
 "One thing that's different to previous types we've used in `DataBlock` is that we're not just using the class directly (i.e., `TextBlock(...)`, but instead are calling a *class method*. A class method is a Python method that, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible preforms a few optimizations: \n",
 "\n",
-"- It save the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once\n",
+"- It saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once\n",
 "- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs\n",
 "\n",
 "We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing—that's what `from_folder` does.\n",
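The class-method point in the hunk above is easy to see in plain Python. A hypothetical illustration; the `Point` and `from_tuple` names are invented for the example and are not fastai API:

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    @classmethod
    def from_tuple(cls, t):
        # Called on the class itself, like TextBlock.from_folder(...):
        # an alternative constructor that returns a new instance.
        return cls(t[0], t[1])

p = Point.from_tuple((3, 4))  # instead of Point(3, 4)
```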
@@ -1654,7 +1654,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This completes the second stage of the text classification process: fine-tuning the language model. We can now use it to fine-tune a clasifier using the IMDb sentiment labels."
+"This completes the second stage of the text classification process: fine-tuning the language model. We can now use it to fine-tune a classifier using the IMDb sentiment labels."
 ]
 },
 {
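The classifier fine-tuning that the hunk above refers to works best with gradual unfreezing rather than unfreezing everything at once. A sketch in the chapter's style, assuming the `learn_clas` learner from the first sketch; the learning-rate slices follow the discriminative 2.6**4 pattern the chapter uses:

```python
learn_clas.fit_one_cycle(1, 2e-2)   # train only the new classification head
learn_clas.freeze_to(-2)            # unfreeze the last two parameter groups
learn_clas.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
learn_clas.freeze_to(-3)            # unfreeze one more group
learn_clas.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
learn_clas.unfreeze()               # finally fine-tune the whole model
learn_clas.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
```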
@@ -2150,7 +2150,7 @@
 "source": [
 "Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
 "\n",
-"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the necessary tools at your disposal to create a compelling language model—that is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dilaogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending."
+"Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the necessary tools at your disposal to create a compelling language model—that is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dialogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending."
 ]
 },
 {
@@ -2196,7 +2196,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this chapter we exlored the last application covered out of the box by the fastai library: text. We saw two types of models: language models that can generate texts, and a classifier that determines if a review is positive or negative. To build a state-of-the art classifier, we used a pretrained language model, fine-tuned it to the corpus of our task, then used its body (the encoder) with a new head to do the classification.\n",
+"In this chapter we explored the last application covered out of the box by the fastai library: text. We saw two types of models: language models that can generate texts, and a classifier that determines if a review is positive or negative. To build a state-of-the art classifier, we used a pretrained language model, fine-tuned it to the corpus of our task, then used its body (the encoder) with a new head to do the classification.\n",
 "\n",
 "Before we end this section, we'll take a look at how the fastai library can help you assemble your data for your specific problems."
 ]
@@ -2271,4 +2271,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }