fixes typos
commit e4d1674163 (parent a909a8cb5b)
@@ -44,7 +44,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 words written out in English."
+"Whenever we start working on a new problem, we always first try to think of the simplest dataset we can which would allow us to try out methods quickly and easily, and interpret the results. When we started working on language modelling a few years ago, we didn't find any datasets that would allow for quick prototyping, so we made one. We call it *human numbers*, and it simply contains the first 10,000 numbers written out in English."
]
},
{
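For readers following along, a minimal sketch of peeking at this dataset might look like the following; the `URLs.HUMAN_NUMBERS` download and the `train.txt`/`valid.txt` file names are assumptions based on how the chapter uses it.

```python
from fastai.text.all import *

# Download the Human Numbers dataset and read its two text files
# (file names assumed; adjust if the layout differs).
path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
print(lines[:3])  # e.g. 'one \n', 'two \n', 'three \n'
```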
@@ -717,7 +717,7 @@
"source": [
"Looking at the code for our RNN, one thing that seems problematic is that we are initialising our hidden state to zero for every new input sequence. Why is that a problem? We made our sample sequences short so they would fit easily into batches. But if we order those samples correctly, those sample sequences will be read in order by the model, exposing the model to long stretches of the original sequence. \n",
"\n",
-"Another thing we can look at is havin more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
+"Another thing we can look at is having more signal: why only predict the fourth word when we could use the intermediate predictions to also predict the second and third words? \n",
"\n",
"We'll see how we can implement those changes, starting with adding some state."
]
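To make those two ideas concrete, here is a rough sketch (not the notebook's exact code) of an RNN module that keeps its hidden state between batches and outputs a prediction after every token; the names `StatefulRNN`, `vocab_sz`, and `n_hidden` are illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class StatefulRNN(nn.Module):
    "Sketch: keeps hidden state across calls and predicts after every token."
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input -> hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden -> hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden -> output
        self.h = 0                                   # hidden state, kept between batches

    def forward(self, x):                            # x: (bs, seq_len) of token ids
        outs = []
        for i in range(x.shape[1]):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))            # predict at every position
        self.h = self.h.detach()                     # keep the value, drop the gradient history
        return torch.stack(outs, dim=1)              # (bs, seq_len, vocab_sz)

    def reset(self): self.h = 0                      # call at the start of each epoch
```

Note that this assumes a fixed batch size and batches ordered so that each sample continues the previous one, which is exactly why the text above insists on ordering the samples correctly.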
@@ -1572,7 +1572,7 @@
"\n",
"$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x}+e^{-x}} = 2 \sigma(2x) - 1$$\n",
"\n",
-"where $\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the left, ready for our next input. The new hidden state is also use as output, which is why the arrow splits to go up.\n",
+"where $\sigma$ is the sigmoid function. The green boxes are elementwise operations. What goes out is the new hidden state ($h_{t}$) and new cell state ($c_{t}$) on the right, ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.\n",
"\n",
"Let's go over the four neural nets (called *gates*) one by one and explain the diagram, but before this, notice how very little the cell state (on the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.\n",
"\n",
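As a quick numerical check of the identity above (a small PyTorch snippet, not from the notebook):

```python
import torch

x = torch.linspace(-3, 3, 7)
lhs = torch.tanh(x)
rhs = 2 * torch.sigmoid(2 * x) - 1                        # 2*sigma(2x) - 1
ratio = (x.exp() - (-x).exp()) / (x.exp() + (-x).exp())   # (e^x - e^-x) / (e^x + e^-x)
print(torch.allclose(lhs, rhs, atol=1e-6))                # True
print(torch.allclose(lhs, ratio, atol=1e-6))              # True
```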
@@ -1580,7 +1580,7 @@
"\n",
"The first gate (looking from left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell state, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect it to (have learned to) reset its cell state.\n",
"\n",
-"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it jsut decides which element of the cell state to update (valeus close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
+"The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it just decides which element of the cell state to update (values close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
"\n",
"The last gate is the *output gate*. It decides which information from the cell state to use to generate the output. The cell state goes through a tanh before this, and the output gate combined with the sigmoid decides which values to take from it.\n",
"\n",
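Putting the four gates together, a from-scratch LSTM cell along these lines could look as follows; this is a sketch in plain PyTorch with illustrative names, not the notebook's exact code, and PyTorch ships an optimized `nn.LSTM` for real use.

```python
import torch
from torch import nn

class LSTMCellScratch(nn.Module):
    "Sketch of one LSTM step: forget, input, cell, and output gates."
    def __init__(self, ni, nh):
        super().__init__()
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, x, state):
        h, c = state                              # previous hidden and cell state
        hx = torch.cat([h, x], dim=1)             # stack hidden state with the new input
        f = torch.sigmoid(self.forget_gate(hx))   # what to erase from the cell state
        i = torch.sigmoid(self.input_gate(hx))    # which cell entries to update
        g = torch.tanh(self.cell_gate(hx))        # candidate values in (-1, 1)
        o = torch.sigmoid(self.output_gate(hx))   # what to expose as output
        c = f * c + i * g                         # forget, then add the new information
        h = o * torch.tanh(c)                     # new hidden state, also used as output
        return h, (h, c)
```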
@@ -1894,7 +1894,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTMs (or GRUs) cell make training easier than vanilla RNNs, but there are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases, it requires another model to generate random augmentation (by translating in another language and back to the language used for instance). Overall, data augmentation for text data is currently not a well explored space.\n",
+"Recurrent neural networks, in general, are hard to train, because of the problems of vanishing activations and gradients we saw before. Using LSTM (or GRU) cells makes training easier than with vanilla RNNs, but they are still very prone to overfitting. Data augmentation, while it exists for text data, is less often used because in most cases it requires another model to generate random augmentations (for instance, by translating into another language and back into the original language). Overall, data augmentation for text data is currently not a well-explored space.\n",
"\n",
"However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182). This paper showed how effective use of *dropout*, *activation regularization*, and *temporal activation regularization* could allow an LSTM to beat state of the art results that previously required much more complicated models. They called an LSTM using these techniques an *AWD LSTM*. We'll look at each of these techniques in turn."
]
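As a rough illustration of the activation regularization (AR) and temporal activation regularization (TAR) penalties from that paper, a sketch might look like this; the function name, the `alpha`/`beta` values, and the activation shapes are assumptions, and fastai wraps this logic in its own training callback.

```python
import torch

def ar_tar_penalty(raw_out, dropped_out, alpha=2.0, beta=1.0):
    "raw_out/dropped_out: LSTM activations of shape (bs, seq_len, n_hid)."
    # AR: push the (dropped-out) activations themselves toward zero
    ar = alpha * dropped_out.float().pow(2).mean()
    # TAR: penalize large changes between consecutive time steps of the raw activations
    tar = beta * (raw_out[:, 1:] - raw_out[:, :-1]).float().pow(2).mean()
    return ar + tar

# usage sketch: loss = cross_entropy(preds, targets) + ar_tar_penalty(raw_out, dropped_out)
```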
@@ -2090,7 +2090,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"We can the train the model, and add additional regularization by increasing the weight decay to `0.1`:"
+"We can then train the model, and add additional regularization by increasing the weight decay to `0.1`:"
]
},
{
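With a fastai `Learner` named `learn` built as earlier in the chapter, that call might look like the following; the epoch count and learning rate are illustrative, only the `wd=0.1` part is the point.

```python
# weight decay is passed through the `wd` argument of the fit methods
learn.fit_one_cycle(15, 1e-2, wd=0.1)
```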
@@ -2257,7 +2257,7 @@
"- weight dropout (applied to the weights of the LSTM at each training step)\n",
"- hidden dropout (applied to the hidden state between two layers)\n",
"\n",
-"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, so we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
+"which makes it even more regularized. Since fine-tuning those five dropout values (adding the dropout before the output layer) is complicated, we have determined good defaults, and allow the magnitude of dropout to be tuned overall with the `drop_mult` parameter you saw (which is multiplied by each dropout).\n",
"\n",
"Another architecture that is very powerful, especially in \"sequence to sequence\" problems (that is, problems where the dependent variable is itself a variable length sequence, such as language translation), is the Transformers architecture. You can find it in an online bonus chapter on the book website."
]
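For example, when building a language model with fastai as in the earlier chapters, all of the AWD-LSTM dropout probabilities can be scaled in one place; the `dls` DataLoaders and the `0.3` value here are illustrative assumptions.

```python
from fastai.text.all import *

# drop_mult scales every dropout probability of the AWD-LSTM at once
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=accuracy)
```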
@@ -2274,7 +2274,7 @@
"metadata": {},
"source": [
"1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?\n",
-"1. Why do we concatenating the documents in our dataset before creating a language model?\n",
+"1. Why do we concatenate the documents in our dataset before creating a language model?\n",
"1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?\n",
"1. How can we share a weight matrix across multiple layers in PyTorch?\n",
"1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.\n",
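On the weight-sharing question above: in PyTorch, two layers share a weight matrix when they hold the very same `nn.Parameter`. A minimal sketch with made-up layer names:

```python
import torch
from torch import nn

class TiedModel(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz, bias=False)
        self.h_o.weight = self.i_h.weight  # both layers now use the same Parameter
```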
@@ -2347,6 +2347,31 @@
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,