correct typos in nb 14

John Wu 2020-03-03 00:52:18 -05:00
parent cd65c692a2
commit e3e8ce286a


@@ -28,7 +28,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are so good at recognizing threes from sevens, let's move onto something harder—recognized all 10 digits. That means we'll need to use `MNIST` instead of `MNIST_SAMPLE`:"
"Since we are so good at recognizing threes from sevens, let's move onto something harder—recognizing all 10 digits. That means we'll need to use `MNIST` instead of `MNIST_SAMPLE`:"
]
},
{
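For reference, a minimal sketch of loading the full dataset with fastai's DataBlock API (assuming the usual `from fastai.vision.all import *` star-import and the `training`/`testing` folder layout that `URLs.MNIST` unpacks to; the batch size is illustrative):

from fastai.vision.all import *

path = untar_data(URLs.MNIST)                            # full 10-digit MNIST, not the 3s-vs-7s sample
dls = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),  # grayscale images in, 10 categories out
    get_items=get_image_files,
    splitter=GrandparentSplitter('training', 'testing'), # split by the download's folder names
    get_y=parent_label,                                   # label each image by its parent folder (0-9)
    batch_tfms=Normalize(),
).dataloaders(path, bs=64)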
@@ -163,7 +163,7 @@
"source": [
"Let's start with a basic CNN as a baseline. We'll use the same as we had in the last chapter, but with one tweak: we'll use more activations.\n",
"\n",
"As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer them then every layer after that will end up twice as big as the previous version as well.\n",
"As we discussed, we generally want to double the number of filters each time we have a stride 2 layer. So, one way to increase the number of filters throughout our network is to double the number of activations in the first layer then every layer after that will end up twice as big as the previous version as well.\n",
"\n",
"But there is a subtle problem with this. Consider the kernel which is being applied to each pixel. By default, we use a 3x3 pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four filters output. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to 8 filters. Then when we apply our kernel we would be using nine pixels to calculate eight numbers. That means that it isn't really learning much at all — the output size is almost the same as the input size. Neural networks will only create useful features if they're forced to do so—that is, that the number of outputs from an operation is smaller than the number of inputs.\n",
"\n",
@@ -191,7 +191,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you'll see in a moment, we're going to look inside our models while they're training, in order to try to find ways to make them train better. To do this, we use the `ActivationStats` callback, which records the mean, standard deviation, and histogram of activations of every trainable layer (as we've seen, callbacks are used to add behavior to the training loop; we'll see how they work in <<chapter_callbacks>>)."
"As you'll see in a moment, we're going to look inside our models while they're training in order to try to find ways to make them train better. To do this, we use the `ActivationStats` callback, which records the mean, standard deviation, and histogram of activations of every trainable layer (as we've seen, callbacks are used to add behavior to the training loop; we'll see how they work in <<chapter_callbacks>>)."
]
},
{
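A hedged sketch of attaching the callback when building the `Learner` (it assumes a `dls` and `simple_cnn()` like the sketches above; the loss, metric, `with_hist=True` option, and learning rate are illustrative):

from fastai.vision.all import *

# Passing the callback via `cbs` hooks every trainable layer during training and
# records the mean, standard deviation, and histogram of its activations.
learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy, metrics=accuracy,
                cbs=ActivationStats(with_hist=True))
learn.fit(1, 0.06)   # illustrative epoch count and learning rate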
@@ -270,7 +270,7 @@
"source": [
"This didn't train at all well! Let's find out why.\n",
"\n",
"One handy feature of callbacks that you pass to `Learner` is that they are made available automatically, with the same name as the callback class, except in `camel_case`. So our `ActivationStats` callback can be accessed through `activation_stats`. In fact--I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
"One handy feature of the callbacks passed to `Learner` is that they are made available automatically, with the same name as the callback class, except in `camel_case`. So our `ActivationStats` callback can be accessed through `activation_stats`. In fact--I'm sure you remember `learn.recorder`... can you guess how that is implemented? That's right, it's a callback called `Recorder`!\n",
"\n",
"`ActivationStats` includes some handy utilities for plotting the activations during training. `plot_layer_stats(idx)` plots the mean and standard deviation of the activations of layer number `idx`, along with the percent of activations near zero. Here's the first layer's plot:"
]
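For example, continuing from the `learn` object sketched above:

learn.recorder                                # the Recorder callback, under its snake_case name
learn.activation_stats                        # likewise for ActivationStats
learn.activation_stats.plot_layer_stats(0)    # mean, std, and % near zero for the first layer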
@@ -446,7 +446,7 @@
"source": [
"Our initial weights are not well suited to the task we're trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly, as we've seen above. We probably don't want to end training with a high learning rate either, so that we don't skip over a minimum. But we want to train at a high learning rate for the rest of training, because we'll be able to train more quickly. Therefore, we should change the learning rate during training, from low, to high, and then back to low again.\n",
"\n",
"Leslie Smith (yes, the same guy that invented the learning rate finder!) developed this idea in his article [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120) by designing a schedule for learning rate separated in two phases: one were the learning rate grows from the minimum value to the maximum value (*warm-up*) then one where it decreases back to the minimum value (*annealing*). Smith called this combination of approaches *1cycle training*.\n",
"Leslie Smith (yes, the same guy that invented the learning rate finder!) developed this idea in his article [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120) by designing a schedule for learning rate separated in two phases: one were the learning rate grows from the minimum value to the maximum value (*warm-up*), and then one where it decreases back to the minimum value (*annealing*). Smith called this combination of approaches *1cycle training*.\n",
"\n",
"1cycle training allows us to use a much higher maximum learning rate than other types of training, which gives two benefits:\n",
"\n",
@@ -457,7 +457,7 @@
"\n",
"Then, once we have found a nice smooth area for our parameters, we then want to find the very best part of that area, which means we have to bring out learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models, and trains more quickly. That is why it is the approach that is used by default for `fine_tune` in fastai.\n",
"\n",
"Later in this book we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum vary in the opposite direction of the learning rate: when we are at high learning rate, we use less momentum, and we use more again in the annealing phase.\n",
"Later in this book we'll learn all about *momentum* in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but also continues in the direction of previous steps. Leslie Smith introduced cyclical momentums in [A disciplined approach to neural network hyper-parameters: Part 1](https://arxiv.org/pdf/1803.09820.pdf). It suggests that the momentum varies in the opposite direction of the learning rate: when we are at high learning rate, we use less momentum, and we use more again in the annealing phase.\n",
"\n",
"We can use 1cycle training in fastai by calling `fit_one_cycle`:"
]
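A hedged usage sketch (the epoch count and `lr_max` are illustrative; `learn` is any fastai Learner):

learn.fit_one_cycle(5, lr_max=6e-2)   # warm-up then annealing, handled automatically
learn.recorder.plot_sched()           # plot the learning rate and momentum schedules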