Some typos
Also, I think these phrases need revision: "Often, data scientists work with data with which they are not familiar as domain experts me" and "we first have to understand what the actual data and activations that are loss function is seen look like"
parent 75bb9dcf4d
commit 87df3aacdc
@@ -163,11 +163,11 @@
 "source": [
 "The most powerful and flexible way to extract information from strings like this is to use a *regular expression*, also known as a *regex*. A regular expression is a special string, written in the regular expression language, which specifies a general rule for deciding if another string passes a test (i.e., \"matches\" the regular expression), and also possibly for plucking a particular part or parts out of that other string. \n",
 "\n",
-"In this case, we need a regular expressin that extracts the pet breed from the file name.\n",
+"In this case, we need a regular expression that extracts the pet breed from the file name.\n",
 "\n",
 "We do not have the space to give you a complete regular expression tutorial here, particularly because there are so many excellent ones online. And we know that many of you will already be familiar with this wonderful tool. If you're not, that is totally fine — this is a great opportunity for you to rectify that! We find that regular expressions are one of the most useful tools in our programming toolkit, and many of our students tell us that it is one of the things they are most excited to learn about. So head over to Google and search for *regular expressions tutorial* now, and then come back here after you've had a good look around. The book website also provides a list of our favorites.\n",
 "\n",
-"> a: Not only are regular expresssions dead handy, they also have interesting roots. They are \"regular\" because they they were originally examples of a \"regular\" language, the lowest rung within the \"Chomsky hierarchy\", a grammar classification due to the same linguist Noam Chomskey who wrote _Syntactic Structures_, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a space ship.\n",
+"> a: Not only are regular expressions dead handy, they also have interesting roots. They are \"regular\" because they were originally examples of a \"regular\" language, the lowest rung within the \"Chomsky hierarchy\", a grammar classification due to the same linguist Noam Chomsky who wrote _Syntactic Structures_, the pioneering work searching for the formal grammar underlying human language. This is one of the charms of computing: it may be that the hammer you reach for every day in fact came from a space ship.\n",
 "\n",
 "When you are writing a regular expression, the best way to start is just to try it against one example at first. Let's use the `findall` method to try a regular expression against the filename of the `fname` object:"
 ]
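For readers who want to see the `findall` call this cell leads into, here is a minimal sketch. The example filename is illustrative (in the notebook, `fname` is a `Path`, so the call is made against `fname.name`):

```python
import re

# An illustrative filename in the style of the Oxford-IIIT Pet dataset,
# where the breed comes before a trailing index: <breed>_<number>.jpg
fname = 'great_pyrenees_173.jpg'

# Capture everything before the final "_<digits>.jpg" as the breed.
print(re.findall(r'(.+)_\d+.jpg$', fname))  # ['great_pyrenees']
```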
@@ -250,7 +250,7 @@
 "1. First, resizing images to relatively \"large dimensions\", that is, dimensions significantly larger than the target training dimensions. \n",
 "1. Second, composing all of the common augmentation operations (including a resize to the final target size) into one, and performing the combined operation on the GPU only once at the end of processing, rather than performing them individually and interpolating multiple times.\n",
 "\n",
-"The first step, the resize, creates images large enough that they have spare margin to allow further augmentation transforms on their inner regions without creating empty zones. This transformation works by resizing to to a square, using a large crop size. On the training set, the crop area is chosen randomly, and the size of the crop is selected to cover the entire width or height of the image, whichever is smaller.\n",
+"The first step, the resize, creates images large enough that they have spare margin to allow further augmentation transforms on their inner regions without creating empty zones. This transformation works by resizing to a square, using a large crop size. On the training set, the crop area is chosen randomly, and the size of the crop is selected to cover the entire width or height of the image, whichever is smaller.\n",
 "\n",
 "In the second step, the GPU is used for all data augmentation, and all of the potentially destructive operations are done together, with a single interpolation at the end."
 ]
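As a concrete anchor for these two steps, here is a sketch of how presizing is typically expressed in fastai, assuming `path` points at the pet images; the exact pipeline in the notebook may differ:

```python
from fastai.vision.all import *

pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
    # Step 1: resize each item to a large square on the CPU, leaving spare margin.
    item_tfms=Resize(460),
    # Step 2: compose all augmentations (plus the final resize to 224)
    # into one operation, run once per batch on the GPU.
    batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path)
```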
@@ -873,7 +873,7 @@
 "source": [
 "In <<chapter_mnist_basics>>, the neural net created a single activation per image, which we passed through the sigmoid function. That single activation represented the confidence that the input was a \"3\". Binary problems are a special case of classification problems, because the target can be treated as a single boolean value, as we did in `mnist_loss`. Binary problems can also be thought of as part of the more general group of classifiers with any number of categories--where in this case we happen to have 2 categories. As we saw in the bear classifier, our neural net will return one activation per category.\n",
 "\n",
-"So in the binary case, what do those activations really indicate? A single pair of activations simply indicates the *relative* confidence of being a \"3\" versus being a \"7\". The overall values, whether they are both high, or both low, don't matter--all that matters it which is higher, and by how much.\n",
+"So in the binary case, what do those activations really indicate? A single pair of activations simply indicates the *relative* confidence of being a \"3\" versus being a \"7\". The overall values, whether they are both high, or both low, don't matter--all that matters is which is higher, and by how much.\n",
 "\n",
 "We would expect that, since this is just another way of representing the same problem (in the binary case), we would be able to use sigmoid directly on the two-activation version of our neural net. And indeed we can! We can just take the *difference* between the neural net activations, because that reflects how much more sure we are of being a \"3\" vs a \"7\", and then take the sigmoid of that:"
 ]
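A quick sketch of that difference-then-sigmoid step, with made-up activations (column 0 standing for "3", column 1 for "7"):

```python
import torch

# Two activations per image; the values are made up for illustration.
acts = torch.tensor([[ 0.6, -0.4],
                     [-1.2,  2.0]])

# Sigmoid of the difference: the probability of "3". Only the gap between
# the two activations matters, not their absolute sizes.
print((acts[:, 0] - acts[:, 1]).sigmoid())  # high for row 1, low for row 2
```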
@@ -1157,7 +1157,7 @@
 "source": [
 "Looking at this table, you can see that the final column can be calculated by taking the `targ` and `idx` columns as indices into the 2-column matrix containing the `3` and `7` columns. That's what `sm_acts[idx, targ]` is actually doing.\n",
 "\n",
-"The really interesting thing here is that this actually works just as well with more than two columns. To see this, consider what would happen if we added a activation column above for every digit (zero through nine), and then `targ` contained a number from zero to nine. As long as the activation columns sum to one (as they will, if we use softmax), then we'll have a loss function that shows how well we're predicting each digit.\n",
+"The really interesting thing here is that this actually works just as well with more than two columns. To see this, consider what would happen if we added an activation column above for every digit (zero through nine), and then `targ` contained a number from zero to nine. As long as the activation columns sum to one (as they will, if we use softmax), then we'll have a loss function that shows how well we're predicting each digit.\n",
 "\n",
 "We're only picking the loss from the column containing the correct label. We don't need to consider the other columns, because by the definition of softmax, they add up to one minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible must mean we're also decreasing the activations of the remaining columns.\n",
 "\n",
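Here is a small standalone sketch of that indexing trick, with made-up softmax outputs; exactly the same expression works unchanged with ten columns and targets from zero to nine:

```python
import torch

# Made-up softmax outputs for three images over two classes ("3" and "7").
sm_acts = torch.tensor([[0.9, 0.1],
                        [0.3, 0.7],
                        [0.2, 0.8]])
targ = torch.tensor([0, 1, 0])   # index of the correct class for each image
idx = torch.arange(len(targ))    # row indices 0..n-1

# For each row, pick the predicted probability of the correct class;
# negating it (and in practice taking the log first) turns this into a loss.
print(sm_acts[idx, targ])  # tensor([0.9000, 0.7000, 0.2000])
```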
@@ -1485,7 +1485,7 @@
 "source": [
 "Since we are not pet breed experts, it is hard for us to know whether these category errors reflect actual difficulties in recognising breeds. So again, we turn to Google. A little bit of googling tells us that the most common category errors shown here are actually breed differences which even expert breeders sometimes disagree about. So this gives us some comfort that we are on the right track.\n",
 "\n",
-"So we seem to have a good baseline. What can we do now ot make it even better?"
+"So we seem to have a good baseline. What can we do now to make it even better?"
 ]
 },
 {
@@ -1499,7 +1499,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We will now look a at a range of techniques to improve the training of our model and make it better. While doing so, we will explain a little bit more about transfer learning and how to fine-tune our pretrained model as best as possible, without breaking the pretrained weights.\n",
+"We will now look at a range of techniques to improve the training of our model and make it better. While doing so, we will explain a little bit more about transfer learning and how to fine-tune our pretrained model as best as possible, without breaking the pretrained weights.\n",
 "\n",
 "The first thing we need to set when training a model is the learning rate. We saw in the previous chapter that it needed to be just right to train as efficiently as possible, so how do we pick a good one? fastai provides something called the *learning rate finder* for this."
 ]
@@ -1515,7 +1515,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One of the most important things we can do when training a model is to make sure that we have the right learning rate. If our learning rate is too low, it's can take many many epochs. Not only does this waste time, but it also means that we may have problems with overfitting, because every time we do a complete pass through the data, we give our model a chance to memorise it.\n",
+"One of the most important things we can do when training a model is to make sure that we have the right learning rate. If our learning rate is too low, it can take many, many epochs to train our model. Not only does this waste time, but it also means that we may have problems with overfitting, because every time we do a complete pass through the data, we give our model a chance to memorise it.\n",
 "\n",
 "So let's just make our learning rate really high, right? Sure, let's try that and see what happens:"
 ]
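The experiment this cell leads into presumably looks something like the following sketch; `dls` is assumed to be the pet `DataLoaders` built earlier, and `base_lr=0.1` is deliberately far too large:

```python
from fastai.vision.all import *

# Train with a deliberately bad learning rate, to watch training diverge.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1, base_lr=0.1)  # expect a very poor error rate
```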
@@ -1599,7 +1599,7 @@
 "source": [
 "That did not look good. Here's what happened. The optimiser stepped in the correct direction, but it stepped so far that it totally overshot the minimum loss. Repeating that multiple times makes it get further and further away, not closer and closer!\n",
 "\n",
-"What do we to find the perfect learning rate, not too high, and not too low? In 2015 the researcher Leslie Smith came up with a brilliant idea, called the *learning rate finder*. His idea was to start with a very very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini batch, find what the losses afterwards, and then increase the learning rate by some percentage (e.g. doubling it each time). Then we do another mini batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better. This is the point where we know we have gone too far. We then select a learning rate a bit lower than this point. Our advice is to pick either:\n",
+"What do we do to find the perfect learning rate, not too high, and not too low? In 2015 the researcher Leslie Smith came up with a brilliant idea, called the *learning rate finder*. His idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini batch, find the losses afterwards, and then increase the learning rate by some percentage (e.g. doubling it each time). Then we do another mini batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better. This is the point where we know we have gone too far. We then select a learning rate a bit lower than this point. Our advice is to pick either:\n",
 "\n",
 "- one order of magnitude less than where the minimum loss was achieved (i.e. the minimum divided by 10)\n",
 "- the last point where the loss was clearly decreasing. \n",
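A sketch of running the finder just described; in the fastai version the book uses, `lr_find` returns suggestions matching the two rules above (`dls` as before):

```python
from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=error_rate)
# Trains on successive mini-batches with an exponentially growing learning
# rate, stopping once the loss blows up, then plots loss vs. learning rate.
lr_min, lr_steep = learn.lr_find()
print(f"minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")
```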
@@ -1661,7 +1661,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can see on this plot that in the range 1e-6 to 1e-3, nothing really happens and the model doesn't train. Then the loss starts to decrease until it reaches a minimum then increases again. We don't want a learning rate greater than 1e-1 as it will give a training that diverges (you can try for yourself) but 1e-1 is already too high: at this stage we left the period where the loss was decreasing steadily.\n",
+"We can see on this plot that in the range 1e-6 to 1e-3, nothing really happens and the model doesn't train. Then the loss starts to decrease until it reaches a minimum and then increases again. We don't want a learning rate greater than 1e-1 as it will give a training that diverges (you can try it for yourself), but 1e-1 is already too high: at this stage we have left the period where the loss was decreasing steadily.\n",
 "\n",
 "In this learning rate plot it appears that a learning rate around 3e-3 would be appropriate, so let's choose that."
 ]
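Training with that choice might look like this sketch (`dls` as before; `base_lr` is the rate read off the finder's plot):

```python
from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2, base_lr=3e-3)  # the rate suggested by the plot
```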
@@ -1780,15 +1780,15 @@
 "source": [
 "We discussed briefly in <<chapter_intro>> how transfer learning works. We saw that the basic idea is that a pretrained model, trained potentially on millions of data points (such as ImageNet), is fine tuned for some other task. But what does this really mean?\n",
 "\n",
-"We now know that a convolutional neural network consists of many layers with a non-linear activation function between each and one or more final linear layers, with an activation functions such as softmax at the very end. The final linear layer uses a matrix with enough columns such that the output size is the same as the number of classes in our model (assuming that we are doing classification).\n",
+"We now know that a convolutional neural network consists of many layers with a non-linear activation function between each, and one or more final linear layers, with an activation function such as softmax at the very end. The final linear layer uses a matrix with enough columns such that the output size is the same as the number of classes in our model (assuming that we are doing classification).\n",
 "\n",
-"This final linear layer is unlikely to be of any use for us, when we are fine tuning in a transfer learning setting, because it is specifically designed to classify the categories in the original pretraining dataset. So when we do transfer learning we remove it, and throw it away, and replace it with a new linear layer with the correct number of outputs for our desired task (in this case, there would be 37 activations).\n",
+"This final linear layer is unlikely to be of any use for us when we are fine tuning in a transfer learning setting, because it is specifically designed to classify the categories in the original pretraining dataset. So when we do transfer learning we remove it, throw it away, and replace it with a new linear layer with the correct number of outputs for our desired task (in this case, there would be 37 activations).\n",
 "\n",
 "This newly added linear layer will have entirely random weights. Therefore, our model prior to fine tuning has entirely random outputs. But that does not mean that it is an entirely random model! All of the layers prior to the last one have been carefully trained to be good at image classification tasks in general. As we saw in the images from the Zeiler and Fergus paper in <<chapter_intro>> (see <<img_layer1>> and the following figures), the first layers encode very general concepts such as finding gradients and edges, and later layers encode concepts that are still very useful for us, such as finding eyeballs and fur.\n",
 "\n",
 "We want to train a model in such a way that we allow it to remember all of these generally useful ideas from the pretrained model, use them to solve our particular task (classify pet breeds), and only adjust them as required for the specifics of our particular task.\n",
 "\n",
-"Our challenge than when fine tuning is to replace the random weights in our added linear layers with weights that correctly achieve our desired task (classifying pet breeds) without breaking the carefully pretrained weights and the other layers. There is actually a very simple trick to allow this to happen: tell the optimiser to only update the weights in those randomly added final layers. Don't change the weights in the rest of the neural network at all. This is called *freezing* those pretrained layers."
+"Our challenge, then, when fine tuning is to replace the random weights in our added linear layers with weights that correctly achieve our desired task (classifying pet breeds) without breaking the carefully pretrained weights in the other layers. There is actually a very simple trick to allow this to happen: tell the optimiser to only update the weights in those randomly added final layers. Don't change the weights in the rest of the neural network at all. This is called *freezing* those pretrained layers."
 ]
 },
 {
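Roughly what this looks like when written out by hand rather than via `fine_tune` (a sketch; `dls` as before):

```python
from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=error_rate)
# cnn_learner freezes the pretrained layers by default, so this trains
# only the randomly initialised head:
learn.fit_one_cycle(3, 3e-3)
# Then make the whole network trainable for further fine tuning:
learn.unfreeze()
```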
@@ -1883,7 +1883,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"...and run `lr_find` again, because having more layers to train, and weights that have already been trained for 3 epochs, means our previously found learning rate isn't appropriate and more:"
+"...and run `lr_find` again, because having more layers to train, and weights that have already been trained for 3 epochs, means our previously found learning rate isn't appropriate any more:"
 ]
 },
 {
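Continuing the sketch above, with the network now unfrozen:

```python
# Rerun the finder now that every layer can be updated...
learn.lr_find()
# ...and train again with the (much lower) rate it suggests, e.g.:
learn.fit_one_cycle(6, lr_max=1e-5)
```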
@@ -2274,7 +2274,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Often you will find that you are limited by time, rather than generalisation and accuracy, when choosing how many epochs to train for. So your first approach to training should be to simply pick a number of epochs that will train in the amount of time that you are happy to wait for. Have a look at the training and validation loss plots, likely showed above, and in particular your metrics, and if you see that they are still getting better even in your final epochs, then you know that you have not trained for too long.\n",
+"Often you will find that you are limited by time, rather than generalisation and accuracy, when choosing how many epochs to train for. So your first approach to training should be to simply pick a number of epochs that will train in the amount of time that you are happy to wait for. Have a look at the training and validation loss plots, like those shown above, and in particular your metrics, and if you see that they are still getting better even in your final epochs, then you know that you have not trained for too long.\n",
 "\n",
 "On the other hand, you may well see that the metrics you have chosen are really getting worse at the end of training. Remember, it's not just that we're looking for the validation loss to get worse, but your actual metrics. Your validation loss will first get worse during training because the model gets overconfident, and only later will get worse because it is incorrectly memorising the data. We only care in practice about the latter issue. Our loss function is just something, remember, that we used to allow our optimiser to have something it could differentiate and optimise; it's not actually the thing we care about in practice.\n",
 "\n",