commit 5ca8bf280f
@@ -97,7 +97,7 @@
|
||||
"\n",
|
||||
"- NLP: answering questions; speech recognition; summarizing documents; classifying documents; finding names, dates, etc in documents; searching for articles mentioning a concept\n",
|
||||
"- Computer vision: satellite and drone imagery interpretation (e.g. for disaster resilience); face recognition; image captioning; reading traffic signs; locating pedestrians and vehicles in autonomous vehicles\n",
|
||||
"- Medicine: Finding anomolies in radiology images, including CT, MRI, and x-ray; counting features in pathology slides; measuring features in ultrasounds; diagnosing diabetic retinopathy\n",
|
||||
"- Medicine: Finding anomalies in radiology images, including CT, MRI, and x-ray; counting features in pathology slides; measuring features in ultrasounds; diagnosing diabetic retinopathy\n",
|
||||
"- Biology: folding proteins; classifying proteins; many genomics tasks, such as tumor-normal sequencing and classifying clinically actionable genetic mutations; cell classification; analyzing protein/protein interactions\n",
|
||||
"- Image generation: Colorizing images; increasing image resolution; removing noise from images; Converting images to art in the style of famous artists\n",
|
||||
"- Recommendation systems: web search; product recommendations; home page layout\n",
|
||||
@@ -1464,7 +1464,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Overfitting is the most important and challenging single issue** when training for all machine learning practitioners, and all algorithms. As we will see, it is very easy to create a model that does a great job at making predictions on the exact data which it has been trained on, but it is much harder to make predictions on data that it has never seen before. And of course this is the data that will actually matter in practice. For instance, if you create and hand-written digit classifier (as we will very soon!) and use it to recognise numbers written on cheques, then you are never going to see any of the numbers that the model was trained on -- every cheque will have slightly different variations of writing to deal with. We will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occuring (i.e. you have actually observed the validation accuracy getting worse during training). We often see practitioners using over-fitting avoidance techniques even when they have enough data that they didn't need to do so, ending up with a model that could be less accurate than what they could have gotten."
|
||||
"**Overfitting is the most important and challenging single issue** when training for all machine learning practitioners, and all algorithms. As we will see, it is very easy to create a model that does a great job at making predictions on the exact data which it has been trained on, but it is much harder to make predictions on data that it has never seen before. And of course this is the data that will actually matter in practice. For instance, if you create and hand-written digit classifier (as we will very soon!) and use it to recognise numbers written on cheques, then you are never going to see any of the numbers that the model was trained on -- every cheque will have slightly different variations of writing to deal with. We will learn many methods to avoid overfitting in this book. However, you should only use those methods after you have confirmed that overfitting is actually occurring (i.e. you have actually observed the validation accuracy getting worse during training). We often see practitioners using over-fitting avoidance techniques even when they have enough data that they didn't need to do so, ending up with a model that could be less accurate than what they could have gotten."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -2089,7 +2089,7 @@
|
||||
"\n",
|
||||
"The outputs themselves can be deceiving: they have the results of the last time the cell was executed, but if you change the code inside a cell without executing it, you will keep them.\n",
|
||||
"\n",
|
||||
"Except when we mention it explicitely, the notebooks provided on the book website are meant to be run in order, from top to bottom. In general, when experimenting, you will find yourself executing cells in any order to go fast (which is a super neat featur of Jupyter Notebooks) but once you have explored and arrive at the final version of your code, make sure you can run the cells of your notebooks in order (your future self won't necessarily remember the convoluted path you took otherwise!). \n",
|
||||
"Except when we mention it explicitly, the notebooks provided on the book website are meant to be run in order, from top to bottom. In general, when experimenting, you will find yourself executing cells in any order to go fast (which is a super neat feature of Jupyter Notebooks) but once you have explored and arrive at the final version of your code, make sure you can run the cells of your notebooks in order (your future self won't necessarily remember the convoluted path you took otherwise!). \n",
|
||||
"\n",
|
||||
"In edit mode, pressing `0` twice will restart the *kernel* (which is the engine powering your notebook). This will wipe your state clean and make it as if you had just started in the notebook. Clean then on the \"Cell\" menu and then on \"Run All Above\" to run all the cells above the point you are. We have found this to be very useful when developing the fastai library."
|
||||
]
|
||||
@@ -2581,7 +2581,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To do a good job of defining a validation set (and possibly a test set), you will sometimes want to do more than just randomly grab a fraction of your original datset. Remember: a key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But you usually still do know some things.\n",
|
||||
"To do a good job of defining a validation set (and possibly a test set), you will sometimes want to do more than just randomly grab a fraction of your original dataset. Remember: a key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But you usually still do know some things.\n",
|
||||
"\n",
|
||||
"It's instructive to look at a few example cases. Many of these examples come from predictive modeling competitions on the *Kaggle* platform, which is a good representation of problems and methods you would see in practice.\n",
|
||||
"\n",
|
||||
@@ -2721,7 +2721,7 @@
|
||||
"1. What is the name of the theorem that a neural network can solve any mathematical problem to any level of accuracy?\n",
|
||||
"1. What do you need in order to train a model?\n",
|
||||
"1. How could a feedback loop impact the rollout of a predictive policing model?\n",
|
||||
"1. Do we always have to use 224x224 pixel images with the cat recogition model?\n",
|
||||
"1. Do we always have to use 224x224 pixel images with the cat recognition model?\n",
|
||||
"1. What is the difference between classification and regression?\n",
|
||||
"1. What is a validation set? What is a test set? Why do we need them?\n",
|
||||
"1. What will fastai do if you don't provide a validation set?\n",
|
||||
|
@@ -77,7 +77,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Text (natural language processing)**: just like in computer vision, computers are very good at categorising both short and long documents based on categories such as spam, sentiment, author, source website, and so forth. We are not aware of any rigourous work done in this area to compare to human performance, but anecdotally it seems to us that deep learning performance is similar to human performance here. Deep learning is also very good at generating context-appropriate text, such as generating replies to social media posts, and imitating a particular author's style. It is also good at making this content compelling to humans, and has been shown to be even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information, along with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content which appears to a layman to be compelling, but actually is entirely incorrect.\n",
|
||||
"**Text (natural language processing)**: just like in computer vision, computers are very good at categorising both short and long documents based on categories such as spam, sentiment, author, source website, and so forth. We are not aware of any rigorous work done in this area to compare to human performance, but anecdotally it seems to us that deep learning performance is similar to human performance here. Deep learning is also very good at generating context-appropriate text, such as generating replies to social media posts, and imitating a particular author's style. It is also good at making this content compelling to humans, and has been shown to be even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information, along with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content which appears to a layman to be compelling, but actually is entirely incorrect.\n",
|
||||
"\n",
|
||||
"Another concern is that context-appropriate, highly compelling responses on social media can be used at massive scale — thousands of times greater than any troll farm previously seen — to spread disinformation, create unrest, and encourage conflict. As a rule of thumb, text generation will always be technologically a bit ahead of the ability of models to recognize automatically generated text. For instance, as we will see in this book, it is possible to use a model that can recognize artificially generated content to actually improve the generator that creates that content, until the classification model is no longer able to complete its task.\n",
|
||||
"\n",
|
||||
|
@@ -173,7 +173,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img src=\"images/ethics/image2.png\" id=\"meeting\" caption=\"IBM CEO Tom Watson Sr. metting with Adolf Hitler\" alt=\"A picture of IBM CEO Tom Watson Sr. metting with Adolf Hitler\" width=\"400\">"
|
||||
"<img src=\"images/ethics/image2.png\" id=\"meeting\" caption=\"IBM CEO Tom Watson Sr. meeting with Adolf Hitler\" alt=\"A picture of IBM CEO Tom Watson Sr. meeting with Adolf Hitler\" width=\"400\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -328,7 +328,7 @@
|
||||
"source": [
|
||||
"Russia Today's coverage of the Mueller report was an extreme outlier in how many channels were recommending it. This suggests the possibility that Russia Today, a state-owned Russia media outlet, has been successful in gaming YouTube's recommendation algorithm. The lack of transparency of systems like this make it hard to uncover the kinds of problems that we're discussing.\n",
|
||||
"\n",
|
||||
"Another example of a feedbakc loop is a predictive policing algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. University of Utah computer science processor Suresh Venkatasubramanian says about this: \"Predictive policing is aptly named: it is predicting future policing, not future crime.”\n",
|
||||
"Another example of a feedback loop is a predictive policing algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. University of Utah computer science processor Suresh Venkatasubramanian says about this: \"Predictive policing is aptly named: it is predicting future policing, not future crime.”\n",
|
||||
"\n",
|
||||
"There are positive examples of people and organizations attempting to combat these problems. Evan Estola, lead machine learning engineer at Meetup, [discussed the example](https://www.youtube.com/watch?v=MqoRzNhrTnQ) of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop, but explicitly not using gender for that part of their model. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact. \"You need to decide which feature not to use in your algorithm… the most optimal algorithm is perhaps not the best one to launch into production\", he said.\n",
|
||||
"\n",
|
||||
@@ -420,7 +420,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Yes, that is showing what you think it is: Google Photos classified a Black user's photo with their friend as \"gorrilas\"! This algorithmic mis-step got a lot of attention in the media. “We’re appalled and genuinely sorry that this happened,” a company spokeswoman said. “There is still clearly a lot of work to do with automatic image labeling, and we’re looking at how we can prevent these types of mistakes from happening in the future.”\n",
|
||||
"Yes, that is showing what you think it is: Google Photos classified a Black user's photo with their friend as \"gorillas\"! This algorithmic mis-step got a lot of attention in the media. “We’re appalled and genuinely sorry that this happened,” a company spokeswoman said. “There is still clearly a lot of work to do with automatic image labeling, and we’re looking at how we can prevent these types of mistakes from happening in the future.”\n",
|
||||
"\n",
|
||||
"Unfortunately, fixing problems in machine learning systems when the input data has problems is hard. Google's first attempt didn't inspire confidence, as covered by The Guardian:"
|
||||
]
|
||||
@@ -429,7 +429,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img src=\"images/ethics/image8.png\" id=\"gorilla-ban\" caption=\"Google first response to the problem\" alt=\"Pictures of a headlines form the Guardian, whoing Google removed gorialls and other moneys fomr the possible labels of its algorithm\" width=\"500\">"
|
||||
"<img src=\"images/ethics/image8.png\" id=\"gorilla-ban\" caption=\"Google first response to the problem\" alt=\"Pictures of a headlines from the Guardian, whoing Google removed gorillas and other moneys from the possible labels of its algorithm\" width=\"500\">"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -795,14 +795,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> : The ethical implications of algorithmic systems have been much discussed in both HCI and the broader community of those interested in technology design, development and policy. In this paper, we explore the application of one prominent ethical framework - Fairness, Accountability, and Transparency - to a proposed algorithm that resolves various societal issues around food security and population ageing. Using various standardised forms of algorithmic audit and evaluation, we drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system. We discuss how this might serve as a guide to other researchers or practitioners looking to ensure better ethical outcomes from algorithmic systems in their line of work."
|
||||
"> : The ethical implications of algorithmic systems have been much discussed in both HCI and the broader community of those interested in technology design, development and policy. In this paper, we explore the application of one prominent ethical framework - Fairness, Accountability, and Transparency - to a proposed algorithm that resolves various societal issues around food security and population aging. Using various standardised forms of algorithmic audit and evaluation, we drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system. We discuss how this might serve as a guide to other researchers or practitioners looking to ensure better ethical outcomes from algorithmic systems in their line of work."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this paper, the rather contraversial proposal (\"Turning the Elderly into High-Nutrient Slurry\") and the results (\"drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system\") are at odds... to say the least!\n",
|
||||
"In this paper, the rather controversial proposal (\"Turning the Elderly into High-Nutrient Slurry\") and the results (\"drastically increase the algorithm's adherence to the FAT framework, resulting in a more ethical and beneficent system\") are at odds... to say the least!\n",
|
||||
"\n",
|
||||
"In philosophy, and especially philosophy of ethics, this is one of the most effective tools: first, come up with a process, definition, set of questions, etc, which is designed to resolve some problem. Then try to come up with an example where that apparent solution results in a proposal that no-one would consider acceptable. This can then lead to a further refinement of the solution."
|
||||
]
|
||||
@@ -828,7 +828,7 @@
|
||||
"\n",
|
||||
"Ethical behavior in industry is necessary as well, since:\n",
|
||||
"- Law will not always keep up\n",
|
||||
"- Edge cases will arise in which pracitioners must use their best judgement."
|
||||
"- Edge cases will arise in which practitioners must use their best judgement."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -3053,7 +3053,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from of a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the speficic task we are targetting. So the model will need to *learn* better weights.\n",
|
||||
"To summarize, at the beginning, the weights of our model can be random (training *from scratch*) or come from of a pretrained model (*transfer learning*). In the first case, the output we will get from our inputs won't have anything to do with what we want, and even in the second case, it's very likely the pretrained model won't be very good at the specific task we are targeting. So the model will need to *learn* better weights.\n",
|
||||
"\n",
|
||||
"To do this, we will compare the outputs the model gives us with our targets (we have labelled data, so we know what result the model should give) using a *loss function*, which returns a number that needs to be as low as possible. Our weights need to be improved. To do this, we take a few data items (such as images) that we feed to our model. After going through our model, we compare to the corresponding targets using our loss function. The score we get tells us how wrong our predictions were, and we will change the weights a little bit to make it slightly better.\n",
|
||||
"\n",
|
||||
@@ -3432,7 +3432,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This image shows two matrices, `A` and `B` being multiplied together. Each item of the result, which we'll call `AB`, contains each item of its corresponding row of `A` multiplied by each item of its corresponding column of `B`, added together. For instance, row 1 column 2 (the orange dot with a red border) is calculated as $a_{1,1} * b_{1,2} + a_{1,2} * b_{2,2}$. If you need a refresher on matrix multiplication, we suggest you take a look at the great *Introduction to Matrix Multiplcation* on *Khan Academy*, since this is the most important mathematical operation in deep learning.\n",
|
||||
"This image shows two matrices, `A` and `B` being multiplied together. Each item of the result, which we'll call `AB`, contains each item of its corresponding row of `A` multiplied by each item of its corresponding column of `B`, added together. For instance, row 1 column 2 (the orange dot with a red border) is calculated as $a_{1,1} * b_{1,2} + a_{1,2} * b_{2,2}$. If you need a refresher on matrix multiplication, we suggest you take a look at the great *Introduction to Matrix Multiplication* on *Khan Academy*, since this is the most important mathematical operation in deep learning.\n",
|
||||
"\n",
|
||||
"In Python, matrix multiplication is represented with the `@` operator. Let's try it:"
|
||||
]
|
||||
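The passage above walks through how each entry of a matrix product is a sum of element-wise products and notes that Python uses the `@` operator for it. A small illustrative check of that formula, using PyTorch (assumed here because the book uses it throughout; the numbers are made up):

```python
import torch

a = torch.tensor([[1., 2.],
                  [3., 4.]])
b = torch.tensor([[10., 20.],
                  [30., 40.]])

ab = a @ b  # matrix multiplication via the @ operator

# Row 1, column 2 of the result is a[0,0]*b[0,1] + a[0,1]*b[1,1]
# (0-based indices), matching the formula in the text.
manual = a[0, 0] * b[0, 1] + a[0, 1] * b[1, 1]
assert torch.isclose(ab[0, 1], manual)
print(ab)
```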
|
@@ -949,7 +949,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You have now seen everything you need to train a state-of-the-art model in computer vision, whether from scratch or using tranfer learning. Now all you have to do is experiment on your own problems! See if training longer with Mixup and/or label smoothing avoids overfitting and gives you better results. Try progressive resizing, and test time augmentation.\n",
|
||||
"You have now seen everything you need to train a state-of-the-art model in computer vision, whether from scratch or using transfer learning. Now all you have to do is experiment on your own problems! See if training longer with Mixup and/or label smoothing avoids overfitting and gives you better results. Try progressive resizing, and test time augmentation.\n",
|
||||
"\n",
|
||||
"Most importantly, remember that if your dataset is big, there is no point prototyping on the whole thing. Find a small subset that is representative of the whole, like we did with Imagenette, and experiment on it.\n",
|
||||
"\n",
|
||||
|
@@ -730,9 +730,9 @@
|
||||
"\n",
|
||||
"How do we determine numbers to characterize those? The answer is, we don't. We will let our model *learn* them. By analyzing the existing relations between users and movies, let our model figure out itself the features that seem important or not.\n",
|
||||
"\n",
|
||||
"This is what embeddings are. We will attribute to each of our users and each of our movie a random vector of a certain length (here `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and udpate them with the rule of SGD (or another optimizer).\n",
|
||||
"This is what embeddings are. We will attribute to each of our users and each of our movie a random vector of a certain length (here `n_factors=5`), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rule of SGD (or another optimizer).\n",
|
||||
"\n",
|
||||
"At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from indepent cinema, action movies from romance...\n",
|
||||
"At the beginning, those numbers don't mean anything since we have chosen them randomly, but by the end of training, they will. By learning on existing data between users and movies, without having any other information, we will see that they still get some important features, and can isolate blockbusters from independent cinema, action movies from romance...\n",
|
||||
"\n",
|
||||
"We are now in a position that we can create our whole model from scratch."
|
||||
]
|
||||
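The passage above describes embeddings as random, learnable vectors of length `n_factors=5` for every user and movie, updated by SGD through the loss. A minimal sketch of that idea in plain PyTorch; the sizes, the dot-product scoring, and the learning rate are illustrative assumptions, not the book's exact model:

```python
import torch

n_users, n_movies, n_factors = 100, 50, 5

# Random vectors for every user and every movie; requires_grad makes them learnable.
user_factors  = torch.randn(n_users,  n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

# Predicted rating for (user 3, movie 7): dot product of their two vectors.
pred = (user_factors[3] * movie_factors[7]).sum()

target = torch.tensor(4.0)
loss = (pred - target) ** 2
loss.backward()            # gradients of the loss w.r.t. the embedding vectors

with torch.no_grad():      # one SGD step on the embeddings themselves
    lr = 0.1
    user_factors  -= lr * user_factors.grad
    movie_factors -= lr * movie_factors.grad
```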
|
@@ -143,7 +143,7 @@
|
||||
"source": [
|
||||
"What stands out in these two examples is that we provide the model fundamentally categorical data about discrete entities (German states or days of the week), and then the model learns an embedding for these entities which defines a continuous notion of distance between them. Because the embedding distance was learned based on real patterns in the data, that distance tends to match up with our intuitions.\n",
|
||||
"\n",
|
||||
"In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continous parameter weights, continuous activation values, all updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n",
|
||||
"In addition, it is also valuable in its own right that embeddings are continuous. It is valuable because models are better at understanding continuous variables. This is unsurprising considering models are built of many continuous parameter weights, continuous activation values, all updated via gradient descent, a learning algorithm for finding the minimums of continuous functions.\n",
|
||||
"\n",
|
||||
"Is is also valuable because we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables, and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer, before it interacts with the raw continuous input data. This is how fastai, and the entity embeddings paper, handle tabular models containing continuous and categorical variables.\n",
|
||||
"\n",
|
||||
@@ -191,7 +191,7 @@
|
||||
"source": [
|
||||
"Although deep learning is nearly always clearly superior for unstructured data, these two approaches tend to give quite similar results for many kinds of structured data. But ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning. They have been popular for quite a lot longer than deep learning, so there is a more mature ecosystem for tooling and documentation around them.\n",
|
||||
"\n",
|
||||
"Most importantly, the critical step of intepreting a model of tabular data is significantly easier for decision tree ensembles. There are tools and methods for answering the pertinent questions. For instance, which columns in the dataset were the most important for your predictions? How are they related to the dependent variable? How do they interact with each other? And which particular features were most important for some particular observation?\n",
|
||||
"Most importantly, the critical step of interpreting a model of tabular data is significantly easier for decision tree ensembles. There are tools and methods for answering the pertinent questions. For instance, which columns in the dataset were the most important for your predictions? How are they related to the dependent variable? How do they interact with each other? And which particular features were most important for some particular observation?\n",
|
||||
"\n",
|
||||
"Therefore, ensembles of decision trees are our first approach for analysing a new tabular dataset.\n",
|
||||
"\n",
|
||||
@@ -553,7 +553,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree which we just described is bisection -- dividing up a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshhold, and we look at the categorical variables and divided up the datset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.\n",
|
||||
"The first piece of data preparation we need to do is to enrich our representation of dates. The fundamental basis of the decision tree which we just described is bisection -- dividing up a group into two. We look at the ordinal variables and divide up the dataset based on whether the variable's value is greater (or lower) than a threshhold, and we look at the categorical variables and divided up the dataset based on whether the variable's level is a particular level. So this algorithm has a way of dividing up the dataset based on both ordinal and categorical data.\n",
|
||||
"\n",
|
||||
"How does this apply to a common data type, the date? You might want to treat a date as an ordinal value, because it is meaningful to say that one date is greater than another. However, dates are a bit different from most ordinal values in that some dates are qualitatively different from others in a way that that is often relevant to the systems we are modelling.\n",
|
||||
"\n",
|
||||
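The date-enrichment step described above boils down to expanding a single date column into several ordinal and categorical columns that a tree can split on. A pandas sketch under that assumption (the column names and sample dates are invented; fastai ships an `add_datepart` helper that performs this kind of expansion on real datasets):

```python
import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-15', '2011-12-25', '2012-01-02'])})

# Expand the single date into columns a decision tree can use for bisection.
dates = df['saledate'].dt
df['Year']         = dates.year
df['Month']        = dates.month
df['Day']          = dates.day
df['Dayofweek']    = dates.dayofweek
df['Is_month_end'] = dates.is_month_end
print(df)
```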
|
@@ -393,7 +393,7 @@
|
||||
"- `spec_add_spaces`: add spaces around / and # ;\n",
|
||||
"- `rm_useless_spaces`: remove all repetitions of the space character ;\n",
|
||||
"- `replace_all_caps`: lowercase a word written in all caps and adds a special token for all caps (xxcap) in front of it ;\n",
|
||||
"- `replace_maj`: lowercase a capilaized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
|
||||
"- `replace_maj`: lowercase a capitalized word and adds a special token for capitalized (xxmaj) in front of it ;\n",
|
||||
"- `lowercase`: lowercase all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)."
|
||||
]
|
||||
},
|
||||
@@ -447,7 +447,7 @@
|
||||
"\n",
|
||||
"To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:\n",
|
||||
"\n",
|
||||
"1. Analyze a corpus of documents to find the most commonly occuring groups of letters. These become the vocab.\n",
|
||||
"1. Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.\n",
|
||||
"2. Tokenize the corpus using this vocab of *subword units*.\n",
|
||||
"\n",
|
||||
"Let's look at an example. For our corpus, we'll use the first 2000 movie reviews:"
|
||||
@@ -1189,11 +1189,11 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Going back to our dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order in which the inputs come, so at the beginnaing of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside, otherwise the text would not make sense anymore).\n",
|
||||
"Going back to our dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order in which the inputs come, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside, otherwise the text would not make sense anymore).\n",
|
||||
"\n",
|
||||
"We will then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...) because we want the model to read continuous rows of text (as in our example above). This is why each text has been added a `xxbos` token during preprocessing, so that the model knows when it reads the stream we are beginning a new entry.\n",
|
||||
"\n",
|
||||
"So to recap, at every epoch we shuffle our collection of documents to pick one docment, and then we transform that one into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
|
||||
"So to recap, at every epoch we shuffle our collection of documents to pick one document, and then we transform that one into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
|
||||
"\n",
|
||||
"This is all done behind the scenes by the fastai library when we create a `LMDataLoader`. We can create one by first applying our `Numericalize` object to the tokenized texts:"
|
||||
]
|
||||
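The numbers above (a 50,000-token stream, batch size 10, hence 10 mini-streams of 5,000 tokens each, read in order) can be reproduced directly; the sequence length used below is an arbitrary illustration:

```python
import torch

stream = torch.arange(50_000)             # stand-in for the concatenated stream of token ids
bs = 10                                   # batch size -> 10 mini-streams

stream = stream[: len(stream) // bs * bs] # drop any remainder so the stream divides evenly
mini_streams = stream.view(bs, -1)        # shape (10, 5000); row 0 holds tokens 0..4999,
                                          # row 1 holds tokens 5000..9999, and so on

seq_len = 72                              # illustrative sequence length
first_batch  = mini_streams[:, :seq_len]            # 10 rows of 72 consecutive tokens
second_batch = mini_streams[:, seq_len:2 * seq_len] # each row picks up exactly where it left off
print(first_batch.shape, second_batch.shape)
```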
|
@@ -535,7 +535,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"> A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at `tokens` reminded me that large numbers are written with many words, so on the way to 10,000 you write \"thousand\" a lot: five thousand, five thousand and one, five thousand and two, etc.. Oops! Looking at your data is great for noticing subtle featues and also embarrassingly obvious ones."
|
||||
"> A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at `tokens` reminded me that large numbers are written with many words, so on the way to 10,000 you write \"thousand\" a lot: five thousand, five thousand and one, five thousand and two, etc.. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -390,7 +390,7 @@
|
||||
"\n",
|
||||
"The first gate (looking from the left to right) is called the *forget gate*. Since it's a linear layer followed by a sigmoid, its output will have scalars between 0 and 1. We multiply this result y the cell gate, so for all the values close to 0, we will forget what was inside that cell state (and for the values close to 1 it doesn't do anything). This gives the ability to the LSTM to forget things about its longterm state. For instance, when crossing a period or an `xxbos` token, we would expect to it to (have learned to) reset its cell state.\n",
|
||||
"\n",
|
||||
"The second gate works is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoum, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it jsut decides which element of the cell state to update (valeus close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
|
||||
"The second gate works is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance we may see a new gender pronoun, so we must replace the information about gender that the forget gate removed by the new one. Like the forget gate, the input gate ends up on a product, so it jsut decides which element of the cell state to update (valeus close to 1) or not (values close to 0). The third gate will then fill those values with things between -1 and 1 (thanks to the tanh). The result is then added to the cell state.\n",
|
||||
"\n",
|
||||
"The last gate is the *output gate*. It will decides which information take in the cell state to generate the output. The cell state goes through a tanh before this and the output gate combined with the sigmoid decides which values to take inside it.\n",
|
||||
"\n",
|
||||
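The four gates described above follow the standard LSTM cell equations. A generic sketch of such a cell in PyTorch, not the book's exact implementation (the layer names are invented for readability):

```python
import torch
from torch import nn

class LSTMCellSketch(nn.Module):
    def __init__(self, n_in, n_hidden):
        super().__init__()
        # One linear layer per gate, each acting on the concatenated [hidden, input] vector.
        self.forget_gate = nn.Linear(n_in + n_hidden, n_hidden)
        self.input_gate  = nn.Linear(n_in + n_hidden, n_hidden)
        self.cell_gate   = nn.Linear(n_in + n_hidden, n_hidden)
        self.output_gate = nn.Linear(n_in + n_hidden, n_hidden)

    def forward(self, x, state):
        h, c = state
        hx = torch.cat([h, x], dim=1)
        f = torch.sigmoid(self.forget_gate(hx))  # 0..1: how much of the old cell state to keep
        i = torch.sigmoid(self.input_gate(hx))   # 0..1: which cell entries to update
        g = torch.tanh(self.cell_gate(hx))       # -1..1: candidate values for the update
        o = torch.sigmoid(self.output_gate(hx))  # 0..1: which parts of the cell state to output
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, (h, c)
```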
|
@@ -1540,7 +1540,7 @@
|
||||
"\n",
|
||||
"The reason for these extra axes is that PyTorch has a few tricks up its sleeve. The first trick is that PyTorch can apply a convolution to multiple images at the same time. That means we can call it on every item in a batch at once!\n",
|
||||
"\n",
|
||||
"The second trick is that PyTorch can apply multiple kernels at the same time. So let's create the diagnoal edge kernels too, and then stack all 4 of our edge kernels into a single tensor:"
|
||||
"The second trick is that PyTorch can apply multiple kernels at the same time. So let's create the diagonal edge kernels too, and then stack all 4 of our edge kernels into a single tensor:"
|
||||
]
|
||||
},
|
||||
{
|
||||
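As a sketch of the two tricks mentioned above, applying one convolution to a whole batch and applying several kernels at once, four hand-made 3x3 edge kernels can be stacked and passed to `F.conv2d`; the kernel values and image sizes below are illustrative rather than the book's:

```python
import torch
import torch.nn.functional as F

top_edge    = torch.tensor([[-1., -1., -1.], [ 0., 0., 0.], [ 1.,  1.,  1.]])
bottom_edge = torch.tensor([[ 1.,  1.,  1.], [ 0., 0., 0.], [-1., -1., -1.]])
left_edge   = torch.tensor([[-1.,  0.,  1.], [-1., 0., 1.], [-1.,  0.,  1.]])
right_edge  = torch.tensor([[ 1.,  0., -1.], [ 1., 0., -1.], [ 1.,  0., -1.]])

# Shape (4, 1, 3, 3): 4 output channels, 1 input channel, 3x3 kernels.
kernels = torch.stack([top_edge, bottom_edge, left_edge, right_edge]).unsqueeze(1)

batch = torch.randn(64, 1, 28, 28)   # a batch of 64 single-channel images
features = F.conv2d(batch, kernels)  # all 4 kernels applied to all 64 images at once
print(features.shape)                # torch.Size([64, 4, 26, 26])
```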
@@ -2409,7 +2409,7 @@
|
||||
"source": [
|
||||
"In this example, we just have two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7x7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7x7 area is the *receptive field* in the Input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.\n",
|
||||
"\n",
|
||||
"As you see from this example, the deeper we are in the network (specfically, the more stride 2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer. So we know now that in the deeper layers of the network, we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of seeing the same thing we saw in the previous section: when we introduce a stride 2 conv in our network, we should also increase the number of channels."
|
||||
"As you see from this example, the deeper we are in the network (specifically, the more stride 2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation in that layer. So we know now that in the deeper layers of the network, we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of seeing the same thing we saw in the previous section: when we introduce a stride 2 conv in our network, we should also increase the number of channels."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -633,7 +633,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To create `color_dim`, we take the histgram shown on the left here, and convert it into just the colored representation shown at the bottom. Then we flip it on its side, as shown on the right. We found that the distribution is clearer if we take the `log` of the histogram values. Then, Stefano describes:\n",
|
||||
"To create `color_dim`, we take the histogram shown on the left here, and convert it into just the colored representation shown at the bottom. Then we flip it on its side, as shown on the right. We found that the distribution is clearer if we take the `log` of the histogram values. Then, Stefano describes:\n",
|
||||
"\n",
|
||||
"> : The final plot for each layer is made by stacking the histogram of the activations from each batch along the horizontal axis. So each vertical slice in the visualisation represents the histogram of activations for a single batch. The color intensity corresponds to the height of the histogram, in other words the number of activations in each histogram bin.\n",
|
||||
"\n",
|
||||
@@ -722,7 +722,7 @@
|
||||
"source": [
|
||||
"The way batch normalization (often just called *batchnorm*) works is that it takes an average of the mean and standard deviations of the activations of a layer, and uses those to normalize the activations. However, this can cause problems because the network might really want some activations to be really high in order to make accurate predictions, they also add two learnable parameters (meaning they will be updated in our SGD step), usually called `gamma` and `beta`; after normalizing the activations to get some new activation vector `y`, a batchnorm layer returns `gamma*y + beta`.\n",
|
||||
"\n",
|
||||
"That why our activations can have any mean or variance, which is independant from the mean and std of the results of the previous layer. Those statistics are learned separately, making training easier on our model. The behavior is different during traning and validation: during training, we use the mean and standard deviation of the batch to normalize the data. During validation, we instead use a running mean of the statistics calculated during training.\n",
|
||||
"That why our activations can have any mean or variance, which is independent from the mean and std of the results of the previous layer. Those statistics are learned separately, making training easier on our model. The behavior is different during training and validation: during training, we use the mean and standard deviation of the batch to normalize the data. During validation, we instead use a running mean of the statistics calculated during training.\n",
|
||||
"\n",
|
||||
"Let's add a batchnorm layer to `conv`:"
|
||||
]
|
||||
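A minimal sketch of the training-time computation described above: normalize with the current batch's per-channel statistics, then scale and shift with the learnable `gamma` and `beta`. The running statistics used at validation time are deliberately left out, and the class name is invented:

```python
import torch
from torch import nn

class BatchNormSketch(nn.Module):
    def __init__(self, n_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, n_channels, 1, 1))   # learnable scale
        self.beta  = nn.Parameter(torch.zeros(1, n_channels, 1, 1))  # learnable shift

    def forward(self, x):
        # Training-time behaviour: per-channel mean and variance of the current batch.
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        var  = x.var(dim=(0, 2, 3), keepdim=True)
        y = (x - mean) / (var + self.eps).sqrt()
        return self.gamma * y + self.beta
```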
|
@@ -215,7 +215,7 @@
|
||||
"source": [
|
||||
"This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2x2 max pooling instead of stride 2 convolutions, since this paper was written before ResNets came along) and it shows the transposed convolutional layers on the right (they're called \"up-conv\" in this picture). Then then extra skip connections are shown as grey arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-net\" when you see this picture!\n",
|
||||
"\n",
|
||||
"With this architecture, the input to the transposed convolutions is not just the lower resolution grid in the preceding layer, but also the higher resolution grid in the resnet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class which auto-generates an archicture of the right size based on the data provided."
|
||||
"With this architecture, the input to the transposed convolutions is not just the lower resolution grid in the preceding layer, but also the higher resolution grid in the resnet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class which auto-generates an architecture of the right size based on the data provided."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -249,7 +249,7 @@
|
||||
"source": [
|
||||
"In practice, what this is saying is that the classifier contains a for loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models — but this time, we do not pool over CNN grid cells, but over RNN sequences.\n",
|
||||
"\n",
|
||||
"For this for loop we need to gather our data in batches, but each text needs to be treated separatedly, as they each have their own label. However, it's very likely that those texts won't have the good taste of being all of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
|
||||
"For this for loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own label. However, it's very likely that those texts won't have the good taste of being all of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
|
||||
"\n",
|
||||
"That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greater length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid having an extreme case where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation) we alter the randomness by making sure texts of comparable size are put together. It will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely random.\n",
|
||||
"\n",
|
||||
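The padding scheme described above can be sketched in a few lines; the token ids and the id assumed for the `xxpad` token are made up for the illustration:

```python
import torch

pad_id = 1                                   # hypothetical id of the xxpad token
texts = [torch.tensor(t) for t in ([2, 45, 87, 9],
                                   [2, 13, 6],
                                   [2, 77, 5, 19, 66, 4])]

max_len = max(len(t) for t in texts)         # pad everything to the longest text in the batch
batch = torch.full((len(texts), max_len), pad_id, dtype=torch.long)
for i, t in enumerate(texts):
    batch[i, :len(t)] = t
print(batch)
```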
|
@@ -38,7 +38,7 @@
|
||||
"new_weight = weight - lr * weight.grad\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"We implemented this from scratch in a training loop, and also saw that Pytorch provides a simple `nn.SGD` class that does this calcuation for each parameter for us. Let's now build some faster optimizers, using a flexible foundation."
|
||||
"We implemented this from scratch in a training loop, and also saw that Pytorch provides a simple `nn.SGD` class that does this calculation for each parameter for us. Let's now build some faster optimizers, using a flexible foundation."
|
||||
]
|
||||
},
|
||||
{
|
||||
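The update rule quoted above, `new_weight = weight - lr * weight.grad`, in runnable form as a generic sketch (the tiny linear model and the learning rate are placeholders, not the book's training loop):

```python
import torch

weight = torch.randn(3, requires_grad=True)
x, y = torch.randn(10, 3), torch.randn(10)
lr = 0.1

pred = x @ weight                  # a trivial linear model
loss = ((pred - y) ** 2).mean()
loss.backward()                    # populates weight.grad

with torch.no_grad():
    weight -= lr * weight.grad     # the SGD step from the text
    weight.grad.zero_()            # reset gradients before the next step
```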
@@ -731,7 +731,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"And we can define our step function and optimzer as before:"
|
||||
"And we can define our step function and optimizer as before:"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -299,11 +299,11 @@
|
||||
"source": [
|
||||
"You can detect one of those exceptions occurred and add code that executes right after with the following events:\n",
|
||||
"\n",
|
||||
"- `after_cancel_batch`: reached imediately after a `CancelBatchException` before proceeding to `after_batch`\n",
|
||||
"- `after_cancel_train`: reached imediately after a `CancelTrainException` before proceeding to `after_epoch`\n",
|
||||
"- `after_cancel_valid`: reached imediately after a `CancelValidException` before proceeding to `after_epoch`\n",
|
||||
"- `after_cancel_epoch`: reached imediately after a `CancelEpochException` before proceeding to `after_epoch`\n",
|
||||
"- `after_cancel_fit`: reached imediately after a `CancelFitException` before proceeding to `after_fit`"
|
||||
"- `after_cancel_batch`: reached immediately after a `CancelBatchException` before proceeding to `after_batch`\n",
|
||||
"- `after_cancel_train`: reached immediately after a `CancelTrainException` before proceeding to `after_epoch`\n",
|
||||
"- `after_cancel_valid`: reached immediately after a `CancelValidException` before proceeding to `after_epoch`\n",
|
||||
"- `after_cancel_epoch`: reached immediately after a `CancelEpochException` before proceeding to `after_epoch`\n",
|
||||
"- `after_cancel_fit`: reached immediately after a `CancelFitException` before proceeding to `after_fit`"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -1070,7 +1070,7 @@
|
||||
"Result (2d tensor): 3 x 3\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"As a little exercice around those rules, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size `64 x 3 x 256 x 256` with vectors of three elements (one for the mean and one for the standard deviation)."
|
||||
"As a little exercise around those rules, try to determine what dimensions to add (and where) when you need to normalize a batch of images of size `64 x 3 x 256 x 256` with vectors of three elements (one for the mean and one for the standard deviation)."
|
||||
]
|
||||
},
|
||||
{
|
||||
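One way to answer the exercise above: give the 3-element mean and standard deviation unit axes before and after the channel dimension so they broadcast against `64 x 3 x 256 x 256`. The statistics below are illustrative values:

```python
import torch

imgs = torch.randn(64, 3, 256, 256)         # batch x channels x height x width
mean = torch.tensor([0.485, 0.456, 0.406])  # one value per channel (illustrative numbers)
std  = torch.tensor([0.229, 0.224, 0.225])

# Reshape to (1, 3, 1, 1) so the broadcasting rules match dimensions from the end:
#   64 x 3 x 256 x 256
#    1 x 3 x   1 x   1
normalized = (imgs - mean[None, :, None, None]) / std[None, :, None, None]
print(normalized.shape)
```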
@@ -1114,7 +1114,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Einsteim summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have only one member in the left hand side. For instance\n",
|
||||
"Einstein summation is a very practical way of expressing operations involving indexing and sum of products. Note that you can have only one member in the left hand side. For instance\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"torch.einsum('ij->ji', a)\n",
|
||||
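Two concrete `einsum` calls matching the description above, transposition and matrix multiplication, checked against the ordinary operators (the shapes are arbitrary):

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)

transposed = torch.einsum('ij->ji', a)        # indices swapped on the output: same as a.t()
matmul     = torch.einsum('ik,kj->ij', a, b)  # sum of products over the repeated index k

assert torch.allclose(transposed, a.t())
assert torch.allclose(matmul, a @ b)
```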
@@ -1182,7 +1182,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will take the example of a two-layer neural net first. As we saw, one layer can be expressed as `y = x @ w + b` with `x` out inputs, `y` our outputs, `w` the weights of the layer (which is of size numbe of inputs by neuron of neurons if we don't transpose like before) and `b` is the bias vector. "
|
||||
"We will take the example of a two-layer neural net first. As we saw, one layer can be expressed as `y = x @ w + b` with `x` out inputs, `y` our outputs, `w` the weights of the layer (which is of size number of inputs by neuron of neurons if we don't transpose like before) and `b` is the bias vector. "
|
||||
]
|
||||
},
|
||||
{
|
||||
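The layer `y = x @ w + b` from the passage above with concrete shapes, chosen to line up with the 200-by-50 hidden state mentioned a little later in this diff (200 examples, 100 inputs, 50 neurons):

```python
import torch

x = torch.randn(200, 100)   # a batch of 200 inputs with 100 features each
w = torch.randn(100, 50)    # weights: 100 inputs -> 50 neurons
b = torch.zeros(50)         # one bias per neuron

y = x @ w + b               # the linear layer; b is broadcast over the 200 rows
print(y.shape)              # torch.Size([200, 50])
```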
@@ -1266,7 +1266,7 @@
|
||||
"source": [
|
||||
"Note that this formula works with our batch of inputs, and returns a batch of hidden state: `l1` is a matrix of 200 (our batch size) by 50 (our hidden size).\n",
|
||||
"\n",
|
||||
"There is a problem with the way our model was initiliazed however. To understand it, we need to look at the mean and standard deviation (std) of `l1`."
|
||||
"There is a problem with the way our model was initialized however. To understand it, we need to look at the mean and standard deviation (std) of `l1`."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1423,7 +1423,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can play a little bit with the values of the scale and notice that even a slight variation from 0.1 will get you either to very small or very alrge numbers, so initializing the weights properly is extremely important. Let's go back to our neural net. Since we messed a bit with our inputs we need to redefine them:"
|
||||
"You can play a little bit with the values of the scale and notice that even a slight variation from 0.1 will get you either to very small or very large numbers, so initializing the weights properly is extremely important. Let's go back to our neural net. Since we messed a bit with our inputs we need to redefine them:"
|
||||
]
|
||||
},
|
||||
{
|
||||
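The sensitivity to the weight scale mentioned above can be seen by pushing activations through a stack of random linear layers: for 100-wide layers, a scale well below 0.1 makes the activations vanish, and well above it makes them explode. A quick sketch with assumed sizes:

```python
import torch

def activation_std(scale, n=100, n_layers=50):
    x = torch.randn(200, n)
    for _ in range(n_layers):
        x = x @ (torch.randn(n, n) * scale)  # one linear layer with weights at this scale
    return x.std()

for scale in (0.01, 0.1, 0.2):
    # far below 0.1 -> near zero (vanishing); above 0.1 -> enormous (exploding)
    print(scale, activation_std(scale))
```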
@@ -2317,7 +2317,7 @@
|
||||
"- A neural net is basically a bunch of matrix multiplications with non-linearities in-between.\n",
|
||||
"- Python is slow so to write fast code we have to vectorize it and take advantage of element-wise arithmetic or broadcasting.\n",
|
||||
"- Two tensors are broadcastable if the dimensions starting from the end and going backward match (they are the same or one of them is 1). To make tensors broadcastable, we may need to add dimensions of size 1 with `unsqueeze` or a `None` index.\n",
|
||||
"- Properly initiliazing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU non-linearities.\n",
|
||||
"- Properly initializing a neural net is crucial to get training started. Kaiming initialization should be used when we have ReLU non-linearities.\n",
|
||||
"- The backward pass is the chain rule applied multiple times, computing the gradients from the output of our model and going back, one layer at a time.\n",
|
||||
"- When subclassing `nn.Module` (if not using fastai's `Module`) we have to call the superclass `__init__` method in our `__init__` method and we have to define a `forward` function that takes an input and returns the desired result."
|
||||
]
|
||||
|
@@ -22,7 +22,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous chapter, we showed on examples of texts what `Tokenizer` or a `Numericalize` do to a collection of texts, to explain the usual preprocessing in NLP. We then switched to the data block API, that handles those transforms for us directly using the `TextBlock`. But what if we want to only apply one of those transforms, either to see intermediate results or because we have already tokenized texts. More generally, what can we do when the data block API is not flexible enough to accomodate our particular use case?\n",
|
||||
"In the previous chapter, we showed on examples of texts what `Tokenizer` or a `Numericalize` do to a collection of texts, to explain the usual preprocessing in NLP. We then switched to the data block API, that handles those transforms for us directly using the `TextBlock`. But what if we want to only apply one of those transforms, either to see intermediate results or because we have already tokenized texts. More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case?\n",
|
||||
"\n",
|
||||
"In this chapter, we will saw how to use what we call the mid-level API for processing data. The data block API is built on top of that layer, so it will allow you to do everything the data block API does, and much much more! After looking at all the pieces that compose it on our example to preprocess text (like in the last chapter), we will show you an example of preparing data for a Siamese Network, which is a model that takes two images as inputs, and has to predict if those images are of the same class or not."
|
||||
]
|
||||
@@ -207,7 +207,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"On ther other end, when looking at our data, we would like to see the result of the tokenization, to make sure none of the rules damaged the texts, so `tok` does not have a decode method (in practice, it has one that does nothing). It's the same for data augmentation transforms: since we want to show the effects on images, to make sure we didn't do too much data augmentation (or not enough) we don't decode those transforms. However, we need to undo the effects of the `Normalize` transform we saw in <<chapter_sizing_and_tta>> to be able to plt the images, so this one has a decode method."
|
||||
"On the other end, when looking at our data, we would like to see the result of the tokenization, to make sure none of the rules damaged the texts, so `tok` does not have a decode method (in practice, it has one that does nothing). It's the same for data augmentation transforms: since we want to show the effects on images, to make sure we didn't do too much data augmentation (or not enough) we don't decode those transforms. However, we need to undo the effects of the `Normalize` transform we saw in <<chapter_sizing_and_tta>> to be able to plt the images, so this one has a decode method."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -287,7 +287,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here `MyTfm` will initialize some states during the setup (the mean of all elements passed), then the transformation is to add that mean. For decoding purposes, we implement the reverse of that transformation by substracting the mean. Here is an example of `myTfm` in action:"
|
||||
"Here `MyTfm` will initialize some states during the setup (the mean of all elements passed), then the transformation is to add that mean. For decoding purposes, we implement the reverse of that transformation by subtracting the mean. Here is an example of `myTfm` in action:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -333,7 +333,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To compose several transforms together, fastai uses `Pipeline`. You can define a `Pipeline` by passing it a list of `Transform`s and it will then compose the transforms indide it: when you call it on an object, it will automatically call the transforms inside in order:"
|
||||
"To compose several transforms together, fastai uses `Pipeline`. You can define a `Pipeline` by passing it a list of `Transform`s and it will then compose the transforms incide it: when you call it on an object, it will automatically call the transforms inside in order:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -402,7 +402,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You data is usually a set of raw items (like filenames, or rows in a dataframe) to which you want to apply a succesion of transformations. We just saw that the sucession of transformations was represented by a `Pipeline` in fastai. The class that groups together this pipeline with your raw items is called `TfmdLists`. Here is the short way of doing the transformation we saw in the previous section:"
|
||||
"You data is usually a set of raw items (like filenames, or rows in a dataframe) to which you want to apply a succession of transformations. We just saw that the succession of transformations was represented by a `Pipeline` in fastai. The class that groups together this pipeline with your raw items is called `TfmdLists`. Here is the short way of doing the transformation we saw in the previous section:"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -807,7 +807,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A Siamese model takes to images and has to determine if they are of the same classe or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. TK see if we train that model later in the book. "
|
||||
"A Siamese model takes two images and has to determine if they are of the same class or not. For this example, we will use the pets dataset again, and prepare the data for a model that will have to predict if two images of pets are of the same breed or not. TK see if we train that model later in the book. "
|
||||
]
|
||||
},
|
||||
{
|
||||
|