commit e6c0f5ace0
parent 2599629a9f
Author: Sylvain Gugger
Date: 2020-03-29 11:09:34 -07:00


@@ -1146,7 +1146,7 @@
"\n",
"We will then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...) because we want the model to read continuous rows of text (as in our example above). This is why each text has been added a `xxbos` token during preprocessing, so that the model knows when it reads the stream we are beginning a new entry.\n",
"\n",
"So to recap, at every epoch we shuffle our collection of documents to pick one document, and then we transform that one into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
"So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
"\n",
"This is all done behind the scenes by the fastai library when we create a `LMDataLoader`. We can create one by first applying our `Numericalize` object to the tokenized texts:"
]