Fix #69
parent 2599629a9f
commit e6c0f5ace0
@@ -1146,7 +1146,7 @@
     "\n",
     "We will then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens each. What is important is that we preserve the order of the tokens (so tokens 1 to 5,000 for the first mini-stream, then tokens 5,001 to 10,000, and so on), because we want the model to read continuous rows of text (as in our example above). This is why an `xxbos` token was added to each text during preprocessing, so that the model knows, when it reads the stream, that a new entry is beginning.\n",
     "\n",
"So to recap, at every epoch we shuffle our collection of documents to pick one document, and then we transform that one into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
|
||||
"So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length you picked.\n",
|
||||
"\n",
|
||||
"This is all done behind the scenes by the fastai library when we create a `LMDataLoader`. We can create one by first applying our `Numericalize` object to the tokenized texts:"
|
||||
]
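
The cutting step described in the hunk above can be made concrete with a few lines of Python. This is a minimal sketch of that step only, not fastai's actual implementation; the function name is made up here, and the 50,000-token, batch-size-10 numbers are just the example from the text.

```python
# Minimal sketch (not fastai's implementation) of cutting one long token
# stream into `bs` mini-streams while preserving token order.
def cut_into_ministreams(stream, bs):
    seq_len = len(stream) // bs  # 50,000 tokens with bs=10 -> 5,000 each
    # Mini-stream i holds tokens i*seq_len .. (i+1)*seq_len - 1, so each
    # mini-stream is a contiguous, order-preserving slice of the stream.
    return [stream[i*seq_len:(i+1)*seq_len] for i in range(bs)]

stream = list(range(50_000))  # stand-in for a numericalized token stream
ministreams = cut_into_ministreams(stream, bs=10)
assert len(ministreams) == 10 and all(len(m) == 5_000 for m in ministreams)
assert ministreams[1][0] == 5_000  # second mini-stream starts where the first ended
```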
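The last context line of the hunk mentions `Numericalize` and `LMDataLoader`. Here is a sketch of how the two fit together, following the pattern the chapter itself uses; the tiny sample texts and the small `bs`/`seq_len` values are illustrative assumptions, not part of the commit.

```python
from fastai.text.all import *

# Tiny stand-in corpus (illustrative only).
txts = L(["This movie was great.", "I did not like this film at all."])
tok = Tokenizer(WordTokenizer())   # adds special tokens such as xxbos
toks = txts.map(tok)

num = Numericalize(min_freq=1)     # maps tokens to vocab indices
num.setup(toks)                    # builds the vocab from the tokenized texts
nums = toks.map(num)

# LMDataLoader concatenates the numericalized texts into one stream and
# serves (x, y) batches in order, with y offset from x by one token.
dl = LMDataLoader(nums, bs=2, seq_len=4)
x, y = first(dl)
```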