Tokenizer description fixes in 10_nlp.ipynb (#350)
* Fix description of tokenizing repeating chars

The description of how "!!!!" is tokenized listed the output tokens in a different order from the one the tokenizer actually produces.

* Update "xxcap" to "xxup"

Line 387 shows that "xxup" is used, not "xxcap".
parent 074592b3e6
commit 08746b427e
@@ -309,7 +309,7 @@
 "\n",
 "These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized language—a language that is designed to be easy for a model to learn.\n",
 "\n",
-"For instance, the rules will replace a sequence of four exclamation points with a single exclamation point, followed by a special *repeated character* token, and then the number four. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
+"For instance, the rules will replace a sequence of four exclamation points with a special *repeated character* token, followed by the number four, and then a single exclamation point. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
 "\n",
 "Here are some of the main special tokens you'll see:\n",
 "\n",
@@ -364,7 +364,7 @@
 "- `replace_wrep`:: Replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word\n",
 "- `spec_add_spaces`:: Adds spaces around / and #\n",
 "- `rm_useless_spaces`:: Removes all repetitions of the space character\n",
-"- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxcap`) in front of it\n",
+"- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxup`) in front of it\n",
 "- `replace_maj`:: Lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it\n",
 "- `lowercase`:: Lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)"
 ]
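The two corrected descriptions can be illustrated with a small sketch. This is not fastai's source: `sketch_replace_rep` and `sketch_replace_all_caps` are hypothetical helpers written only for this illustration, the regular expressions are simplified, and the "three or more repeats" threshold for characters is an assumption. The token strings `xxrep` and `xxup` are the ones named in the notebook text.

```python
import re

# Token strings as named in the notebook text.
TK_REP, TK_UP = "xxrep", "xxup"

# A non-space character repeated three or more times in a row
# (the threshold is an assumption made for this sketch).
_RE_REP = re.compile(r"(\S)\1{2,}")
# A standalone word made of two or more capital letters (simplified).
_RE_ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")


def sketch_replace_rep(text):
    """Emit: repeat token, then the count, then the character itself."""
    def _sub(match):
        return f" {TK_REP} {len(match.group(0))} {match.group(1)} "
    return _RE_REP.sub(_sub, text)


def sketch_replace_all_caps(text):
    """Lowercase an all-caps word and put the all-caps token in front of it."""
    return _RE_ALL_CAPS.sub(lambda m: f"{TK_UP} {m.group(0).lower()}", text)


print(sketch_replace_rep("Wow!!!!").split())
# ['Wow', 'xxrep', '4', '!']
print(sketch_replace_all_caps("this is GREAT").split())
# ['this', 'is', 'xxup', 'great']
```

The first output follows the order in the corrected paragraph (token, count, character), and the second uses `xxup` rather than `xxcap`, matching the updated rule description.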