{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Generative Character-Level Language Models\n", "\n", "This is a variant of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recursive neural network (RNN) language models. The term [generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text) based on a model learned from training data. Back in 2015 generative AI was just starting to take off, and Karpathy's point was that the RNNs were unreasonably effective at generating good text, even though they are at heart quite simple. Goldberg's point was that, yes, that's true, but actually most of the magic is not in the RNNs, it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg did agree with Karpathy that the RNN captures some aspects of C++ code that the character-level model does not.\n", "\n", "My implementation is similar to Goldberg's, but I updated his code to use Python 3 instead of Python 2, and made some additional changes for simplicity and clarity. (This makes the code less efficient than it could be, but plenty fast enough.) \n", "\n", "## Definition\n", "\n", "What do we mean by a **generative character-level language model**? It means a model that, when given a sequence of characters, can predict what character comes next; it can generate a continuation of a partial text. (And when the partial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*c* | *h*), the probability distribution that the next character will be *c*, given a history of previous characters *h*. For example, given the previous characters `'chai'`, a character-level model should learn to predict that the next character is probably `'r'` or `'n'` (to form the word `'chair'` or `'chain'`). Goldberg calls this a model of order 4 (because it considers histories of length 4) while other authors call it an *n*-gram model with *n* = 5 (because it represents the probabilities of sequences of 5 characters).\n", "\n", "## Training Data\n", "\n", "How does the language model learn these probabilities? By observing a sequence of characters that we call the **training data**. Both Karpathy and Goldberg use the complete works of Shakespeare as their initial training data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 167204 832301 4573338 shakespeare_input.txt\n" ] } ], "source": [ "! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt\n", "! wc shakespeare_input.txt # Print the number of lines, words, and characters" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Speak, speak.\n", "\n", "First Citizen:\n", "You are all resolved rather to die than to famish?\n", "\n", "All:\n" ] } ], "source": [ "! 
head shakespeare_input.txt # First 10 lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Code\n", "\n", "There are four main parts to the code:\n", "\n", "- `LanguageModel` is a `defaultdict` that maps a history *h* to a `Counter` of the number of times each character *c* appears immediately following *h* in the training data. \n", "- `train_LM` takes a string of training `data` and an `order`, and builds a language model, formed by counting the times each character *c* occurs and storing that under the entry for the history *h* of characters that precede *c*. \n", "- `generate_text` generates a random text, given a language model, a desired length, and an optional start of the text. At each step it looks at the previous `order` characters and chooses a new character at random from the language model's counter for those previous characters.\n", "- `random_sample` randomly chooses a single character from a counter, with each possibility chosen in proportion to the character's count." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import random\n", "from collections import defaultdict, Counter\n", "\n", "class LanguageModel(defaultdict): \"\"\"A mapping of {history: Counter(characters)}.\"\"\"\n", "\n", "def train_LM(data: str, order: int) -> LanguageModel:\n", " \"\"\"Train a character-level language model of given `order` on the training `data`.\"\"\"\n", " LM = LanguageModel(Counter)\n", " LM.order = order\n", " history = ''\n", " for c in data:\n", " LM[history][c] += 1\n", " history = (history + c)[-order:] # add c to history; truncate history to length `order`\n", " return LM\n", "\n", "def generate_text(LM: LanguageModel, length=1000, text='') -> str:\n", " \"\"\"Generate a random text of `length` characters, with an optional start, from `LM`.\"\"\"\n", " while len(text) < length:\n", " history = text[-LM.order:]\n", " text = text + random_sample(LM[history])\n", " return text\n", "\n", "def random_sample(counter: Counter) -> str:\n", " \"\"\"Randomly sample from the counter, proportional to each entry's count.\"\"\"\n", " i = random.randint(1, sum(counter.values()))\n", " cumulative = 0\n", " for c in counter:\n", " cumulative += counter[c]\n", " if cumulative >= i: \n", " return c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a model of order 4 on the Shakespeare data. 
We'll call the model `LM`, and we'll do some queries of it:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = open(\"shakespeare_input.txt\").read()\n", "\n", "LM = train_LM(data, order=4)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'n': 78, 'r': 35})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"chai\"]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Counter({'p': 1360,\n", " 's': 2058,\n", " 'l': 1006,\n", " 'o': 530,\n", " 'g': 1037,\n", " 'c': 1561,\n", " 'a': 554,\n", " 'C': 81,\n", " 'r': 804,\n", " 'h': 1029,\n", " 'R': 45,\n", " 'd': 1170,\n", " 'w': 1759,\n", " 'b': 1217,\n", " 'm': 1392,\n", " 'v': 388,\n", " 't': 1109,\n", " 'f': 1258,\n", " 'i': 298,\n", " 'n': 616,\n", " 'V': 18,\n", " 'e': 704,\n", " 'u': 105,\n", " 'L': 105,\n", " 'y': 120,\n", " 'A': 29,\n", " 'H': 20,\n", " 'k': 713,\n", " 'M': 54,\n", " 'T': 102,\n", " 'j': 99,\n", " 'q': 171,\n", " 'K': 22,\n", " 'D': 146,\n", " 'P': 54,\n", " 'S': 40,\n", " 'G': 75,\n", " 'I': 14,\n", " 'B': 31,\n", " 'W': 14,\n", " 'E': 77,\n", " 'F': 103,\n", " 'O': 3,\n", " \"'\": 10,\n", " 'z': 6,\n", " 'J': 30,\n", " 'N': 18,\n", " 'Q': 7})" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"the \"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `\"chai\"` is followed by either `'n'` or `'r'`, and almost any letter can follow `\"the \"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Shakespeare\n", "\n", "Let's try to generate random text based on character language models of various orders, starting with order 4." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First\n", "five men crown tribunes an aunt, holden a suddenly daught.\n", "\n", "HECTOR CAIUS:\n", "And drawn by your confess, the jangle such a conferritoriest an make a cost, as you were the world,--\n", "\n", "BENVOLIO:\n", "Where shorter:\n", "A stom old been;\n", "Get you may parts food;\n", "I serve memory her. He is come fire to the\n", "skirted great knowledges,\n", "monster Ajax, thou do thy heart to spend theat--unhappy in and so! There shall spectar? Goodman! we are mine an heart in then\n", "The stomach times bear too: the emperformane\n", "And least,\n", "And then you are my grate and as\n", "A woman, I cannon down!' 'Course of my\n", "love,\n", "The tillo, away heard\n", "You soul issue us comes hand,\n", "To Julius, that pattering teach thither\n", "that for come in\n", "a fathere growned from far\n", "Crying, from yoursed into this.\n", "\n", "SILVIA:\n", "Wilt before.\n", "\n", "PAULINA:\n", "Might him: but Marshall be my fail age in fat, remember than arms? calls and the compulsion liar came him. 
If thy is bled this ever; or your tempts,\n", "Open an of think it from him our changentleman, more ther titless them th\n" ] } ], "source": [ "print(generate_text(LM))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order 4 captures the structure of plays, mentions some characters, and generates mostly English words, although they don't always go together to form grammatical sentences, and there is certainly no coherence or plot. \n", "\n", "## Generating Order 7 Shakespeare\n", "\n", "What if we increase it to order 7? Or more? We find that it gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler model." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Clown:\n", "Like an ass of France to kill horns;\n", "And Brutus, and in such a one as he weariness does any strange fish! Were I adore.' When we arrives him not the time, whither argument?\n", "\n", "MARIA:\n", "Get thee on.\n", "\n", "SIR TOBY BELCH:\n", "Why, 'tis well.\n", "\n", "CLOTEN:\n", "Sayest trusts to your royal graces,\n", "I will draw his heinous and holiness\n", "Than are to breath.\n", "\n", "ISABELLA:\n", "Madam, pardon me: teach you, sirs, be it lying so, yet but the 'ever' last?\n", "\n", "EDWARD:\n", "An oath in it to bid you. You a lover dearly to our roses;\n", "For intercepted pardon him,\n", "And even now\n", "In any branches, wherefore let us go seek him:\n", "There's a good master; thyself a wise men,\n", "Let him when you are\n", "going to his entering\n", "into so quickly.\n", "Which all bosom as a bell,\n", "Remember thee who I am. Good Paulina more.' And in an hour?\n", "\n", "ORLANDO:\n", "As I wear\n", "In the eastern gate, horse!\n", "Do but he hath astonish thee apt;\n", "And this pardon me, I conjure them:\n", "To show more offering in saying them, whose beauty starves the night\n", "Did Jessica:\n", "Besides, Antony. But art \n" ] } ], "source": [ "print(generate_text(train_LM(data, order=7)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 Shakespeare" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "That cannot go but thirty miles to ride yet ere day.\n", "\n", "PUCK:\n", "Now the pleasure of the realm in farm.\n", "\n", "LORD WILLOUGHBY:\n", "And daily graced by an inkhorn mate,\n", "We and our power,\n", "Let us see:\n", "Write, 'Lord have mercies\n", "More than all his creature in her, you may\n", "say they be not take my plight shall lie\n", "His old betrothed lord.\n", "\n", "URSULA:\n", "She's limed, I warrant;\n", "speciously on him;\n", "Lose not so near:\n", "I had rather be at a breakfast to the abject rear,\n", "O'er-run and trampled on: then what is this law?\n", "\n", "First Murderer:\n", "What speech, my lord\n", "For certain, and is gone aboard a\n", "new ship to purge him of the affected.\n", "\n", "PRINCE:\n", "Give me a copy of the forlorn French!\n", "Him I forgive thee,\n", "Unnatural though the very life\n", "Of my dear friend Leonato hath\n", "invited you all. 
I tell him we shall have 'em\n", "Talk us to silence.\n", "\n", "ANNE:\n", "You can do better yet\n", "And show the increasing in love?\n", "\n", "LUCETTA:\n", "That they travail for, if it were not virtue, not\n", "For such proceeding by the way\n", "Should have both the parties of suspic\n" ] } ], "source": [ "print(generate_text(train_LM(data, order=10)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Probabilities\n", "\n", "Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*c* | *h*) can be computed as follows:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def P(c, h, LM=LM):\n", " \"\"\"The probability that character c follows history h.\"\"\"\n", " return LM[h][c] / sum(LM[h].values())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09286165508528112" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'the ')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6902654867256637" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.30973451327433627" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('r', 'chai')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shakespeare never wrote about \"chaise longues,\" so the probability of an `'s'` following `'chai'` is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters `'chais'` to appear, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. But in this notebook we stick to the simple unsmoothed model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start with \"F\". Why is that? Because the training data happens to start with the line \"First Citizen:\", and so when we call `generate_text`, we start with an empty history, and the only thing that follows the empty history is the letter \"F\". We could get more variety in the generated text by breaking the training text up into separate sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of characters. (A sketch of this idea appears below.)\n", "\n", "We can give a starting text to `generate_text` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference."
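, "\n", "\n", "Before trying that, here is a hypothetical sketch of the \"separate sections\" idea mentioned above (we don't use it elsewhere in this notebook). It assumes sections are separated by blank lines, and the name `train_LM_by_section` and its `sep` parameter are purely illustrative; the idea is to restart the history at each section boundary, so that the empty history is followed by many different starting characters:\n", "\n", "```python\n", "def train_LM_by_section(data: str, order: int, sep='\\n\\n') -> LanguageModel:\n", "    \"\"\"Hypothetical variant of train_LM: restart the history at each section,\n", "    so that the empty history maps to many possible starting characters.\"\"\"\n", "    LM = LanguageModel(Counter)\n", "    LM.order = order\n", "    for section in data.split(sep):\n", "        history = ''  # a fresh start for each section\n", "        for c in section:\n", "            LM[history][c] += 1\n", "            history = (history + c)[-order:]\n", "    return LM\n", "```\n", "\n", "With such a model, generated texts could begin in many different places rather than always with \"F\". But we'll stick with the original `LM` and simply give it a starting text:"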
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ROMEO:\n", "The kill not my come.\n", "\n", "FALSTAFF:\n", "No, good, sister, whereforeson, and strive merry Blance; and by the like a heart of mind;\n", "And him sough they bodied;\n", "The could thy made know are corona's and proved,\n", "To fresh are did call'd Messenge you into\n", "termity.\n", "On they found\n", "they.\n", "\n", "Firstling our such a score you a\n", "touch'd,\n", "I make you them compossessed in dead,\n", "And when the hadst be, thy lament:\n", "Your to doth due that ring, and quiet not be fetch hear: but stop. All\n", "the map o'erween attery seat most wonder of beat imprison want me hear to the general we wick outward's he gentre, doth receive doom; and forth\n", "Do you know'd cond,\n", "And root us madam, yours of with her of it is.\n", "\n", "SHALLOW:\n", "'Swound and cry a bravel your land, crystally carry with than and present they display\n", "Is no remembrass, each and monkey, thrive does wife countain,\n", "We will we marry, I\n", "shall have I never ask, the reason thes:\n", "He's good for me: thou and me good to bedded:\n", "Again, thee, death made best.\n", "\n", "MARK ANTONIO:\n", "Amen, stays to\n" ] } ], "source": [ "print(generate_text(LM, text='ROMEO:'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linux Kernel C++ Code\n", "\n", "Goldberg's point is that the simple character-level model performs about as well as the much more complex RNN model on Shakespearean text. But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 241465 759639 6206997 linux_input.txt\n" ] } ], "source": [ "! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt\n", "linux = open(\"linux_input.txt\").read()\n", "! wc linux_input.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 C++\n", "\n", "We'll start with an order-10 model, and compare that to an order-20 model. WEe'll generate a longer text, because sometimes a 1000-character text ends up being just one long comment." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include