{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# The Effectiveness of Generative Language Models\n", "\n", "This notebook is an expansion of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level *n*-gram language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recurrent neural network (RNN) language models. \n", "\n", "The term [**generative AI**](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text). \n", "\n", "In 2015 Karpathy's point was that recurrent neural networks were unreasonably effective at generating good text, even though they are at heart rather simple. Goldberg's point was that, yes, they are effective, but actually most of the magic is not in the RNNs, it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg and Karpathy agree that the RNN captures some aspects of C++ code that the simpler model does not.\n", "\n", "\n", "## Definitions\n", "\n", "- A **generative language model** is a model that, when given an initial text, can predict what tokens come next; it can generate a continuation of a partial text. (And when the initial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*t* | *h*), the probability distribution that the next token will be *t*, given a history of previous tokens *h*. The probability distribution is estimated by looking at a training corpus of text.\n", "\n", "- A **token** is a unit of text. It can be a single character (as covered by Karpathy and Goldberg) or more generally it can be a word or a part of a word (as allowed in my implementation).\n", "\n", "- A generative model stands in contrast to a **discriminative model**, such as an email spam filter, which can discriminate between spam and non-spam, but can't be used to generate a new sample of spam.\n", "\n", "\n", "- An ***n*-gram model** is a generative model that estimates the probability of *n*-token sequences. For example, a 5-gram character model would be able to say that given the previous 4 characters `'chai'`, the next character might be `'r'` or `'n'` (to form `'chair'` or `'chain'`). A 5-gram model is also called a model of **order** 4, because it maps from the 4 previous tokens to the next token.\n", "\n", "- A **recurrent neural network (RNN) model** is more powerful than an *n*-gram model, because it contains memory units that allow it to retain information from more than *n* tokens. See Karpathy for [details](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n", "\n", "- Current **large language models** such as Chat-GPT, Claude, Llama, and Gemini use an even more powerful model called a [transformer](https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29). Karpathy has [an introduction](https://www.youtube.com/watch?v=zjkBMFhNj_g&t=159s).\n", "\n", "## Training Data\n", "\n", "A language model learns probabilities by observing a corpus of text that we call the **training data**. 
\n", "\n", "Both Karpathy and Goldberg use the works of Shakespeare (about 800,000 words) as their initial training data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 167204 832301 4573338 shakespeare_input.txt\n" ] } ], "source": [ "! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt\n", "! wc shakespeare_input.txt \n", "# Print the number of lines, words, and characters" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Speak, speak.\n", "\n", "First Citizen:\n", "You are all resolved rather to die than to famish?\n" ] } ], "source": [ "! head -8 shakespeare_input.txt \n", "# First 8 lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Code for n-Gram Model\n", "\n", "I do some imports and then define two data types:\n", "- A `Token` is an individual unit of text, a string of one or more characters.\n", "- A `LanguageModel` is a subclass of `defaultdict` that maps a history of length `order` tokens to a `Counter` of the number of times each token appears immediately following the history in the training data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import random\n", "from typing import *\n", "from collections import defaultdict, Counter\n", "\n", "Token = str # Datatype to represent a token\n", "\n", "class LanguageModel(defaultdict): \n", " \"\"\"A mapping of {'history': Counter(next_token)}.\"\"\"\n", " def __init__(self, order):\n", " self.order = order\n", " super().__init__(Counter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I define two main functions that do essentially all the work:\n", "\n", "- `train_LM` takes a sequence of tokens (the training data) and an integer `order`, and builds a language model, formed by counting the times each token *t* occurs and storing that under the entry for the history *h* of `order` tokens that precede *t*. \n", "- `generate_tokens` generates a random sequence of tokens. At each step it looks at the history of previously generated tokens and chooses a new token at random from the language model's counter for that history." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def train_LM(tokens, order: int) -> LanguageModel:\n", " \"\"\"Create and train a language model of given order on the given tokens.\"\"\"\n", " LM = LanguageModel(order)\n", " history = []\n", " for token in tokens:\n", " LM[cat(history)][token] += 1\n", " history = (history + [token])[-order:] \n", " return LM\n", "\n", "def generate_tokens(LM: LanguageModel, length=1000, start=()) -> List[Token]:\n", " \"\"\"Generate a random text of `length` tokens, with an optional start, from `LM`.\"\"\"\n", " tokens = list(start)\n", " while len(tokens) < length:\n", " history = cat(tokens[-LM.order:])\n", " tokens.append(random_sample(LM[history]))\n", " return tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are three auxiliary functions:\n", "- `gen` is a convenience function to call `generate_tokens`, concatenate the resulting tokens, and print them.\n", "- `random_sample` randomly chooses a single token from a Counter, with probability in proportion to its count.\n", "- `cat` is a utility function to concatenate strings (tokens) into one big string." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def gen(LM: LanguageModel, length=1000, start=()) -> None:\n", " \"\"\"Call generate_tokens and print the resulting tokens.\"\"\"\n", " print(cat(generate_tokens(LM, length, start)))\n", " \n", "def random_sample(counter: Counter) -> Token:\n", " \"\"\"Randomly sample a token from the counter, proportional to each token's count.\"\"\"\n", " return random.choices(list(counter), weights=list(counter.values()), k=1)[0]\n", "\n", "cat = ''.join # Function to join strings together" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a character-level language model of order 4 on the Shakespeare data. We'll call the model `LM`. 
(Note that saying `tokens=data` means that the sequence of tokens is equal to the sequence of characters in `data`; in other words each character is a token.)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "data = open(\"shakespeare_input.txt\").read()\n", "\n", "LM = train_LM(tokens=data, order=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples of what's in the model, for various 4-character histories:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'n': 78, 'r': 35})" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"chai\"]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'n'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_sample(LM[\"chai\"])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Counter({'p': 1360,\n", " 's': 2058,\n", " 'l': 1006,\n", " 'o': 530,\n", " 'g': 1037,\n", " 'c': 1561,\n", " 'a': 554,\n", " 'C': 81,\n", " 'r': 804,\n", " 'h': 1029,\n", " 'R': 45,\n", " 'd': 1170,\n", " 'w': 1759,\n", " 'b': 1217,\n", " 'm': 1392,\n", " 'v': 388,\n", " 't': 1109,\n", " 'f': 1258,\n", " 'i': 298,\n", " 'n': 616,\n", " 'V': 18,\n", " 'e': 704,\n", " 'u': 105,\n", " 'L': 105,\n", " 'y': 120,\n", " 'A': 29,\n", " 'H': 20,\n", " 'k': 713,\n", " 'M': 54,\n", " 'T': 102,\n", " 'j': 99,\n", " 'q': 171,\n", " 'K': 22,\n", " 'D': 146,\n", " 'P': 54,\n", " 'S': 40,\n", " 'G': 75,\n", " 'I': 14,\n", " 'B': 31,\n", " 'W': 14,\n", " 'E': 77,\n", " 'F': 103,\n", " 'O': 3,\n", " \"'\": 10,\n", " 'z': 6,\n", " 'J': 30,\n", " 'N': 18,\n", " 'Q': 7})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"the \"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `\"chai\"` is followed by either `'n'` or `'r'`, and almost any letter can follow `\"the \"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Shakespeare\n", "\n", "We can generate a random text from the order-4 character model:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First.\n", "\n", "ROSALIND:\n", "Warwick, but all house, stillow far into my ghost\n", "prescience,\n", "And will make the emperous in you gods!--One: that\n", "Is ther,\n", "that we stranglish, where assure, sir; a be king I will be accusers kind.\n", "\n", "SIR HUGH EVANS:\n", "Four grace\n", "By so ye\n", "As the admit trustice; but my wantony of thou did; and the shape Hectors:\n", "Borest long.\n", "Thou Marchs of him: his honour?\n", "I this? hath is feed.\n", "\n", "PANDARUS:\n", "You anon her for you.\n", "\n", "OCTAVIUS CAESAR:\n", "You have saw to be a Jew, Orland would now welcomes no little prison! 
love the rest.\n", "\n", "MARTIUS CAESAR:\n", "Well, and the patience is adverthrowing me, I shall thee go thee whom I than to such and thy absolved\n", "it, the ques\n", "Weep a slighty sound,\n", "Lychorse.\n", "\n", "DESDEMONA:\n", "This let us,\n", "And so\n", "On you, follow and epitation.\n", "\n", "FALSTAFF:\n", "Tell down.\n", "These will luck'd out thee will teach of my apparis it, take of honest\n", "Then for adors; the customary shot of that's place upstarved, ruin throne.\n", "\n", "LAUNCE:\n", "None lowly death,\n", "Hath Parising upon all, bles, the live.\n", "\n", "VIOLANUS:\n", "\n" ] } ], "source": [ "gen(LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order 4 captures the structure of plays, mentions some characters, and generates mostly English words. But the words don't always go together to form grammatical sentences, and there is certainly no coherence or plot. \n", "\n", "## Generating Order 7 Shakespeare\n", "\n", "What if we increase it to order 7? Or 10? We find that the output gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler *n*-gram model." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Which priest, Camillo,\n", "I conjurations.\n", "\n", "GOWER:\n", "Here's Wart; thou at supper in't; and heart: but for that she shunn'd,\n", "Nature of that are so many countryman, there's\n", "another he folly could have spoke with me?\n", "\n", "VINCENTIO:\n", "You cannot cog, I can scarce any joy\n", "Did ever saw a smith stand by the earth\n", "I' the pedlar;\n", "Money's a merited. These gloves are not palter it.\n", "\n", "DUKE OF YORK:\n", "I will go or tarrying you.\n", "\n", "PRINCE HENRY:\n", "Well, young son of Herne the shore the top,\n", "And sigh a note for a king in the lean past your ambitious is the question: he swords:\n", "Thanks, Rosencrantz and gentlemen: they are comes here.\n", "Then, let grass grow dim. Fare you married Juliet?\n", "\n", "LUCIANA:\n", "Say, I would pardon. So it must be annoyance that thou shall fair praise\n", "Have I something the lowest plant;\n", "Whose in our army and unyoke.\n", "\n", "BEDFORD:\n", "'Twas very much content,\n", "Clarence, be assure you will.\n", "\n", "COMINIUS:\n", "One word more: I am out--\n", "That canst thou speak of Naples; 'twixt\n", "The flight, never\n", "Have writ to his ca\n" ] } ], "source": [ "gen(train_LM(data, order=7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 Shakespeare" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Why answer not?\n", "\n", "ADRIANA:\n", "By thee; and put a tongue,\n", "That in civility and patience waiteth on true sorrow.\n", "And see where comes another man, if I live.\n", "\n", "TITUS ANDRONICUS:\n", "Titus, prepare yourself in Egypt?\n", "\n", "Soothsayer:\n", "Beware the foul fiend! 
five\n", "fiends have brought to be commander of the night gone by.\n", "\n", "OCTAVIUS CAESAR:\n", "O Antony, beg not your offence:\n", "If you accept of grace: and from thy bed, fresh lily,\n", "And when she was gone, but his judgment pattern of mine eyes;\n", "Examine other be, some Florentine,\n", "Deliver'd, both in one key,\n", "As if our hands, our lives--\n", "\n", "MACBETH:\n", "I'll call them all encircle him about the streets?\n", "\n", "First Gentleman:\n", "That is intended drift\n", "Than, by concealing it, heap on your way, which\n", "is the way to Chester; and I would I wear them o' the city.\n", "\n", "Citizens:\n", "No, no, no; not so; I did not think,--\n", "My wife is dead.\n", "Your majesty is\n", "returned with some few bands of chosen shot I had,\n", "That man may question? You seem to have said, my noble and valiant;\n", "For thou has\n" ] } ], "source": [ "gen(train_LM(data, order=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Probabilities and Smoothing\n", "\n", "Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*t* | *h*) can be computed as follows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def P(t, h, LM=LM): \n", " \"The probability that token t follows history h.\"\"\"\n", " return LM[h][t] / sum(LM[h].values())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09286165508528112" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'the ')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6902654867256637" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.30973451327433627" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('r', 'chai')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(' ', 'chai')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shakespeare never wrote about \"chaise longues,\" or \"chai tea\" so the probability of an `'s'` or `' '` following `'chai'` is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters `'chais'` or `'chai '` to appear in a gebnerated text, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. In this notebook we stick to the basic unsmoothed model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start the same. Why is that? 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start the same. Why is that? Because the training data happens to start with the line \"First Citizen:\", and so when we call `generate_tokens`, we start with an empty history, and the only thing that follows the empty history in the training data is the letter \"F\", the only thing that follows \"F\" is \"i\", and so on, until we get to a point where there are multiple choices. We could get more variety in the start of the generated text by breaking the training text up into multiple sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of tokens/characters.\n", "\n", "We can give a starting text to `generate_tokens` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference. For example, the following won't make the model generate a story about Romeo:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ROMEO:\n", "Ay, strong;\n", "A little. Prince of such matten a virtue's states:\n", "Now country's quent again,\n", "A most unacceptre her no leave-thou true shot, and, to\n", "things of mine hath done alehousand pride\n", "For be, whithee thus with of damm'st fly and came which he rank,\n", "Put not wages woman! Harform\n", "So long agon's side,\n", "Singer's me her my slinged! He wit:' quoth Dorse! art than manner\n", "straine.\n", "\n", "AGAMEMNON:\n", "I would says.\n", "\n", "BEATRICE:\n", "Between my secreason ragined our uncrowned justing in thither naturer?\n", "\n", "GLOUCESTER:\n", "The king.\n", "\n", "DOGBERRY:\n", "Do now nothink you\n", "The tired.\n", "\n", "HORATIAN:\n", "It stray\n", "To see, peers, eo me, while; this days this man carbona-mi's, and for\n", "in this late ther's loath,\n", "Diminutes, shepher, sir.\n", "\n", "BARDOLPH:\n", "So.\n", "\n", "PETRUCHIO:\n", "'Tis Angels upon me?\n", "Hath no fare on,\n", "what's the\n", "moods.\n", "\n", "KING RICHARLES:\n", "Thou odourse first Servantas, given in the life' thee, he same world I have spirit,\n", "Or lord Phoebus' Corne.\n", "\n", "KENT:\n", "Go you shafts that satisfied; leaven blession what charge of a this: for.\n", "\n", "ACHIMO:\n", "'Zou\n" ] } ], "source": [ "gen(LM, start='ROMEO')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linux Kernel C Code\n", "\n", "Goldberg's point is that the simple character-level *n*-gram model performs about as well as the more complex RNN model on Shakespearean text. \n", "\n", "But Karpathy also trained an RNN on Linux-kernel C code. Let's see what we can do with a 6-megabyte sample of kernel source as training data." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 241465 759639 6206997 linux_input.txt\n" ] } ], "source": [ "! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt\n", "! wc linux_input.txt" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "linux = open(\"linux_input.txt\").read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 C\n", "\n", "We'll start with an order-10 character model, and compare that to an order-20 model. 
We'll generate a longer text, because sometimes 1000 characters ends up being just one long comment." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include