{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# The Effectiveness of Generative Language Models\n", "\n", "This notebook is an expansion of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level *n*-gram language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recurrent neural network (RNN) language models. \n", "\n", "The term [**generative AI**](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text). \n", "\n", "In 2015 Karpathy's point was that recurrent neural networks were unreasonably effective at generating good text, even though they are at heart rather simple. Goldberg's point was that, yes, they are effective, but actually most of the magic is not in the RNNs, it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg and Karpathy agree that the RNN captures some aspects of C++ code that the simpler model does not. My point is to update the decade-old Python code, and make a few enhancements.\n", "\n", "\n", "## Definitions\n", "\n", "- A **generative language model** is a model that, when given an initial text, can predict what tokens come next; it can generate a continuation of a partial text. (And when the initial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*t* | *h*), the probability distribution that the next token will be *t*, given a history of previous tokens *h*. The probability distribution is estimated by looking at a training corpus of text.\n", "- A **token** is a unit of text. 
In a character model, \"walking\" would be 7 tokens, one for each letter, while in a word model it would be one token, and in other models it might be two tokens (\"walk\", \"ing\").\n", "- A generative model stands in contrast to a **discriminative model**, such as an email spam filter, which can discriminate between spam and non-spam, but can't be used to generate a new sample of spam.\n", "- An **n-gram model** is a generative model that estimates the probability of *n*-token sequences. For example, a 5-gram character model would be able to say that given the previous 4 characters `'chai'`, the next character might be `'r'` or `'n'` (to form `'chair'` or `'chain'`). A 5-gram model is also called a [Markov model](https://en.wikipedia.org/wiki/Markov_model) of **order** 4, because it maps from the 4 previous tokens to the next token.\n", "- A **recurrent neural network (RNN) model** is more powerful than an *n*-gram model, because it contains memory units that allow it to retain some information from more than *n* tokens in the past. See Karpathy for [details](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n", "- Current **large language models** such as ChatGPT, Claude, and Gemini use a more powerful model called a [transformer](https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29). Karpathy has [an introduction](https://www.youtube.com/watch?v=zjkBMFhNj_g&t=159s).\n", "\n", "## Training Data\n", "\n", "A language model learns probabilities by counting token subsequences in a corpus of text that we call the **training data**. \n", "\n", "Both Karpathy and Goldberg use the works of Shakespeare as their initial training data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Fetch the file if it does not already exist here\n", "! 
[ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4,573,338 characters and 832,301 words:\n", "\n", "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Speak, speak.\n", "\n", "First Citizen:\n", "You are all resolved rather to die than to famish?\n", "\n", "All:\n", "Resolved. resolved.\n", "\n", "First Citizen:\n", "First, you...\n" ] } ], "source": [ "shakespeare: str = open(\"shakespeare_input.txt\").read()\n", "\n", "print(f'{len(shakespeare):,d} characters and {len(shakespeare.split()):,d} words:\\n\\n{shakespeare[:200]}...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Code for *n*-Gram Language Model\n", "\n", "I'll start with some imports and simple definitions:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import random\n", "from typing import *\n", "from collections import defaultdict, Counter, deque\n", "\n", "type Token = str # Datatype to represent a token (a character or word)\n", "\n", "cat = ''.join # Function to concatenate strings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I define the class `LanguageModel`:\n", "- A `LanguageModel` is a subclass of `defaultdict` that maps a history of length *order* tokens to a `Counter` of next tokens.\n", " - The tokens in the history are concatenated together into one string to form the keys of the LanguageModel.\n", "- The `__init__` method sets the order of the model and optionally accepts tokens of training data. \n", "- The `train` method builds up the `{history: Counter(next_token)}` mapping from the training data.\n", "- The `generate` method randomly samples `length` tokens from the mapping. \n", "- The `gen` method is a convenience function to call `generate` and print the results."
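, "\n", "\n", "Once the class is defined (in the next cell), the `{history: Counter(next_token)}` mapping can be illustrated on a tiny made-up training string (a hypothetical toy example, not part of the notebook's data):\n", "\n", "```python\n", "toy = LanguageModel(2, 'abracadabra')\n", "toy['ab']  # Counter({'r': 2}): in 'abracadabra', the history 'ab' is always followed by 'r'\n", "```"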
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class LanguageModel(defaultdict): \n", " \"\"\"A mapping of {'history': Counter(next_token)}.\"\"\"\n", " \n", " def __init__(self, order: int, tokens=()):\n", " \"\"\"Set the order of the model, and optionally initialize it with some tokens.\"\"\"\n", " self.order = order\n", " self.default_factory = Counter # Every history entry has a Counter of tokens\n", " self.train(tokens)\n", "\n", " def train(self, tokens):\n", " \"\"\"Go through the tokens, building the {'history': Counter(next_tokens)} mapping.\"\"\"\n", " history = deque(maxlen=self.order) # History keeps at most `order` tokens\n", " for token in tokens:\n", " self[cat(history)][token] += 1\n", " history.append(token)\n", " return self\n", "\n", " def generate(self, length=1000, start=()) -> List[Token]:\n", " \"\"\"Generate a random text of `length` tokens, from a sequence of `start` tokens.\n", " At each step, consider the previous `self.order` tokens and randomly sample the next token.\"\"\"\n", " tokens = list(start)\n", " while len(tokens) < length:\n", " history = cat(tokens[-self.order:])\n", " tokens.append(random_token(self[history]))\n", " return tokens\n", "\n", " def gen(self, length=1000, start=()) -> None:\n", " \"\"\"Call generate and print the resulting tokens.\"\"\"\n", " print(cat(self.generate(length, start)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll need a function to randomly select a next token from one of the model's Counters:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def random_token(counter: Counter) -> Token:\n", " \"\"\"Randomly sample a token from a Counter, with probability proportional to each token's count.\"\"\"\n", " return random.choices(list(counter), weights=list(counter.values()), k=1)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a character-level language model of 
order 4 on the Shakespeare data. We'll call the language model `LM`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "LM = LanguageModel(4, shakespeare)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples of what's in the model:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'n': 78, 'r': 35})" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"chai\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `\"chai\"` is followed by either `'n'` or `'r'`. In contrast, almost any letter can follow `\"the \"`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Counter({'s': 2058,\n", " 'w': 1759,\n", " 'c': 1561,\n", " 'm': 1392,\n", " 'p': 1360,\n", " 'f': 1258,\n", " 'b': 1217,\n", " 'd': 1170,\n", " 't': 1109,\n", " 'g': 1037,\n", " 'h': 1029,\n", " 'l': 1006,\n", " 'r': 804,\n", " 'k': 713,\n", " 'e': 704,\n", " 'n': 616,\n", " 'a': 554,\n", " 'o': 530,\n", " 'v': 388,\n", " 'i': 298,\n", " 'q': 171,\n", " 'D': 146,\n", " 'y': 120,\n", " 'u': 105,\n", " 'L': 105,\n", " 'F': 103,\n", " 'T': 102,\n", " 'j': 99,\n", " 'C': 81,\n", " 'E': 77,\n", " 'G': 75,\n", " 'M': 54,\n", " 'P': 54,\n", " 'R': 45,\n", " 'S': 40,\n", " 'B': 31,\n", " 'J': 30,\n", " 'A': 29,\n", " 'K': 22,\n", " 'H': 20,\n", " 'V': 18,\n", " 'N': 18,\n", " 'I': 14,\n", " 'W': 14,\n", " \"'\": 10,\n", " 'Q': 7,\n", " 'z': 6,\n", " 'O': 3})" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"the \"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Shakespeare\n", "\n", "We can generate a random text from the order 4 model:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, 
"jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Fishes come, trouble annot have the here reign's every madness, it is repart on hath than of that were is time little be faint of that came of a monstands his on and the wonderstandiscoverfluous nest ask again! thou should writ than we, I'll his good Mercules; your\n", "sonneur, my good is no make me, yet were here is very us;\n", "And, nobler the more at me not his preport,\n", "Such moved on:\n", "But not by my duke\n", "To business: pleasure no moral bed.\n", "Harry noble an end\n", "Do more that were I have do behind,\n", "I go to judgment, and he as he,'that I have come.\n", "Julius, and Penthorough at lame you helps as this for, 'tis not\n", "right i' the earth Boling it fing, sir.\n", "\n", "HAMLET:\n", "All:\n", "Their me,\n", "And yet you speak; I strong;\n", "But were subject his his pride up a throw\n", "One way;\n", "Still ye well me an enemy,\n", "I will sick, cause,\n", "And cut of they saithful necess, if God!\n", "For than you how approves compound in Gloucester tribute\n", "A grave?\n", "\n", "BURGUNDY:\n", "Thy good with lessenger with done is\n", "own deep; ha!\n", "Will meat great lady troubl\n" ] } ], "source": [ "LM.gen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order 4 captures the structure of plays, mentions some characters, and generates mostly English words. But the words don't always go together to form grammatical sentences, and there is certainly no coherence or plot. \n", "\n", "## Generating Order 7 Shakespeare\n", "\n", "What if we increase the model to order 7? Or 10? The output gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler *n*-gram model." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Neighbours shall we in our enemies.\n", "\n", "First Lord:\n", "Behind this crown'd, Laertes,\n", "Will your sake, speak of African;\n", "Where is three.\n", "\n", "BIRON:\n", "Why, I prithee, take me mad:\n", "Hark, how our company.\n", "\n", "MISTRESS FORD:\n", "\n", "All:\n", "Our dukedoms.\n", "The break with\n", "you.\n", "\n", "LEONTES:\n", "Say you this true at first inducement.\n", "\n", "IAGO:\n", "There is set for't! And now my deeds be grieves her question.\n", "\n", "LEONATO:\n", "Brother with thy state, and here together with usurping her still\n", "And looks, bid her contemn'd revolt: this night;\n", "For, in his tears not found the players cannot, take this.\n", "\n", "CADE:\n", "\n", "DICK:\n", "My heart that in this, till thou thus, sir. Fare you were so;\n", "To disprove to hear from the man.\n", "\n", "ARIEL:\n", "I pray you, the gates;\n", "And makes a sun and disjoin'd penitent head of thine;\n", "With fear; my master, sir, no; the presupposed\n", "Upon the fiend's wrong\n", "And fertile land of the quern\n", "And ruminate tender'd herself: he shall stay at home;\n", "And chides wrong.\n", "I will we bite our castle:\n", "She died,\n", "That were out by the painful, and \n", "CPU times: user 2.48 s, sys: 46.8 ms, total: 2.53 s\n", "Wall time: 2.53 s\n" ] } ], "source": [ "%time LanguageModel(7, shakespeare).gen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 Shakespeare" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Woe to the hearts of men\n", "The thing I am forbid;\n", "Or study where I had and have it; and much more ease; 
for so I have.\n", "\n", "Second Lord:\n", "He had no other death concludes but what thou wilt.\n", "How now, Simple! where have you taste of thy abhorr'd ingredients of our loves again,\n", "Alike betwitched by the Frenchman his companion of the gesture\n", "One might interpreter, you might pardon him, sweet father, do you here? things that I bought mine own.\n", "\n", "KING RICHARD II:\n", "A lunatic lean-witted fools\n", "The way twice o'er, I'll weep. O fool, I shall be publish'd, and\n", "Her coronation-day,\n", "When Bolingbroke ascends my throne of France upon his sudden seem,\n", "I would be the first house, our story\n", "What we have o'erheard\n", "Your royal grace!\n", "\n", "DUKE VINCENTIO:\n", "Have after. To what end he gave me fresh garments must not then respective lenity,\n", "And all-to topple: pure surprised:\n", "Guard her till this osier cage of ours\n", "Were nice and continents! what mutiny!\n", "What raging of their emperor\n", "And to conclude to hate me.\n", "\n", "KI\n" ] } ], "source": [ "LanguageModel(10, shakespeare).gen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probabilities and Smoothing\n", "\n", "Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*token* | *history*) could be computed as follows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def P(token, history, LM: LanguageModel): \n", " \"\"\"The probability that token follows history.\"\"\"\n", " return LM[history][token] / sum(LM[history].values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the probability that the letter \"n\" follows the four letters \"chai\"?"
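, "\n", "\n", "As a quick sanity check on `P` (an added aside, using only the model built above): the probabilities of all tokens ever seen after a given history should sum to 1.\n", "\n", "```python\n", "import math\n", "math.isclose(sum(P(t, 'chai', LM) for t in LM['chai']), 1.0)  # True\n", "```"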
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6902654867256637" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai', LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about the letter \"s\"?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai', LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shakespeare never wrote about \"chaise longues,\" or \"chai tea\" so the probability of an \"s\" or space following \"chai\" is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters \"chais\" or \"chai \" to appear anywhere in a text, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. 
A simple type of smoothing is \"add-one smoothing\"; it assumes that if we have counted *N* tokens following a given history, then the probability of an unseen token is 1 / (*N* + 1), and the probabilities for the previously-seen tokens are reduced accordingly (dividing by *N* + 1 instead of *N*):" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def P(t, h, LM: LanguageModel): \n", " \"\"\"The probability that token t follows history h, using add-one smoothing.\"\"\"\n", " N = sum(LM[h].values())\n", " return max(1, LM[h][t]) / (N + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That gives us:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.008771929824561403" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai', LM)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6842105263157895" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai', LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start with the word \"First\". Why is that? Because the training data happens to start with the line \"First Citizen:\", and so when we call `generate`, we start with an empty history, and the only thing that follows the empty history in the training data is the letter \"F\", the only thing that follows \"F\" is \"i\", and so on, until we get to a point where there are multiple choices. We could get more variety in the start of the generated text by breaking the training text up into multiple sections, so that each section would contribute a different possible starting point. 
But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of tokens/characters.\n", "\n", "We can give a starting text to `generate` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference. For example, the following won't make the model generate a story about Romeo:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ROMEO:\n", "Is the rash they news, and\n", "the night too her never, tie here queen,\n", "And, in lady store known\n", "in practises;\n", "Whose thy master thought;\n", "Then for my rose, for nothese him that honour body and I now nights have to make with noted.\n", "\n", "ANTIPHOLUS OF EPHEN SCROOP:\n", "Gloucested, and dote: marry, were cours, I deputes our this, and hurt was. Yet, could probations, I hear me, Rosal to chard,\n", "Which thy error offer'd you parce Will bears,\n", "The nature and how prever my mont: after come to bear the that worthly to Cyprus.--Help, ho! how\n", "did:\n", "If guard this daughter,\n", "Were up ruled for sister, be hour, than with mouth, my patient you my hence Helent denuncle, la, had for thank the farth\n", "The gate:\n", "My prither business. He behold, in would not ruins. What, assage of fles, that were this,\n", "How it is not thence; for the matter. Thaisanio;\n", "Why, where Christial of a bay,\n", "But of steeds the not me save you owest be dog-day,\n", "His slain If, a black, you die against the compey, of thine own the riots do lustry, may\n" ] } ], "source": [ "LM.gen(start='ROMEO')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linux Kernel C Code\n", "\n", "Goldberg's point is that the simple character-level n-gram model performs about as well as the more complex RNN model on Shakespearean text. 
\n", "\n", "But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 241465 759639 6206997 linux_input.txt\n" ] } ], "source": [ "! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt\n", "! wc linux_input.txt" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "linux = open(\"linux_input.txt\").read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 C++\n", "\n", "We'll start with an order-10 character model, and compare that to an order-20 model. We'll generate a longer text, because sometimes 1000 characters ends up being just one long comment." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include