{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
Peter Norvig
2019, revised 2024
Based on Yoav Goldberg's 2015 notebook
\n", "\n", "# The Effectiveness of Generative Language Models\n", "\n", "This notebook is an expansion of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level *n*-gram language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recurrent neural network (RNN) language models. \n", "\n", "The term [**generative AI**](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text). \n", "\n", "In 2015 Karpathy's point was that recurrent neural networks were unreasonably effective at generating good text, even though they are at heart rather simple. Goldberg's point was that, yes, they are effective, but actually most of the magic is not in the RNNs, it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg and Karpathy agree that the RNN captures some aspects of C++ code that the simpler model does not. My point is to update the decade-old Python code, and make a few enhancements.\n", "\n", "\n", "## Definitions\n", "\n", "- A **generative language model** is a model that, when given an initial text, can predict what tokens come next; it can generate a continuation of a partial text. (And when the initial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*t* | *h*), the probability distribution that the next token will be *t*, given a history of previous tokens *h*. The probability distribution is estimated by looking at a training corpus of text.\n", "- A **token** is a unit of text. 
In a character model, \"walking\" would be 7 tokens, one for each letter, while in a word model it would be one token, and in other models it might be two tokens (\"walk\", \"ing\").\n", "- A generative model stands in contrast to a **discriminative model**, such as an email spam filter, which can discriminate between spam and non-spam, but can't be used to generate a new sample of spam.\n", "- An **n-gram model** is a generative model that estimates the probability of *n*-token sequences. For example, a 5-gram character model would be able to say that given the previous 4 characters `'chai'`, the next character might be `'r'` or `'n'` (to form `'chair'` or `'chain'`). A 5-gram model is also called a [Markov model](https://en.wikipedia.org/wiki/Markov_model) of **order** 4, because it maps from the 4 previous tokens to the next token.\n", "- A **recurrent neural network (RNN) model** is more powerful than an *n*-gram model, because it contains memory units that allow it to retain some information from more than *n* tokens in the past. See Karpathy for [details](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n", "- Current **large language models** such as ChatGPT, Claude, and Gemini use a more powerful model called a [transformer](https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29). Karpathy has [an introduction](https://www.youtube.com/watch?v=zjkBMFhNj_g&t=159s).\n", "\n", "## Training Data\n", "\n", "A language model learns probabilities by counting token subsequences in a corpus of text that we call the **training data**. \n", "\n", "Both Karpathy and Goldberg use the works of Shakespeare as their initial training data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Fetch the file if it does not already exist here\n", "! 
[ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4,573,338 characters and 832,301 words:\n", "\n", "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Speak, speak.\n", "\n", "First Citizen:\n", "You are all resolved rather to die than to famish?\n", "\n", "All:\n", "Resolved. resolved.\n", "\n", "First Citizen:\n", "First, you...\n" ] } ], "source": [ "shakespeare: str = open(\"shakespeare_input.txt\").read()\n", "\n", "print(f'{len(shakespeare):,d} characters and {len(shakespeare.split()):,d} words:\\n\\n{shakespeare[:200]}...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Code for *n*-Gram Language Model\n", "\n", "I'll start with some imports and simple definitions:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import random\n", "from typing import *\n", "from collections import defaultdict, Counter, deque\n", "\n", "type Token = str # Datatype to represent a token (a character or word)\n", "\n", "cat = ''.join # Function to concatenate strings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I define the class `LanguageModel`:\n", "- A `LanguageModel` is a subclass of `defaultdict` that maps a history of length *order* tokens to a `Counter` of next tokens.\n", " - The tokens in the history are concatenated together into one string to form the keys of the LanguageModel.\n", "- The `__init__` method sets the order of the model and optionally accepts tokens of training data. \n", "- The `train` method builds up the `{history: Counter(next_token)}` mapping from the training data.\n", "- The `generate` method random samples `length` tokens from the mapping. \n", "- The `gen` method is a convenience function to call `generate` and print the results." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class LanguageModel(defaultdict): \n", " \"\"\"A mapping of {'history': Counter(next_token)}.\"\"\"\n", " \n", " def __init__(self, order: int, tokens=()):\n", " \"\"\"Set the order of the model, and optionally initialize it with some tokens.\"\"\"\n", " self.order = order\n", " self.default_factory = Counter # Every history entry has a Counter of tokens\n", " self.train(tokens)\n", "\n", " def train(self, tokens):\n", " \"\"\"Go through the tokens, building the {'history': Counter(next_tokens)} mapping.\"\"\"\n", " history = deque(maxlen=self.order) # History keeps at most `order` tokens\n", " for token in tokens:\n", " self[cat(history)][token] += 1\n", " history.append(token)\n", " return self\n", "\n", " def generate(self, length=1000, start=()) -> List[Token]:\n", " \"\"\"Generate a random text of `length` tokens, from a sequence of `start` tokens.\n", " At each step, consider the previous `self.order` tokens and randomly sample the next token.\"\"\"\n", " tokens = list(start)\n", " while len(tokens) < length:\n", " history = cat(tokens[-self.order:])\n", " tokens.append(random_token(self[history]))\n", " return tokens\n", "\n", " def gen(self, length=1000, start=()) -> None:\n", " \"\"\"Call generate and print the resulting tokens.\"\"\"\n", " print(cat(self.generate(length, start)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll need a function to randomly select a next token from one of the model's Counters:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def random_token(counter: Counter) -> Token:\n", " \"\"\"Randomly sample a token from a Counter, with probability proportional to each token's count.\"\"\"\n", " return random.choices(list(counter), weights=list(counter.values()), k=1)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a character-level language model of 
order 4 on the Shakespeare data. We'll call the language model `LM`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "LM = LanguageModel(4, shakespeare)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples of what's in the model:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'n': 78, 'r': 35})" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"chai\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `\"chai\"` is followed by either `'n'` or `'r'`. In contrast, almost any letter can follow `\"the \"`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Counter({'s': 2058,\n", " 'w': 1759,\n", " 'c': 1561,\n", " 'm': 1392,\n", " 'p': 1360,\n", " 'f': 1258,\n", " 'b': 1217,\n", " 'd': 1170,\n", " 't': 1109,\n", " 'g': 1037,\n", " 'h': 1029,\n", " 'l': 1006,\n", " 'r': 804,\n", " 'k': 713,\n", " 'e': 704,\n", " 'n': 616,\n", " 'a': 554,\n", " 'o': 530,\n", " 'v': 388,\n", " 'i': 298,\n", " 'q': 171,\n", " 'D': 146,\n", " 'y': 120,\n", " 'u': 105,\n", " 'L': 105,\n", " 'F': 103,\n", " 'T': 102,\n", " 'j': 99,\n", " 'C': 81,\n", " 'E': 77,\n", " 'G': 75,\n", " 'M': 54,\n", " 'P': 54,\n", " 'R': 45,\n", " 'S': 40,\n", " 'B': 31,\n", " 'J': 30,\n", " 'A': 29,\n", " 'K': 22,\n", " 'H': 20,\n", " 'V': 18,\n", " 'N': 18,\n", " 'I': 14,\n", " 'W': 14,\n", " \"'\": 10,\n", " 'Q': 7,\n", " 'z': 6,\n", " 'O': 3})" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"the \"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Shakespeare\n", "\n", "We can generate a random text from the order 4 model:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, 
"jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Fishes come, trouble annot have the here reign's every madness, it is repart on hath than of that were is time little be faint of that came of a monstands his on and the wonderstandiscoverfluous nest ask again! thou should writ than we, I'll his good Mercules; your\n", "sonneur, my good is no make me, yet were here is very us;\n", "And, nobler the more at me not his preport,\n", "Such moved on:\n", "But not by my duke\n", "To business: pleasure no moral bed.\n", "Harry noble an end\n", "Do more that were I have do behind,\n", "I go to judgment, and he as he,'that I have come.\n", "Julius, and Penthorough at lame you helps as this for, 'tis not\n", "right i' the earth Boling it fing, sir.\n", "\n", "HAMLET:\n", "All:\n", "Their me,\n", "And yet you speak; I strong;\n", "But were subject his his pride up a throw\n", "One way;\n", "Still ye well me an enemy,\n", "I will sick, cause,\n", "And cut of they saithful necess, if God!\n", "For than you how approves compound in Gloucester tribute\n", "A grave?\n", "\n", "BURGUNDY:\n", "Thy good with lessenger with done is\n", "own deep; ha!\n", "Will meat great lady troubl\n" ] } ], "source": [ "LM.gen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order 4 captures the structure of plays, mentions some characters, and generates mostly English words. But the words don't always go together to form grammatical sentences, and there is certainly no coherence or plot. \n", "\n", "## Generating Order 7 Shakespeare\n", "\n", "What if we increase the model to order 7? Or 10? The output gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler *n*-gram model." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Neighbours shall we in our enemies.\n", "\n", "First Lord:\n", "Behind this crown'd, Laertes,\n", "Will your sake, speak of African;\n", "Where is three.\n", "\n", "BIRON:\n", "Why, I prithee, take me mad:\n", "Hark, how our company.\n", "\n", "MISTRESS FORD:\n", "\n", "All:\n", "Our dukedoms.\n", "The break with\n", "you.\n", "\n", "LEONTES:\n", "Say you this true at first inducement.\n", "\n", "IAGO:\n", "There is set for't! And now my deeds be grieves her question.\n", "\n", "LEONATO:\n", "Brother with thy state, and here together with usurping her still\n", "And looks, bid her contemn'd revolt: this night;\n", "For, in his tears not found the players cannot, take this.\n", "\n", "CADE:\n", "\n", "DICK:\n", "My heart that in this, till thou thus, sir. Fare you were so;\n", "To disprove to hear from the man.\n", "\n", "ARIEL:\n", "I pray you, the gates;\n", "And makes a sun and disjoin'd penitent head of thine;\n", "With fear; my master, sir, no; the presupposed\n", "Upon the fiend's wrong\n", "And fertile land of the quern\n", "And ruminate tender'd herself: he shall stay at home;\n", "And chides wrong.\n", "I will we bite our castle:\n", "She died,\n", "That were out by the painful, and \n", "CPU times: user 2.48 s, sys: 46.8 ms, total: 2.53 s\n", "Wall time: 2.53 s\n" ] } ], "source": [ "%time LanguageModel(7, shakespeare).gen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 Shakespeare" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Woe to the hearts of men\n", "The thing I am forbid;\n", "Or study where I had and have it; and much more ease; 
for so I have.\n", "\n", "Second Lord:\n", "He had no other death concludes but what thou wilt.\n", "How now, Simple! where have you taste of thy abhorr'd ingredients of our loves again,\n", "Alike betwitched by the Frenchman his companion of the gesture\n", "One might interpreter, you might pardon him, sweet father, do you here? things that I bought mine own.\n", "\n", "KING RICHARD II:\n", "A lunatic lean-witted fools\n", "The way twice o'er, I'll weep. O fool, I shall be publish'd, and\n", "Her coronation-day,\n", "When Bolingbroke ascends my throne of France upon his sudden seem,\n", "I would be the first house, our story\n", "What we have o'erheard\n", "Your royal grace!\n", "\n", "DUKE VINCENTIO:\n", "Have after. To what end he gave me fresh garments must not then respective lenity,\n", "And all-to topple: pure surprised:\n", "Guard her till this osier cage of ours\n", "Were nice and continents! what mutiny!\n", "What raging of their emperor\n", "And to conclude to hate me.\n", "\n", "KI\n" ] } ], "source": [ "LanguageModel(10, shakespeare).gen()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probabilities and Smoothing\n", "\n", "Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*token* | *history*) could be computed as follows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def P(token, history, LM: LanguageModel): \n", " \"\"\"The probability that token follows history.\"\"\"\n", " return LM[history][token] / sum(LM[history].values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the probability that the letter \"n\" follows the four letters \"chai\"?" 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6902654867256637" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai', LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about the letter \"s\"?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai', LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shakespeare never wrote about \"chaise longues,\" or \"chai tea\" so the probability of an \"s\" or space following \"chai\" is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters \"chais\" or \"chai \" to appear anywhere in a text, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. 
A simple type of smoothing is \"add-one smoothing\"; it assumes that if we have counted *N* tokens following a given history, then the probability of an unseen token is 1 / (*N* + 1), and the probabilities for the previously-seen tokens are reduced accordingly (dividing by *N* + 1 instead of *N*):" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def P(t, h, LM: LanguageModel): \n", " \"\"\"The probability that token t follows history h, using add-one smoothing.\"\"\"\n", " N = sum(LM[h].values())\n", " return max(1, LM[h][t]) / (N + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That gives us:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.008771929824561403" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai', LM)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6842105263157895" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai', LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start with the word \"First\". Why is that? Because the training data happens to start with the line \"First Citizen:\", and so when we call `generate`, we start with an empty history, and the only thing that follows the empty history in the training data is the letter \"F\", the only thing that follows \"F\" is \"i\", and so on, until we get to a point where there are multiple choices. We could get more variety in the start of the generated text by breaking the training text up into multiple sections, so that each section would contribute a different possible starting point. 
But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of tokens/characters.\n", "\n", "We can give a starting text to `generate` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference. For example, the following won't make the model generate a story about Romeo:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ROMEO:\n", "Is the rash they news, and\n", "the night too her never, tie here queen,\n", "And, in lady store known\n", "in practises;\n", "Whose thy master thought;\n", "Then for my rose, for nothese him that honour body and I now nights have to make with noted.\n", "\n", "ANTIPHOLUS OF EPHEN SCROOP:\n", "Gloucested, and dote: marry, were cours, I deputes our this, and hurt was. Yet, could probations, I hear me, Rosal to chard,\n", "Which thy error offer'd you parce Will bears,\n", "The nature and how prever my mont: after come to bear the that worthly to Cyprus.--Help, ho! how\n", "did:\n", "If guard this daughter,\n", "Were up ruled for sister, be hour, than with mouth, my patient you my hence Helent denuncle, la, had for thank the farth\n", "The gate:\n", "My prither business. He behold, in would not ruins. What, assage of fles, that were this,\n", "How it is not thence; for the matter. Thaisanio;\n", "Why, where Christial of a bay,\n", "But of steeds the not me save you owest be dog-day,\n", "His slain If, a black, you die against the compey, of thine own the riots do lustry, may\n" ] } ], "source": [ "LM.gen(start='ROMEO')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linux Kernel C++ Code\n", "\n", "Goldberg's point is that the simple character-level n-gram model performs about as well as the more complex RNN model on Shakespearean text. 
\n", "\n", "But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 241465 759639 6206997 linux_input.txt\n" ] } ], "source": [ "! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt\n", "! wc linux_input.txt" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "linux = open(\"linux_input.txt\").read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 C++\n", "\n", "We'll start with an order-10 character model, and compare that to an order-20 model. We'll generate a longer text, because sometimes 1000 characters ends up being just one long comment." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include \n", "#include \n", "\n", "struct snapshot_device_available space if\n", " * such as list traversal.\n", " */\n", "struct sigpending *list, siginfo_t info;\n", "\tif (!ret)\n", "\t\trb->aux_priv = event->pmu = pmu;\n", "\n", "\treturn BUF_PAGE_SIZE(slots)\t\t\t\\\n", "\t(offsetof(struct printk_log) + msg->text_len;\n", "\tif (path)\n", "\t\taudit_log_n_untrustedstring(ab, \"remove_rule\");\n", "\t\t\tlist_add(&nt->entry, list);\n", "\tcpu_buffer->irq_work.work, rb_wake_up_waiters - wake up the first child if it\n", "\t\t * is tracked on a waitqueue_head_t *bit_waitqueue_head_init(&rt_rq->rt_nr_boosted = 0;\n", "\tret = test_jprobe();\n", "\tif (err)\n", "\t\t\tgoto process\n", " * @len: max length to calculate_period(event);\n", "\tprint_ip_sym(s, *p, flags);\n", "\tvoid 
(*write_delay,\n", "\t.writeunlock\t= torture_random(trsp) % (cxt.nrealwriters_stress = cxt.nrealwriters_stress >= 0)\n", "\t\t\tmark_reg_unknown_value(regs);\n", "\n", "\tcurrent->lockdep_depth = curr->lockdep_depth; i++) {\n", "\t\tif (KDB_FLAG(CMD_INTERRUPT)) {\n", "\t\t/* We need to keep it from\n", "\t\t * the current thread will park without risk of ENOMEM.\n", " */\n", "struct cgroup *cgrp;\n", "\n", "\tif (kprobe_unused(&op->kp));\n", "\t\tlist_del(&use->target->name);\n", "\t\tif (ret) {\n", "\t\tpr_info(\"%sconsole [%s%d] disabled\\n\",\n", "\t\t\t\t\t\tt_sector(ent),\n", "\t\t\t MAJOR(n->rdev));\n", "\tif (!new_base->cpu_base->clock_base =\n", "\t{\n", "\t\t{\n", "\t\t\t.index = HRTIMER_NORESTART;\n", "}\n", "\n", "void task_numa_env *env)\n", "{\n", "\tif (autogroup_kref_put(prev);\n", "}\n", "\n", "static void ptrace_notify(int signr)\n", "{\n", "\tstruct module *mod = list_entries); /* Out-of-bounds values?\n", "\t */\n", "\tfor (i = 0; i < __nenv; i++) {\n", "\t\t\t\tif (s2->pid == pid)\n", "\t\treturn;\n", "\t}\n", "\n", "\tif (attr == &dev_attr_current_device *dev, void *regs)\n", "{\n", "\tint oldstate;\n", "\tint\t\t\tindex;\n", "};\n", "\n", "/**\n", " * platform_mode)\n", "{\n", "\tif (p->policy == SETPARAM_POLICY)\n", "\t\tpolicy = policy;\n", "\n", "\tif (dl_policy(int policy)\n", "{\n", "\treturn -ENODEV;\n", "\n", "\treturn 1;\n", "\t\t}\n", "\t}\n", "\n", "\ti = (unsigned long pid;\n", "\t\tif (get_user(parser, ubuf, cnt)) {\n", "\t\tINIT_LIST_HEAD(&event->list, list) {\n", "\t\tif (context->proctitle.len;\n", "out:\n", "\tmutex_unlock(&autosleep_init(void)\n", "{\n", "\trcu_lockdep_assert_held(&env->src_rq, as\n", " * part of the buffer pages */\n", "\tINIT_LIST_HEAD(&image->dest_pages, lru) {\n", "\t\taddr = last_addr;\n", "\t\tradix = last_radix = radix;\n", "\n", "\tif (bytes == 0)\n", "\t\treturn 0;\n", "\n", "\ts = pending->signal.sig;\n", "\tm = mask->sig;\n", "\n", "\t/*\n", "\t * It is possible to record\n", " * @iter: The 
iterator in this case we don't even try to\n", "\t * accumalator (@total) and @count\n", " */\n", "\n", "static const struct filter_pred_fn_t) (struct timespec __user *rmtp = restart->nanosleep.expires);\n", "\t\t\tbreak;\n", "\t\t\tprintk_deferred(\" PADATA_CPU_PARALLEL;\n", "\tret = 0;\n", "\t\t\tif (count >= PAGE_SIZE);\n", "\n", "#define TRACE_FTRACE_MAX);\n", "\tif (!pd->squeue, cpu);\n", "\t}\n", "}\n", "\n", "static int cgroup_wq_init(void)\n", "{\n", "#if defined(CONFIG_RCU_FANOUT_3\n", "# define NUM_RCU_LVL_3 + NUM_RCU_LVL_4\t 0\n", "#endif\n", "\n", "/* Location of the percpu refcnts are configured by writing them to write to.\n", " */\n", "\n", "static int min_offline = delta;\n", "\t\t}\n", "\t} while (sg != env->sd->nr_balance_failed)\n", "\t\tforce_sig_info(sig, info, p, 1, 0);\n", "\t\t\tgoto err;\n", "#endif\n", "\n", "\traw_spin_lock_init(void)\n", "{\n", "\ttrace_latency_header(s);\n", "}\n", "#else\n", "static int __init setup_relax_\n" ] } ], "source": [ "LanguageModel(10, linux).gen(length=3000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Order 20 C++" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel/irq/autoprobe.c\n", " *\n", " * Copyright (C) 2009 Red Hat, Inc., Ingo Molnar \n", " *\n", " * Contributors at various stages not listed above:\n", " * Jason Wessel ( jason.wessel@windriver.com>\n", " *\n", " * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar\n", " *\n", " * This file contains macros used solely by rtmutex.c. 
Debug version.\n", " */\n", "\n", "extern void\n", "rt_mutex_deadlock_account_lock(lock, proxy_owner);\n", "\trt_mutex_set_owner(lock);\n", "\t\tmutex_acquire(&lock->dep_map, subclass, 0, nest_lock, ip);\n", "\n", "\tif (mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ctx)\n", "{\n", "\tstruct mutex_waiter *waiter);\n", "extern void debug_rt_mutex_init_waiter(&waiter);\n", "\tRB_CLEAR_NODE(&rt_waiter.tree_entry);\n", "\n", "\traw_spin_lock(&this_rq->lock);\n", "\n", "\tupdate_blocked_averages(cpu);\n", "\n", "\trcu_read_lock();\n", "\tp = find_task_by_vpid(pid);\n", "\t\tif (p)\n", "\t\t\terr = posix_cpu_clock_get,\n", "\t.timer_create\t= pc_timer_create,\n", "\t.timer_set\t= pc_timer_settime,\n", "\t.timer_del\t= pc_timer_delete,\n", "\t.timer_get\t= pc_timer_gettime,\n", "};\n", "#ifndef _CONSOLE_CMDLINE_H\n", "#define _CONSOLE_CMDLINE_H\n", "\n", "struct console_cmdline\n", "{\n", "\tchar\tname[16];\t\t\t/* Name of the driver\t */\n", "\tint\tindex;\t\t\t\t/* Minor dev. 
to use\t */\n", "\tchar\t*options;\t\t\t/* Options for braille driver */\n", "#endif\n", "};\n", "\n", "#endif\n", "/* audit_watch.c -- watching inodes\n", " *\n", " * Copyright 2003-2007 Red Hat Inc., Durham, North Carolina.\n", " * Copyright 2005 Hewlett-Packard Development Company, L.P.\n", " *\n", " * Authors: Waiman Long \n", " */\n", "#include \n", "\n", "/*\n", " * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and\n", " * CONFIG_RCU_FANOUT;\n", "\t} else {\n", "\t\tint ccur;\n", "\t\tint cprv;\n", "\n", "\t\tcprv = nr_cpu_ids;\n", "\t\tfor (i = rcu_num_lvls - 1; i >= 0; i--) {\n", "\t\tif (pipesummary[i] != 0)\n", "\t\t\tbreak;\n", "\t}\n", "\n", "\tpr_alert(\"%s%s \", torture_type, TORTURE_FLAG);\n", "\tpr_cont(\"Free-Block Circulation: \");\n", "\tfor (i = 0; i < TVN_SIZE; i++) {\n", "\t\tmigrate_timer_list(new_base, old_base->tv1.vec + i);\n", "\tfor (i = 0; i < length; i++) {\n", "\t\t\tunsigned long rlim_rtprio =\n", "\t\t\t\t\ttask_rlimit(p, RLIMIT_RTPRIO);\n", "\n", "\t\t\t/* can't set/change the rt policy */\n", "\t\t\tif (policy != p->policy && !rlim_rtprio)\n", "\t\t\t\treturn -EPERM;\n", "\t\t}\n", "\t}\n", "\t/* nothing invalid, do the changes */\n", "\tfor (i = 0; i < csn; i++) {\n", "\t\tstruct cpuset *a = csa[i];\n", "\t\tint apn = a->pn;\n", "\n", "\t\tfor (j = 0; j < chain->depth - 1; j++, i++) {\n", "\t\t\tint lock_id = curr->held_locks[i].class_idx - 1;\n", "\t\t\tchain_hlocks[chain->base + i];\n", "}\n", "\n", "/*\n", " * Look up a dependency chain. 
If the key is not present yet then\n", " * add it and return 1 - in this case the new dependency does not connect a hardirq-safe\n", "\t * lock with a hardirq-unsafe lock (in the full forwards-subgraph starting at :\n", "\t */\n", "\tif (!check_usage(curr, prev, next, bit,\n", "\t\t\t exclusive_bit(bit), state_name(bit)))\n", "\t\treturn 0;\n", "\n", "\tret = eligible_child(wo, p);\n", "\tif (!ret)\n", "\t\treturn 0;\n", "\n", "\t/*\n", "\t * Clockevents returns -ETIME, when the event was scheduled out.\n", "\t */\n", "\tif (ctx->task && cpuctx->task_ctx != ctx)\n", "\t\treturn -EINVAL;\n", "\n", "\t/* same value, noting to do */\n", "\tif (timer == pmu->hrtimer_interval_ms)\n", "\t\treturn count;\n", "\n", "\tpmu->hrtimer_interval_ms = timer;\n", "\n", "\n" ] } ], "source": [ "LanguageModel(20, linux).gen(length=3000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis of Generated Linux Text\n", "\n", "As Goldberg says, \"Order 10 is pretty much junk.\" But order 20 is much better. Most of the comments have a start and an end; most of the open parentheses are balanced with a close parenthesis; but the braces are not as well balanced. That shouldn't be surprising. If the span of an open/close parenthesis pair is less than 20 characters then it can be represented within the model, but if the span of an open/close brace pair is more than 20 characters, then it cannot be represented by the model. Goldberg notes that Karpathy's RNN seems to have learned to devote some of its long short-term memory (LSTM) to representing nesting level, as well as things like whether we are currently within a string or a comment. 
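The balance claim is easy to spot-check. Here is a minimal sketch; the `sample` string is a hypothetical stand-in for generated output, not actual model output:

```python
def net_balance(text: str, open_ch: str, close_ch: str) -> int:
    # Net count of opening minus closing delimiters (0 means the counts balance)
    return text.count(open_ch) - text.count(close_ch)

sample = 'if (err) { pr_info(oops); return; '  # truncated mid-block, like much generated code
print(net_balance(sample, '(', ')'))  # parentheses balance: prints 0
print(net_balance(sample, '{', '}'))  # one unclosed brace:  prints 1
```

Counting delimiters this way only checks totals, not proper nesting, but it is enough to see the parenthesis/brace asymmetry in a generated sample.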
It is indeed impressive, as Karpathy says, that the model learned to do this on its own, without any input from the human engineer.\n", "\n", "## Character Models versus Word and Token Models\n", "\n", "Karpathy and Goldberg both used character models, because the exact formatting of characters (especially indentation and line breaks) is important in the format of plays and C++ programs. But if you are interested in generating paragraphs of text that don't have any specific format, it is common to use a **word** model, which represents the probability of the next word given the previous words, or a **token** model in which tokens can be words, punctuation, or parts of words. For example, the text `\"Spiderman!\"` might be broken up into the three tokens `\"Spider\"`, `\"man\"`, and `\"!\"`. \n", "\n", "One simple way of tokenizing a text is to break it up into alternating strings of word and non-word characters; the function `tokenize` does that:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "tokenize = re.compile(r'\\w+|\\W+').findall # Find all alternating word- or non-word strings" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "assert tokenize('Soft! who comes here?') == [\n", " 'Soft', '! ', 'who', ' ', 'comes', ' ', 'here', '?']\n", "\n", "assert tokenize('wherefore art thou ') == [\n", " 'wherefore', ' ', 'art', ' ', 'thou', ' ']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can train a token language model on the Shakespeare data. A model of order 6 keeps a history of up to three word and three non-word tokens. 
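As a quick illustration of why order 6 corresponds to about three words of context, here is a small sketch (using an ASCII character-class version of the word/non-word pattern, to keep it self-contained):

```python
import re

# ASCII character-class version of the word/non-word tokenizer
tokenize_ascii = re.compile('[A-Za-z0-9_]+|[^A-Za-z0-9_]+').findall

history = tokenize_ascii('and wherefore art thou Romeo?')[-6:]
print(history)  # prints ['art', ' ', 'thou', ' ', 'Romeo', '?']
```

The last six tokens interleave three word tokens with three non-word tokens, so the model's history covers roughly three words.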
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "TLM = LanguageModel(6, tokenize(shakespeare))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'Romeo': 1})" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM['wherefore art thou ']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'stars': 1, 'Grecian': 1})" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM['not in our ']" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'life': 1, 'business': 1, 'dinner': 1, 'time': 1})" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM['end of my ']" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({' ': 2})" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM[' end of my']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see below that the quality of the token models is similar to character models, and improves from 6 tokens to 8:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "TIMON:\n", "Freely, good father.\n", "\n", "Old Athenian:\n", "Thou hast a sister by the mother's, from the top to toe?\n", "\n", "MARCELLUS:\n", "My lord, upon the platform where we watch'd.\n", "\n", "HAMLET:\n", "Did you not tell me, Griffith, as thou led'st me,\n", "That the great body of our kingdom\n", "How foul it is; what rank diseases grow\n", "And with what zeal! 
for, now he has crack'd the league, and hath attach'd\n", "Our merchants' goods at Bourdeaux.\n", "\n", "ABERGAVENNY:\n", "Is it therefore\n", "The ambassador is silenced?\n", "\n", "NORFOLK:\n", "Marry, is't.\n", "\n", "ABERGAVENNY:\n", "A proper title of a peace; and purchased\n", "At a superfluous rate!\n", "\n", "BUCKINGHAM:\n", "Why, all this business\n", "Our reverend cardinal carried.\n", "\n", "NORFOLK:\n", "Like it your grace,\n", "The Breton navy is dispersed by tempest:\n", "Richmond, in Yorkshire, sent out a boat\n", "Unto the shore, to ask those on the banks\n", "If they were known, as the suspect is great,\n", "Would make thee quickly hop without thy head.\n", "Give me my horse, you\n", "rogues; give me my gown; or else keep it in your age.\n", "Then, in the name of something holy, sir, why stand you\n", "In this strange stare?\n", "\n", "ALONSO:\n", "O, \n" ] } ], "source": [ "TLM.gen(400)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Peace, ho! Hear Antony. Most noble Antony!\n", "\n", "ANTONY:\n", "Why, friends, you go to do you know not what:\n", "Wherein hath Caesar thus deserved your loves?\n", "Alas, you know not: I must tell you that,\n", "Before my daughter told me--what might you,\n", "Or my dear majesty your queen here, think,\n", "If I had play'd the desk or table-book,\n", "Or given my heart a winking, mute and dumb,\n", "Or look'd upon this love with idle sight;\n", "What might you think? 
No, I went round to work,\n", "And my young mistress thus I did bespeak:\n", "'Lord Hamlet is a prince, out of thy star;\n", "This must not be:' and then I precepts gave her,\n", "That she should lock herself from his resort,\n", "Admit no messengers, receive no tokens.\n", "Which done, she took the fruits of my advice;\n", "And he, repulsed--a short tale to make--\n", "Fell into a sadness, then into a fast,\n", "Thence to a watch, thence into a weakness,\n", "Thence to a lightness, and, by this declension,\n", "Into the madness wherein now he raves,\n", "And all we mourn for.\n", "\n", "KING CLAUDIUS:\n", "Do you think 'tis this?\n", "\n", "\n" ] } ], "source": [ "LanguageModel(8, tokenize(shakespeare)).gen(400)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## C++ Token Model\n", "\n", "Similar remarks hold for token models trained on C++ data:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel/irq/autoprobe.c\n", " *\n", " * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar\n", " *\n", " * This file contains the private data structure and API definitions.\n", " */\n", "\n", "#ifndef __KERNEL_RTMUTEX_COMMON_H\n", "#define __KERNEL_RTMUTEX_COMMON_H\n", "\n", "#include \n", "\n", "/*\n", " * The rtmutex in kernel tester is independent of rtmutex debugging. We\n", " * call schedule_rt_mutex_test() instead of schedule() for the tasks which\n", " * belong to the tester. 
That way we can delay the wakeup path of those\n", " * threads to provoke lock stealing and testing of complex boosting scenarios.\n", " */\n", "#ifdef CONFIG_RT_MUTEX_TESTER\n", "\n", "extern void schedule_rt_mutex_test(struct rt_mutex *mutex)\n", "{\n", "\tint tid, op, dat;\n", "\tstruct test_thread_data *td;\n", "\n", "\t/* We have to lookup the task */\n", "\tfor (tid = 0; tid < MAX_RT_TEST_THREADS; tid++) {\n", "\t\tif (threads[tid] == current)\n", "\t\t\tbreak;\n", "\t}\n", "\n", "\tBUG_ON(tid == MAX_RT_TEST_THREADS);\n", "\n", "\ttd = &thread_data[tid];\n", "\n", "\top = td->opcode;\n", "\tdat = td->opdata;\n", "\n", "\tswitch (op) {\n", "\tcase RTTEST_LOCK:\n", "\tcase RTTEST_LOCKINT:\n", "\tcase RTTEST_LOCKNOWAIT:\n", "\tcase RTTEST_LOCKINTNOWAIT:\n", "\t\tif (mutex != &mutexes[dat])\n", "\t\t\tbreak;\n", "\n", "\t\tif (td->mutexes[dat] != 2)\n", "\t\t\treturn;\n", "\n", "\t\ttd->mutexes[dat] = 3;\n", "\t\ttd->event = atomic_add_return(1, &rttest_event);\n", "\t\tbreak;\n", "\n", "\tcase RTTEST_LOCKNOWAIT:\n", "\tcase RTTEST_LOCKINTNOWAIT:\n", "\t\tif (mutex != &mutexes[dat])\n", "\t\t\tbreak;\n", "\n", "\t\tif (td->mutexes[dat] != 1)\n", "\t\t\tbreak;\n", "\n", "\t\ttd->mutexes[dat] = 2;\n", "\t\ttd->event = atomic_add_return(1, &rttest_event);\n", "\t\tbreak;\n", "\n", "\tdefault:\n", "\t\tbreak;\n", "\t}\n", "\n", "\tschedule();\n", "\n", "\n", "\tswitch (op) {\n", "\tcase RTTEST_LOCK:\n", "\tcase RTTEST_LOCKINT:\n", "\t\tif (mutex != &mutexes[dat])\n", "\t\t\treturn;\n", "\n", "\t\tif (td->mutexes[dat] != 2)\n", "\t\t\treturn;\n", "\n", "\t\ttd->mutexes[dat] = 1;\n", "\t\ttd->event = atomic_add_return(1, &rttest_event);\n", "\t\trt_mutex_lock(&mutexes[id]);\n", "\t\ttd->event = atomic_add_return(1, &rttest_event);\n", "\t\tbreak;\n", "\n", "\tcase RTTEST_LOCKNOWAIT:\n", "\tcase RTTEST_LOCKINTNOWAIT:\n", "\t\tif (mutex != &mutexes[dat])\n", "\t\t\treturn;\n", "\n", "\t\tif (td->mutexes[dat] != 2)\n", "\t\t\treturn;\n", "\n", "\t\ttd->mutexes[dat] = 
3;\n", "\t\ttd->event = atomic_add_return(1, &rttest_event);\n", "\t\trt_mutex_unlock(&mutexes[id]);\n", "\t\ttd->event = atomic_add_return(1, &rttest_event);\n", "\t\treturn 0;\n", "\n", "\tcase RTTEST_RESET:\n", "\t\tfor (i = 0; i < AUDIT_BITMASK_SIZE; i++)\n", "\t\tentry->rule.mask[i] = rule->mask[i];\n", "\n", "\tfor (i = 0; i < ALARM_NUMTYPE; i++) {\n", "\t\tstruct alarm_base *base = &alarm_bases[alarm->type];\n", "\tunsigned long flags;\n", "\n", "\tspin_lock_irqsave(&callback_lock, flags);\n", "\trcu_read_lock();\n", "\tguarantee_online_cpus(task_cs(tsk), pmask);\n", "\trcu_read_unlock();\n", "\tspin_unlock_irqrestore(&callback_lock, flags);\n", "}\n", "\n", "void cpuset_cpus_allowed_fallback(struct task_struct *tsk)\n", "{\n", "\trcu_read_lock();\n", "\tdo_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);\n", "\trcu_read_unlock();\n", "\n", "\t/*\n", "\t * We own tsk->cpus_allowed, nobody can change it under us.\n", "\t *\n", "\t * But we used cs && cs->cpus_allowed lockless and thus can\n", "\t * race with cgroup_attach_task() or update_cpumask() and get\n", "\t * the wrong tsk->cpus_allowed. However, both cases imply the\n", "\t * subsequent cpuset_change_cpumask()->set_cpus_allowed_ptr()\n", "\t * which takes task_rq_lock().\n", "\t *\n", "\t * If we are called after it dropped the lock we must see all\n", "\t * changes in tsk_cs()->cpus_allowed. 
Otherwise we can temporary\n", "\t * set any mask even if it is interleaved with any other text.\n", "\t */\n", "\tif (!KDB_STATE(PRINTF_LOCK)) {\n", "\t\tKDB_STATE_SET(PRINTF_LOCK);\n", "\t\tspin_lock_irqsave(&kdb_printf_lock, flags);\n", "\t\tgot_printf_lock = 1;\n", "\t\tatomic_inc(&kdb_event);\n", "\t} else {\n", "\t\t__acquire(kdb_printf_lock);\n", "\t}\n", "\n", "\tdiag = kdbgetintenv(\"LINES\", &linecount);\n", "\tif (diag || linecount <= 1)\n", "\t\tlinecount = 24;\n", "\n", "\tdiag = kdbgetintenv(\"COLUMNS\", &colcount);\n", "\tif (diag || colcount <= 1)\n", "\t\tcolcount = 80;\n", "\n", "\tdiag = kdbgetintenv(\"LOGGING\", &logging);\n", "\tif (!diag && logging) {\n", "\t\tconst char *setargs[] = { \"set\", \"LOGGING\", \"0\" };\n", "\t\tkdb_set(2, setargs);\n", "\t}\n", "\n", "\tkmsg_dump_rewind_nolock(&dumper);\n", "\twhile (kmsg_dump_get_line_nolock(&dumper, 1, NULL, 0, NULL))\n", "\t\tn++;\n", "\n", "\tif (lines < 0) {\n", "\t\tif (adjust >= n)\n", "\t\t\tkdb_printf(\"buffer only contains %d lines, nothing \"\n", "\t\t\t\t \"printed\\n\", n);\n", "\t\telse if (adjust - lines >= n)\n", "\t\t\tkdb_printf(\"buffer only contains %d lines, first \"\n", "\t\t\t\t \"%d lines printed\\n\", n, lines);\n", "\t\t}\n", "\t} else {\n", "\t\tlines = n;\n", "\t}\n", "\n", "\tif (skip >= n || skip < 0)\n", "\t\treturn 0;\n", "\n", "\tkmsg_dump_rewind_nolock(&dumper);\n", "\twhile (kmsg_dump_get_line_nolock(&dumper, 1, NULL, 0, NULL))\n", "\t\tn++;\n", "\n", "\tif (lines < 0) {\n", "\t\tif (adjust >= n)\n", "\t\t\tkdb_printf\n" ] } ], "source": [ "LanguageModel(8, tokenize(linux)).gen(1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": 
"python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }