{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
Peter Norvig
2019, revised Jan 2024
Based on Yoav Goldberg's 2015 notebook
\n", "\n", "# Generative Character-Level Language Models\n", "\n", "This is a variant of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recursive neural network (RNN) language models. The term [generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text) based on a model learned from training data. Back in 2015 generative AI was just starting to take off, and Karpathy's point was that the RNNs were unreasonably effective at generating good text, even though they are at heart quite simple. Goldberg's point was that, yes, that's true, but actually most of the magic is not in the RNNs, it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg did agree with Karpathy that the RNN captures some aspects of C++ code that the character-level model does not.\n", "\n", "My implementation is similar to Goldberg's, but I updated his code to use Python 3 instead of Python 2, and made some additional changes for simplicity and clarity. (This makes the code less efficient than it could be, but plenty fast enough.) \n", "\n", "## Definition\n", "\n", "What do we mean by a **generative character-level language model**? It means a model that, when given a sequence of characters, can predict what character comes next; it can generate a continuation of a partial text. (And when the partial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*c* | *h*), the probability distribution that the next character will be *c*, given a history of previous characters *h*. For example, given the previous characters `'chai'`, a character-level model should learn to predict that the next character is probably `'r'` or `'n'` (to form the word `'chair'` or `'chain'`). Goldberg calls this a model of order 4 (because it considers histories of length 4) while other authors call it an *n*-gram model with *n* = 5 (because it represents the probabilities of sequences of 5 characters).\n", "\n", "## Training Data\n", "\n", "How does the language model learn these probabilities? By observing a sequence of characters that we call the **training data**. Both Karpathy and Goldberg use the complete works of Shakespeare as their initial training data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 167204 832301 4573338 shakespeare_input.txt\n" ] } ], "source": [ "! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt\n", "! wc shakespeare_input.txt # Print the number of lines, words, and characters" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Speak, speak.\n", "\n", "First Citizen:\n", "You are all resolved rather to die than to famish?\n", "\n", "All:\n" ] } ], "source": [ "! 
head shakespeare_input.txt # First 10 lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Code\n", "\n", "There are four main parts to the code:\n", "\n", "- `LanguageModel` is a `defaultdict` that maps a history *h* to a `Counter` of the number of times each character *c* appears immediately following *h* in the training data. \n", "- `train_LM` takes a string of training `data` and an `order`, and builds a language model, formed by counting the times each character *c* occurs and storing that under the entry for the history *h* of characters that precede *c*. \n", "- `generate_text` generates a random text, given a language model, a desired length, and an optional start of the text. At each step it looks at the previous `order` characters and chooses a new character at random from the language model's counter for those previous characters.\n", "- `random_sample` randomly chooses a single character from a counter, with each possibility chosen in proportion to the character's count." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import random\n", "from collections import defaultdict, Counter\n", "\n", "class LanguageModel(defaultdict): \"\"\"A mapping of {history: Counter(characters)}.\"\"\"\n", "\n", "def train_LM(data: str, order: int) -> LanguageModel:\n", " \"\"\"Train a character-level language model of given `order` on the training `data`.\"\"\"\n", " LM = LanguageModel(Counter)\n", " LM.order = order\n", " history = ''\n", " for c in data:\n", " LM[history][c] += 1\n", " history = (history + c)[-order:] # add c to history; truncate history to length `order`\n", " return LM\n", "\n", "def generate_text(LM: LanguageModel, length=1000, text='') -> str:\n", " \"\"\"Generate a random text of `length` characters, with an optional start, from `LM`.\"\"\"\n", " while len(text) < length:\n", " history = text[-LM.order:]\n", " text = text + random_sample(LM[history])\n", " return text\n", "\n", "def random_sample(counter: Counter) -> str:\n", " \"\"\"Randomly sample from the counter, proportional to each entry's count.\"\"\"\n", " i = random.randint(1, sum(counter.values()))\n", " cumulative = 0\n", " for c in counter:\n", " cumulative += counter[c]\n", " if cumulative >= i: \n", " return c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a model of order 4 on the Shakespeare data. 
We'll call the model `LM`, and we'll do some queries of it:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = open(\"shakespeare_input.txt\").read()\n", "\n", "LM = train_LM(data, order=4)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'n': 78, 'r': 35})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"chai\"]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Counter({'p': 1360,\n", " 's': 2058,\n", " 'l': 1006,\n", " 'o': 530,\n", " 'g': 1037,\n", " 'c': 1561,\n", " 'a': 554,\n", " 'C': 81,\n", " 'r': 804,\n", " 'h': 1029,\n", " 'R': 45,\n", " 'd': 1170,\n", " 'w': 1759,\n", " 'b': 1217,\n", " 'm': 1392,\n", " 'v': 388,\n", " 't': 1109,\n", " 'f': 1258,\n", " 'i': 298,\n", " 'n': 616,\n", " 'V': 18,\n", " 'e': 704,\n", " 'u': 105,\n", " 'L': 105,\n", " 'y': 120,\n", " 'A': 29,\n", " 'H': 20,\n", " 'k': 713,\n", " 'M': 54,\n", " 'T': 102,\n", " 'j': 99,\n", " 'q': 171,\n", " 'K': 22,\n", " 'D': 146,\n", " 'P': 54,\n", " 'S': 40,\n", " 'G': 75,\n", " 'I': 14,\n", " 'B': 31,\n", " 'W': 14,\n", " 'E': 77,\n", " 'F': 103,\n", " 'O': 3,\n", " \"'\": 10,\n", " 'z': 6,\n", " 'J': 30,\n", " 'N': 18,\n", " 'Q': 7})" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"the \"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `\"chai\"` is followed by either `'n'` or `'r'`, and almost any letter can follow `\"the \"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Shakespeare\n", "\n", "Let's try to generate random text based on character language models of various orders, starting with order 4." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First\n", "five men crown tribunes an aunt, holden a suddenly daught.\n", "\n", "HECTOR CAIUS:\n", "And drawn by your confess, the jangle such a conferritoriest an make a cost, as you were the world,--\n", "\n", "BENVOLIO:\n", "Where shorter:\n", "A stom old been;\n", "Get you may parts food;\n", "I serve memory her. He is come fire to the\n", "skirted great knowledges,\n", "monster Ajax, thou do thy heart to spend theat--unhappy in and so! There shall spectar? Goodman! we are mine an heart in then\n", "The stomach times bear too: the emperformane\n", "And least,\n", "And then you are my grate and as\n", "A woman, I cannon down!' 'Course of my\n", "love,\n", "The tillo, away heard\n", "You soul issue us comes hand,\n", "To Julius, that pattering teach thither\n", "that for come in\n", "a fathere growned from far\n", "Crying, from yoursed into this.\n", "\n", "SILVIA:\n", "Wilt before.\n", "\n", "PAULINA:\n", "Might him: but Marshall be my fail age in fat, remember than arms? calls and the compulsion liar came him. 
If thy is bled this ever; or your tempts,\n", "Open an of think it from him our changentleman, more ther titless them th\n" ] } ], "source": [ "print(generate_text(LM))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order 4 captures the structure of plays, mentions some characters, and generates mostly English words, although they don't always go together to form grammatical sentences, and there is certainly no coherence or plot. \n", "\n", "## Generating Order 7 Shakespeare\n", "\n", "What if we increase it to order 7? Or more? We find that it gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler model." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Clown:\n", "Like an ass of France to kill horns;\n", "And Brutus, and in such a one as he weariness does any strange fish! Were I adore.' When we arrives him not the time, whither argument?\n", "\n", "MARIA:\n", "Get thee on.\n", "\n", "SIR TOBY BELCH:\n", "Why, 'tis well.\n", "\n", "CLOTEN:\n", "Sayest trusts to your royal graces,\n", "I will draw his heinous and holiness\n", "Than are to breath.\n", "\n", "ISABELLA:\n", "Madam, pardon me: teach you, sirs, be it lying so, yet but the 'ever' last?\n", "\n", "EDWARD:\n", "An oath in it to bid you. You a lover dearly to our roses;\n", "For intercepted pardon him,\n", "And even now\n", "In any branches, wherefore let us go seek him:\n", "There's a good master; thyself a wise men,\n", "Let him when you are\n", "going to his entering\n", "into so quickly.\n", "Which all bosom as a bell,\n", "Remember thee who I am. Good Paulina more.' And in an hour?\n", "\n", "ORLANDO:\n", "As I wear\n", "In the eastern gate, horse!\n", "Do but he hath astonish thee apt;\n", "And this pardon me, I conjure them:\n", "To show more offering in saying them, whose beauty starves the night\n", "Did Jessica:\n", "Besides, Antony. But art \n" ] } ], "source": [ "print(generate_text(train_LM(data, order=7)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 Shakespeare" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "That cannot go but thirty miles to ride yet ere day.\n", "\n", "PUCK:\n", "Now the pleasure of the realm in farm.\n", "\n", "LORD WILLOUGHBY:\n", "And daily graced by an inkhorn mate,\n", "We and our power,\n", "Let us see:\n", "Write, 'Lord have mercies\n", "More than all his creature in her, you may\n", "say they be not take my plight shall lie\n", "His old betrothed lord.\n", "\n", "URSULA:\n", "She's limed, I warrant;\n", "speciously on him;\n", "Lose not so near:\n", "I had rather be at a breakfast to the abject rear,\n", "O'er-run and trampled on: then what is this law?\n", "\n", "First Murderer:\n", "What speech, my lord\n", "For certain, and is gone aboard a\n", "new ship to purge him of the affected.\n", "\n", "PRINCE:\n", "Give me a copy of the forlorn French!\n", "Him I forgive thee,\n", "Unnatural though the very life\n", "Of my dear friend Leonato hath\n", "invited you all. 
I tell him we shall have 'em\n", "Talk us to silence.\n", "\n", "ANNE:\n", "You can do better yet\n", "And show the increasing in love?\n", "\n", "LUCETTA:\n", "That they travail for, if it were not virtue, not\n", "For such proceeding by the way\n", "Should have both the parties of suspic\n" ] } ], "source": [ "print(generate_text(train_LM(data, order=10)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Probabilities\n", "\n", "Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*c* | *h*) can be computed as follows:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def P(c, h, LM=LM): \n", " \"\"\"The probability that character c follows history h.\"\"\"\n", " return LM[h][c] / sum(LM[h].values())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09286165508528112" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'the ')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6902654867256637" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.30973451327433627" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('r', 'chai')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shakespeare never wrote about \"chaise longues,\" so the probability of an `'s'` following `'chai'` is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters `'chais'` to appear, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. But in this notebook we stick to the simple unsmoothed model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start with \"F\". Why is that? Because the training data happens to start with the line \"First Citizen:\", and so when we call `generate_text`, we start with an empty history, and the only thing that follows the empty history is the letter \"F\". We could get more variety in the generated text by breaking the training text up into separate sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of characters. (A lightweight sketch of this idea appears below.)\n", "\n", "We can give a starting text to `generate_text` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference."
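] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before trying that out, here is a lightweight sketch of the \"separate sections\" idea. It is an addition to this notebook, not part of Goldberg's or Norvig's original code. It assumes only that sections of the training text are separated by blank lines, and the helper name `random_section_start` is invented here. Instead of retraining on sections, it picks the characters that follow a randomly chosen blank line and hands them to `generate_text` as a starting text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random, re\n", "\n", "def random_section_start(data: str, order: int) -> str:\n", "    \"\"\"Return the `order` characters that follow a randomly chosen blank line, to use as\n", "    a starting text. (A sketch; assumes blank lines separate sections of the training data.)\"\"\"\n", "    starts = [m.end() for m in re.finditer('\\n\\n', data) if m.end() + order <= len(data)]\n", "    i = random.choice(starts)\n", "    return data[i:i + order]\n", "\n", "# For example (the output is random, and need not start with 'F'):\n", "# print(generate_text(LM, text=random_section_start(data, LM.order)))"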
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ROMEO:\n", "The kill not my come.\n", "\n", "FALSTAFF:\n", "No, good, sister, whereforeson, and strive merry Blance; and by the like a heart of mind;\n", "And him sough they bodied;\n", "The could thy made know are corona's and proved,\n", "To fresh are did call'd Messenge you into\n", "termity.\n", "On they found\n", "they.\n", "\n", "Firstling our such a score you a\n", "touch'd,\n", "I make you them compossessed in dead,\n", "And when the hadst be, thy lament:\n", "Your to doth due that ring, and quiet not be fetch hear: but stop. All\n", "the map o'erween attery seat most wonder of beat imprison want me hear to the general we wick outward's he gentre, doth receive doom; and forth\n", "Do you know'd cond,\n", "And root us madam, yours of with her of it is.\n", "\n", "SHALLOW:\n", "'Swound and cry a bravel your land, crystally carry with than and present they display\n", "Is no remembrass, each and monkey, thrive does wife countain,\n", "We will we marry, I\n", "shall have I never ask, the reason thes:\n", "He's good for me: thou and me good to bedded:\n", "Again, thee, death made best.\n", "\n", "MARK ANTONIO:\n", "Amen, stays to\n" ] } ], "source": [ "print(generate_text(LM, text='ROMEO:'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linux Kernel C++ Code\n", "\n", "Goldberg's point is that the simple character-level model performs about as well as the much more complex RNN model on Shakespearean text. But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 241465 759639 6206997 linux_input.txt\n" ] } ], "source": [ "! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt\n", "linux = open(\"linux_input.txt\").read()\n", "! wc linux_input.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 C++\n", "\n", "We'll start with an order-10 model, and compare that to an order-20 model. WEe'll generate a longer text, because sometimes a 1000-character text ends up being just one long comment." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "#include \n", "#include \t\t/* try_to_freeze();\n", "\n", "\t\tif (hlock->references) {\n", "\t\thlock_curr;\n", "\tint cpu;\n", "\n", "\t/* initiate RCU priority unchanged. 
Otherwise just see\n", "\t * if we get it wrong the load-balancer moves */\n", "\tupdate_sched_clock_stable()) {\n", "\t\t*(char **)kp->arg));\t\t\t\\\n", "\t}\t\t\t\t\t\t\t\t\\\n", "\tfor ((cmd) = kdb_base_commands, list) {\n", "\t\tfor (i = 0; i < length; i++)\n", "\t\tseq_printf(m, \"Per CPU device: %d\\n\", ret);\n", "\t\treturn error;\n", "\t}\n", "\n", "\tif (skip_equal && f->op != Audit_equal)\n", "\t\t\treturn 0;\n", "}\n", "\n", "static bool migrated to a second choice node will lead to deadlock detection for find_existing_css_set(struct gcov_iterator *iter)\n", "{\n", "\tif (iter->idx > i)\n", "\t\treturn;\n", "\n", "\tif (graph)\n", "\t\tret = rb_head_page_activate(struct tick_device *tick_get_tick_dev(struct delayed_work(&req->work);\n", "\n", "\treturn 0;\n", "}\n", "\n", "static bool rcu_preempt_qs(void)\n", "{\n", "\tstruct rt_bandwidth;\n", "\n", "\tif (entry->class == data;\n", "}\n", "\n", "/**\n", " * pm_qos_update_request_timeout(\n", "\tvoid *word, int bit)\n", "{\n", "\t__wake_up(wait_queue(¤t->signal->thread_head = (cmd_head == cmd_tail)\n", "\t\treturn -ENOMEM;\n", "\n", "\tmutex_init(&session->stat_root, node) {\n", "\n", "\t\t\t/* exclude other factors [XXX].\n", " *\n", " * The init function_set_filter_inodes(tsk, context, for example, \"1\" if the CPU is\n", " * not large enough to allow coalescing,\n", " * allocators\n", "\t * in the owner.\n", "\t\t * refcount of the sample period a kick. */\n", "\t\t\traw_spin_unlock_commit(struct nosave_region *region;\n", "\n", "\tif (kobj) {\n", "\t\tmk = to_module_attribute *klp_patch_attrs[] = {\n", "\t&dev_attr_pid)\n", "\t\t\tq->blk_trace;\n", "\tstruct file *filp, const char *devname,\n", "\t\t\t int cpu, struct rcu_head *nocb_follower_head);\n", "\n", "\taux_head = local_read(&cpu_buffer->pages = reqd_free_pages;\n", "\twhile ((*nl) != NULL) {\n", "\t\tmemcpy(&entry->caller[6], __entry->caller[7]),\n", "\n", "\tFILTER_OTHER\n", ");\n", "\n", "FTRACE_ENTRIES];\n", "\tsector_t cur_swap;\n", "\treturn retval;\n", "}\n", "\n", "/*\n", " * Load the comments */\n", "\tif (se->on_rq)\n", "\t\taccount_steal_ticks(unsigned long cpu0_err;\n", "\n", "static inline int desc_node(desc), NULL);\n", "\tif (ret) {\n", "\t\tif (mode & HRTIMER_STATS\n", "\tif (likely(uprobe_is_active(uprobe))) {\n", "\t\tret = unregister_ftrace_function_stack_trace_init_module);\n", "\n", "static int __sprint_symbols_seq(struct sched_param param = { .sched_priority == 0 when\n", "\t * part of the group hasn't been updated, and\n", "\t *\tthe other group-sibling):\n", "\t\t */\n", "\t\titer->cpu = cpumask_first(sched_domain *sd)\n", "{\n", "\treturn single_open(file, tracing_cpumask_notifier - unregister(struct task_struct *p)\n", "{\n", "\tstruct cpuset *parent_css = cgroup_parent(cgrp);\n", "\tunsigned long active_timers--;\n", "\t(void)catchup_timer_jiffies;\n", "\tzalloc_cpumask_var(&mask, GFP_KERNEL);\n", "\tif (!buf)\n", "\t\tgoto out;\n", "\t\tif ((unsigned long) (t)->tv_usec)) ? 
-EFAULT : 0;\n", "\n", "\treturn 0;\n", "}\n", "\n", "DEFINE_PER_CPU(struct swap_map_handle *handle, void *buf,\n", "\t\t\t sizeof(files_stat,\n", "\t\t.maxlen\t\t= sizeof(int)))\n", "\t\t\treturn\n" ] } ], "source": [ "print(generate_text(train_LM(linux, order=10), length=3000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Order 20 C++" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel/irq/manage.c\n", " *\n", " * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar\n", " * Copyright(C) 2007, Red Hat, Inc., Ingo Molnar \n", " * Guillaume Chazarain \n", " *\n", " *\n", " * What:\n", " *\n", " * cpu_clock(i) -- can be used from any context, including NMI.\n", " * local_clock() -- is cpu_clock() on the current cpu.\n", " *\n", " * sched_clock_cpu(i)\n", " *\n", " * How:\n", " *\n", " * The implementation either uses sched_clock() when\n", " * !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK\n", "static struct static_key __sched_clock_stable);\n", "}\n", "\n", "static void __maybe_unused rcu_try_advance_all_cbs())\n", "\t\tinvoke_rcu_core(); /* force nohz to see update. */\n", "\t\trdtp->tick_nohz_enabled_snap = tne;\n", "\t\treturn;\n", "\t}\n", "\tif (!needreport)\n", "\t\treturn;\n", "\tif (*firstreport) {\n", "\t\tpr_err(\"INFO: rcu_tasks detected stalls on CPUs/tasks:\",\n", "\t rsp->name);\n", "\tprint_cpu_stall_info(struct rcu_state *rsp, struct rcu_node *rnp_leaf)\n", "{\n", "\tlong mask;\n", "\tstruct rcu_node *rnp);\n", "#ifdef CONFIG_HOTPLUG_CPU\n", "\tbuffer->cpu_notify.notifier_call = rb_cpu_notify;\n", "\tbuffer->cpu_notify.priority = 0;\n", "\t__register_cpu_notifier(struct notifier_block *nb)\n", "{\n", "\treturn raw_notifier_chain_unregister(\n", "\t\t\t\t&munmap_notifier, n);\n", "\t\tbreak;\n", "\t}\n", "\n", "\treturn cpu;\n", "}\n", "\n", "#define RT_PUSH_IPI_RESTART;\n", "\t\trt_rq->push_cpu = src_rq->cpu;\n", "\t}\n", "\n", "\tcpu = find_next_push_cpu(struct rq *rq)\n", "{\n", "\tif (rq->rt.overloaded)\n", "\t\treturn 0;\n", "\n", "\tnext_task = pick_next_pushable_task(rq);\n", "\t\tif (task_cpu(next_task) == rq->cpu && task == next_task) {\n", "\t\t\t/*\n", "\t\t\t * The first thread which returns from do_signal_stop()\n", "\t\t\t * will take ->siglock, notice SIGNAL_CLD_MASK, and\n", "\t\t\t * notify its parent. See get_signal_to_deliver().\n", "\t\t\t */\n", "\t\t\tsignal->flags = why | SIGNAL_STOP_CONTINUED because\n", "\t\t * an intervening stop signal is required to cause two\n", "\t\t * continued events regardless of ptrace.\n", "\t\t */\n", "\t\tif (!(sig->flags & SIGNAL_STOP_STOPPED))\n", "\t\t\tsig->group_exit_code;\n", "\telse if (!thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {\n", "\t\ttsk->flags |= PF_EXITING;\n", "\n", "\tthreadgroup_change_end(current);\n", "\tdelayacct_tsk_free(p);\n", "bad_fork_cleanup_signal;\n", "\tretval = copy_namespaces(clone_flags, p);\n", "\tif (retval)\n", "\t\tgoto out;\n", "\n", "\tretval = sched_setaffinity(pid, new_mask);\n", "\tfree_cpumask_var(cs->cpus_allowed);\n", "\treturn 0;\n", "}\n", "\n", "/*\n", " * kdb_md - This function implements the 'defcmd'\n", " *\tcommand which defines one command as a set of other commands,\n", " *\tterminated by endefcmd. 
kdb_defcmd processes the initial\n", " *\t'defcmd' command, kdb_defcmd2 is invoked from kdb_parse for\n", " *\tthe following commands until 'endefcmd'.\n", " * Inputs:\n", " *\targc\targument count\n", " *\targv\tArgument vector\n", " * Outputs:\n", " *\tNone.\n", " * Returns:\n", " *\tNone.\n", " * Locking:\n", " *\tNone.\n", " * Remarks:\n", " *\n", " *\tbp\tSet breakpoint on all cpus. Only use hardware assist if need.\n", " *\tbph\tSet breakpoint on all cpus. Force hardware register\n", " */\n", "\n", "static int kdb_flags_stack[4], kdb_flags_index;\n", "\n", "void kdb_save_flags(void)\n", "{\n", "\tBUG_ON(kdb_flags_index >= ARRAY_SIZE(kdb_flags_stack));\n", "\tkdb_flags_stack[kdb_flags_index++] = kdb_flags;\n", "}\n", "\n", "void kdb_restore_flags(void)\n", "{\n", "\tBUG_ON(kdb_flags_index <= \n" ] } ], "source": [ "print(generate_text(train_LM(linux, order=20), length=3000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis\n", "\n", "As Goldberg says, \"Order 10 is pretty much junk.\" But order 20 is much better. Most of the comments have a start and an end; most of the open parentheses are balanced with a close parenthesis; but the braces are not as well balanced. That shouldn't be surprising. If the span of an open/close parenthesis pair is less than 20 characters, then it is represented within the model, but if the span of an open/close brace is more than 20 characters, then it cannot be represented by the model. Goldberg notes that Karpathy's RNN seems to have learned to devote some of its long short-term memory (LSTM) state to representing nesting level, as well as things like whether we are currently within a string or a comment. It is indeed impressive, as Karpathy says, that the model learned to do this on its own, without any input from the human engineer.\n", "\n", "## Token Models versus Character Models\n", "\n", "Karpathy and Goldberg both used character models, because the exact formatting of characters (especially indentation and line breaks) is important in both plays and C++ programs. But if you are just interested in running paragraphs of text, it is more common to use a **word** model, which represents the probability of the next word given the previous words, or a **token** model, where a token is something similar to a word. Sometimes a word is broken into several tokens; the word \"dogcatcher\" might become two tokens, \"dog\" and \"catcher.\" One or more characters of punctuation can also form a token. In the implementation below, `train_token_LM` and `generate_token_text` are almost the same as their character-model counterparts, but they deal with a list of tokens rather than a string of characters (however, the history keys of the model are formed by concatenating the tokens together into a single string, in part because lists can't be keys of dicts).\n", "\n", "One simple way of tokenizing a text is to break it up into alternating word and non-word characters; the function `tokenize` does that. But other tokenizers could be used if desired." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "TokenLanguageModel = LanguageModel # e.g. 
{'wherefore art thou ': Counter({'Romeo': 1})}\n", "\n", "cat = ''.join\n", "\n", "def train_token_LM(tokens, order: int) -> TokenLanguageModel:\n", " \"\"\"Train a token-level language model of given `order` on the given `tokens`.\"\"\"\n", " LM = TokenLanguageModel(Counter)\n", " LM.order = order\n", " history = []\n", " for token in tokens:\n", " LM[cat(history)][token] += 1\n", " history = (history + [token])[-order:] \n", " return LM\n", "\n", "def generate_token_text(LM: TokenLanguageModel, length=1000, tokens=()) -> str:\n", " \"\"\"Generate a random text of `length` tokens, with an optional start, from `LM`.\"\"\"\n", " tokens = list(tokens)\n", " while len(tokens) < length:\n", " history = cat(tokens[-LM.order:])\n", " tokens.append(random_sample(LM[history]))\n", " return cat(tokens)\n", "\n", "def tokenize(text: str) -> list: \n", " \"\"\"Break text up into alternating word-character and non-word-character strings.\"\"\"\n", " return re.findall(r'\\w+|\\W+', text)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "assert tokenize('wherefore art thou Romeo?') == ['wherefore', ' ', 'art', ' ', 'thou', ' ', 'Romeo', '?']\n", "assert tokenize(''' */\n", "int probe_irq_off(unsigned long val)\n", "{''') == [' */\\n', 'int', ' ', 'probe_irq_off', '(', 'unsigned', ' ', 'long', ' ', 'val', ')\\n{']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can train a token model on the Shakespeare data. A model of order 6 keeps a history of three word and three non-word tokens (all concatenated together):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "TLM = train_token_LM(tokenize(data), 6)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'Romeo': 1})" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM['wherefore art thou ']" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'stars': 1, 'Grecian': 1})" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM['not in our ']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'life': 1, 'business': 1, 'dinner': 1, 'time': 1})" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TLM['end of my ']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the quality of the token models is similar to character models, and improves as we go from order 6 to order 8:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "CORIOLANUS:\n", "Cut me to pieces, Volsces; men and lads,\n", "Stain all your edges on me. Boy! 
false hound!\n", "If you have told Diana's altar to protest\n", "For aye austerity and single life.\n", "\n", "DEMETRIUS:\n", "Relent, sweet Hermia: and, Lysander, yield\n", "Thy crazed title to my certain right.\n", "\n", "LYSANDER:\n", "You have her father's eyes up close as oak-\n", "He thought 'twas witchcraft--but I am heart-burned an hour after.\n", "\n", "HERO:\n", "He is the half part of a blessed man,\n", "Left to be finished by such as she;\n", "And she again wants nothing, to name want,\n", "If want it be not gone already,\n", "Even at that news he dies; and then the hearts\n", "Of all the world was of my counsel\n", "In my whole course of love, the tidings of her death:\n", "And here he comes in the habit of a light wench: and thereof\n", "comes that the wenches say 'God damn me;' that's as\n", "much to say 'God make me a light. Know we this face or no?\n", "Alas my friend and my dear hap to tell.\n", "\n", "FRIAR LAURENCE:\n", "The grey-eyed morn smiles on the frowning night,\n", "Chequering the eastern clouds with streaks of light,\n", "And flecked darkness like a dream than an assurance\n", "That my remembrance warrants. Had I not reason to prefer mine own?\n", "\n", "VALENTINE:\n", "And I will chain these legs and arms of thine,\n", "That hast by tyranny these many years, and yet\n", "I know 'tis done,\n", "Howe'er my haps, my joys were ne'er acquainted with their wards\n", "Many a bounteous year must be employ'd?\n", "\n", "TITUS ANDRONICUS:\n", "Tut, I have lost my life betimes\n", "Than bring a burthen of dishonour home\n", "By staying there so long till all were told,\n", "The words would add more anguish than the wounds.\n", "O valiant lord, the Duke of Buckingham; now, poor Edward Bohun:\n", "Yet I am doubtful that you have said, my lord.\n", "\n", "FLAVIUS:\n", "\n", "TIMON:\n", "Go you, sir, to spare me, till I have issue o' my body; for\n", "they say barnes are blessings.\n", "\n", "COUNTESS:\n", "Tell me thy reason why thou wilt marry.\n", "\n", "Clown:\n", "My poor body, madam, requires it: I am driven on\n", "by the flesh; and he must needs go in;\n", "Her father will be angry: what hast thou done?\n", "\n", "HAMLET:\n", "Nay, I know not the contents:\n", "Phebe did write it.\n", "\n", "ROSALIND:\n", "Come, come, you are a rare parrot-teacher.\n", "\n", "BEATRICE:\n", "A bird of my tongue.\n", "\n", "Second Lord:\n", "This is your devoted friend, sir, the manifold\n", "linguist and the armipotent soldier.\n", "\n", "BERTRAM:\n", "I could endure any thing before but a cat, and now\n", "he's a mad yeoman that sees his son a gentleman\n", "before him.\n", "\n", "KING LEAR:\n", "To have a thousand loves,\n", "A mother and a brother,\n", "In quest of them, unhappy, lose myself.\n", "Here comes the prince and Claudio hastily.\n", "\n", "DON PEDRO:\n", "Good den, brother.\n", "\n", "DON JOHN:\n", "If it please you: yet Count Claudio may hear; for\n", "what I would \n" ] } ], "source": [ "print(generate_token_text(TLM))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Give me the Lord preserved me long:\n", "To build me the fortunes, beyond his heart,\n", "To stay the villain!\n", "\n", "Second Messenger\n", "That no manner was I crept out of my place beneath.\n", "\n", "HAMLET:\n", "Sir, here that be?\n", "\n", "Clown:\n", "I would you have worn a visor! what costs they have engaol'd my tongue,\n", "That more respective lenity,\n", "To seem to under-bear. 
O, that's dragon-like, awhile.\n", "\n", "Hostess:\n", "A pair so famous college of\n", "wit-crackers of\n", "manners, as you do assistance be only mean\n", "For power,\n", "Bending the ancient trade than you do this?\n", "\n", "BORACHIO:\n", "Yea, every idle, nice custom 'gainst it:\n", "We are to me a stool and dead men's son, sir.\n", "\n", "TITUS ANDRONICUS:\n", "Follow thee again,\n", "And make thee after. When shall thinking too liberal arts\n", "With thought? I have but pinn'd with some mischance he's hurt i' the half-achieved,\n", "As to be cut, and as leaky as an\n", "unstanched thirst\n", "York and Tartar's bower,\n", "Whose wanton lust the senate.\n", "But I shall they\n", "yet look thee, my boy! thy face,\n", "While you perceive\n", "no truth and heath\n" ] } ], "source": [ "print(generate_token_text(train_token_LM(tokenize(data), 8)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## C++ Token Model\n", "\n", "Similar remarks hold for token models trained on C++ data:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include \n", "#include \n", "#include chip->irq_cpu_online(cpu)) {\n", "\t\t/* created separates the end of the sym_name != '_')\n", "\t\t\treturn;\n", "}\n", "\n", "#ifdef COMPAT_RLIM_INFINITY) {\n", "\t\trt_clear_integral.\n", " *\n", " * This gets called when:\n", " * - an unknown object\\n\");\n", "\t\treturn;\n", "\n", " again:\n", "\tadd_time_stats(class);\n", "\tprintk(\"\\nstack backtrace_seq_puts(s, \"\\ttype_len = 0;\n", "\tint dead_cpu: Callback list\n", " * - we are done.\n", "\t */\n", "\tif (!(torture_cleanup has been done or not.\n", " *\n", " * This is used to put_user(r.rlim_max;\n", "}\n", "\n", "/* Clean it up and exit (not in\n", " * case.\n", "\t */\n", "\tdown_write operations = {\n", "\t.func\t\t\t= stack_trace(&trace_ops *op;\n", "\tint len1;\n", "\tint i;\n", "\tint i;\n", "\n", "\tfor (thr = 0; thr < nr_threads; thr++) {\n", "\t\t*q++ = ':';\n", "\t*q++ = ':';\n", "\t*q++ = hex_asc[(error % 10)];\n", "\tpkt[3] = '\\0';\n", "\t\t\tparse_error(ps, FILT_ERR_FIELD_NOT_FOUND:\n", "\t\tif (DST != SRC) {\n", "\t\t\titer++;\n", "\t}\n", "\terr2 = hib_wait_on_bio_chain);\n", "\t\tswitch (m->opcode) {\n", "\tcase CPU_DOWN_PREPARE, %CPU_ONLINE, &cs->flags &= ~PF_USED_ASYNC;\n", "\n", "\tif (e == FALLTHROUGH, env);\n", "\n", "\tif (!axp || axp->pid_count]);\n", "\t\tcommand = s->command[s->count)\n", "\t\treturn 1;\n", "}\n", "\n", "int\n", "_braille_register_begin(). If at least one signal. 
*/\n", "\tacct_account_idle_task(task, NULL);\n", "\n", "\tmutex_unlock();\n", "\n", "\t/* find program is distributed in the same CPU;\n", "\t\t\t * other entities belongs\n", " */\n", "int hrtick_update_event_probe_ops = {\n", "\t.handler_t trigger ops associated with interrupt.h>\n", "#include \n", "#include hash_entry(swap, offset)\n", "{\n", "\tcycle_t cycle_t T0, T1, delta;\n", "}\n", "\n", "/*\n", " * Old cruft\n", " */\n", "SYSCALL_DEFINE3(sched_granularity(void)\n", "{\n", "}\n", "\n", "#else\n", "static inline int throttled_clock for running callbacks(struct perf_event_trigger_data: Trigger-specific module '%s' (not found, NULL);\n", "\t\tbreak;\n", "\tcase KDB_DB_SSBPT,\t/* Breakpoint error;\n", "}\n", "\n", "/*\n", " * kernel\t\t\t\tuser->seq = log_first_idx;\n", "\telem->next = env->src_cpu, src_nid) * 3 / 4);\n", "}\n", "\n", "ssize_t tbl_size, GFP_KERNEL);\n", "\tif (error)\n", "\t\terror = PR_TIMING_STATISTICAL)\n", "\t\t\terr = -EPERM;\n", "\t\tif (strncmp(cmp, \"no\", 2)) {\n", "\t\tint same = 1;\n", "\t}\n", "\n", "\t/* No-CBs CPU, then\n", "\t\t * PPS freq wander (ns/s) */\n", "\n", "/* Resending the column corresponding elements[r].dl))\n", "\t\t\tlargest representing\n", " * to limit memory bitmaps - free the sigevent kevent;\n", "\n", "\t/* No need to handle failure.\n", " */\n", "static inline int is_hardlockup_panic)\n", "\t\t\tpanic(\"Attempt to move forward\n", " * @nodemask;\n", "\tif (percpu_irq(unsigned int cpuset change.\n", " */\n", "void *trigger_ops,\n", "\t\t.spd_release,\n", "\t};\n", "\tstruct rq *rq = rq_of_rt_rq(rt_se);\n", "\n", "\tif (bp->bp_enable_print,\n", "};\n", "EXPORT_SYMBOL(msecs_to_jiffies and returns filter.\n", " *\n", " * Create links from the\n", "\t\t * not then unload or not.\n", " *\n", " * Those are no failure means event_trigger_free,\n", "};\n", "\n", "static inline long __start_an\n" ] } ], "source": [ "print(generate_token_text(train_token_LM(linux, 8), length=3000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Character Models *are* Token Models\n", "\n", "Although it was pedagogically simpler to present the character models first; we could have skipped that and just shown the code for the token models. Then a character model is just a token model where the data has been \"tokenized\" so that each character is a separate token. We can show that the resulting models are exactly equal:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_LM(data, 4) == train_token_LM(data, 4) " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }