{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# The Effectiveness of Generative Language Models\n", "\n", "This notebook is an expansion of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level *n*-gram language models, which in turn was a response to [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recurrent neural network (RNN) language models. \n", "\n", "The term [**generative AI**](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text). \n", "\n", "In 2015 Karpathy's point was that recurrent neural networks were unreasonably effective at generating good text, even though they are at heart rather simple. Goldberg's point was that, yes, they are effective, but actually most of the magic is not in the RNNs, it is in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg and Karpathy agree that the RNN captures some aspects of C++ code that the simpler model does not.\n", "\n", "\n", "## Definitions\n", "\n", "- A **generative language model** is a model that, when given an initial text, can predict what tokens come next; it can generate a continuation of a partial text. (And when the initial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*t* | *h*), the probability distribution that the next token will be *t*, given a history of previous tokens *h*. The probability distribution is estimated by looking at a training corpus of text.\n", "\n", "- A **token** is a unit of text. It can be a single character (as covered by Karpathy and Goldberg) or more generally it can be a word or a part of a word (as allowed in my implementation).\n", "\n", "- A generative model stands in contrast to a **discriminative model**, such as an email spam filter, which can discriminate between spam and non-spam, but can't be used to generate a new sample of spam.\n", "\n", "\n", "- An ***n*-gram model** is a generative model that estimates the probability of *n*-token sequences. For example, a 5-gram character model would be able to say that given the previous 4 characters `'chai'`, the next character might be `'r'` or `'n'` (to form `'chair'` or `'chain'`). A 5-gram model is also called a model of **order** 4, because it maps from the 4 previous tokens to the next token.\n", "\n", "- A **recurrent neural network (RNN) model** is more powerful than an *n*-gram model, because it contains memory units that allow it to retain information from more than *n* tokens. See Karpathy for [details](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).\n", "\n", "- Current **large language models** such as Chat-GPT, Claude, Llama, and Gemini use an even more powerful model called a [transformer](https://en.wikipedia.org/wiki/Transformer_%28deep_learning_architecture%29). Karpathy has [an introduction](https://www.youtube.com/watch?v=zjkBMFhNj_g&t=159s).\n", "\n", "## Training Data\n", "\n", "A language model learns probabilities by observing a corpus of text that we call the **training data**. 
\n", "\n", "Both Karpathy and Goldberg use the works of Shakespeare (about 800,000 words) as their initial training data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 167204 832301 4573338 shakespeare_input.txt\n" ] } ], "source": [ "! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt\n", "! wc shakespeare_input.txt \n", "# Print the number of lines, words, and characters" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Before we proceed any further, hear me speak.\n", "\n", "All:\n", "Speak, speak.\n", "\n", "First Citizen:\n", "You are all resolved rather to die than to famish?\n" ] } ], "source": [ "! head -8 shakespeare_input.txt \n", "# First 8 lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Code for n-Gram Model\n", "\n", "I do some imports and then define two data types:\n", "- A `Token` is an individual unit of text, a string of one or more characters.\n", "- A `LanguageModel` is a subclass of `defaultdict` that maps a history of length `order` tokens to a `Counter` of the number of times each token appears immediately following the history in the training data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import random\n", "from typing import *\n", "from collections import defaultdict, Counter\n", "\n", "Token = str # Datatype to represent a token\n", "\n", "class LanguageModel(defaultdict): \n", " \"\"\"A mapping of {'history': Counter(next_token)}.\"\"\"\n", " def __init__(self, order):\n", " self.order = order\n", " super().__init__(Counter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I define two main functions that do essentially all the work:\n", "\n", "- `train_LM` takes a sequence of tokens (the training data) and an integer `order`, and builds a language model, formed by counting the times each token *t* occurs and storing that under the entry for the history *h* of `order` tokens that precede *t*. \n", "- `generate_tokens` generates a random sequence of tokens. At each step it looks at the history of previously generated tokens and chooses a new token at random from the language model's counter for that history." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def train_LM(tokens, order: int) -> LanguageModel:\n", " \"\"\"Create and train a language model of given order on the given tokens.\"\"\"\n", " LM = LanguageModel(order)\n", " history = []\n", " for token in tokens:\n", " LM[cat(history)][token] += 1\n", " history = (history + [token])[-order:] \n", " return LM\n", "\n", "def generate_tokens(LM: LanguageModel, length=1000, start=()) -> List[Token]:\n", " \"\"\"Generate a random text of `length` tokens, with an optional start, from `LM`.\"\"\"\n", " tokens = list(start)\n", " while len(tokens) < length:\n", " history = cat(tokens[-LM.order:])\n", " tokens.append(random_sample(LM[history]))\n", " return tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are three auxiliary functions:\n", "- `gen` is a convenience function to call `generate_tokens`, concatenate the resulting tokens, and print them.\n", "- `random_sample` randomly chooses a single token from a Counter, with probability in proportion to its count.\n", "- `cat` is a utility function to concatenate strings (tokens) into one big string." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def gen(LM: LanguageModel, length=1000, start=()) -> None:\n", " \"\"\"Call generate_tokens and print the resulting tokens.\"\"\"\n", " print(cat(generate_tokens(LM, length, start)))\n", " \n", "def random_sample(counter: Counter) -> Token:\n", " \"\"\"Randomly sample a token from the counter, proportional to each token's count.\"\"\"\n", " return random.choices(list(counter), weights=list(counter.values()), k=1)[0]\n", "\n", "cat = ''.join # Function to join strings together" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train a character-level language model of order 4 on the Shakespeare data. We'll call the model `LM`. 
(Note that saying `tokens=data` means that the sequence of tokens is equal to the sequence of characters in `data`; in other words each character is a token.)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "data = open(\"shakespeare_input.txt\").read()\n", "\n", "LM = train_LM(tokens=data, order=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples of what's in the model, for various 4-character histories:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'n': 78, 'r': 35})" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"chai\"]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'n'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_sample(LM[\"chai\"])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Counter({'p': 1360,\n", " 's': 2058,\n", " 'l': 1006,\n", " 'o': 530,\n", " 'g': 1037,\n", " 'c': 1561,\n", " 'a': 554,\n", " 'C': 81,\n", " 'r': 804,\n", " 'h': 1029,\n", " 'R': 45,\n", " 'd': 1170,\n", " 'w': 1759,\n", " 'b': 1217,\n", " 'm': 1392,\n", " 'v': 388,\n", " 't': 1109,\n", " 'f': 1258,\n", " 'i': 298,\n", " 'n': 616,\n", " 'V': 18,\n", " 'e': 704,\n", " 'u': 105,\n", " 'L': 105,\n", " 'y': 120,\n", " 'A': 29,\n", " 'H': 20,\n", " 'k': 713,\n", " 'M': 54,\n", " 'T': 102,\n", " 'j': 99,\n", " 'q': 171,\n", " 'K': 22,\n", " 'D': 146,\n", " 'P': 54,\n", " 'S': 40,\n", " 'G': 75,\n", " 'I': 14,\n", " 'B': 31,\n", " 'W': 14,\n", " 'E': 77,\n", " 'F': 103,\n", " 'O': 3,\n", " \"'\": 10,\n", " 'z': 6,\n", " 'J': 30,\n", " 'N': 18,\n", " 'Q': 7})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LM[\"the \"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So `\"chai\"` is followed by either `'n'` or `'r'`, and almost any letter can follow `\"the \"`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Shakespeare\n", "\n", "We can generate a random text from the order-4 character model:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First.\n", "\n", "ROSALIND:\n", "Warwick, but all house, stillow far into my ghost\n", "prescience,\n", "And will make the emperous in you gods!--One: that\n", "Is ther,\n", "that we stranglish, where assure, sir; a be king I will be accusers kind.\n", "\n", "SIR HUGH EVANS:\n", "Four grace\n", "By so ye\n", "As the admit trustice; but my wantony of thou did; and the shape Hectors:\n", "Borest long.\n", "Thou Marchs of him: his honour?\n", "I this? hath is feed.\n", "\n", "PANDARUS:\n", "You anon her for you.\n", "\n", "OCTAVIUS CAESAR:\n", "You have saw to be a Jew, Orland would now welcomes no little prison! 
love the rest.\n", "\n", "MARTIUS CAESAR:\n", "Well, and the patience is adverthrowing me, I shall thee go thee whom I than to such and thy absolved\n", "it, the ques\n", "Weep a slighty sound,\n", "Lychorse.\n", "\n", "DESDEMONA:\n", "This let us,\n", "And so\n", "On you, follow and epitation.\n", "\n", "FALSTAFF:\n", "Tell down.\n", "These will luck'd out thee will teach of my apparis it, take of honest\n", "Then for adors; the customary shot of that's place upstarved, ruin throne.\n", "\n", "LAUNCE:\n", "None lowly death,\n", "Hath Parising upon all, bles, the live.\n", "\n", "VIOLANUS:\n", "\n" ] } ], "source": [ "gen(LM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Order 4 captures the structure of plays, mentions some characters, and generates mostly English words. But the words don't always go together to form grammatical sentences, and there is certainly no coherence or plot. \n", "\n", "## Generating Order 7 Shakespeare\n", "\n", "What if we increase it to order 7? Or 10? We find that the output gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler *n*-gram model." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Which priest, Camillo,\n", "I conjurations.\n", "\n", "GOWER:\n", "Here's Wart; thou at supper in't; and heart: but for that she shunn'd,\n", "Nature of that are so many countryman, there's\n", "another he folly could have spoke with me?\n", "\n", "VINCENTIO:\n", "You cannot cog, I can scarce any joy\n", "Did ever saw a smith stand by the earth\n", "I' the pedlar;\n", "Money's a merited. These gloves are not palter it.\n", "\n", "DUKE OF YORK:\n", "I will go or tarrying you.\n", "\n", "PRINCE HENRY:\n", "Well, young son of Herne the shore the top,\n", "And sigh a note for a king in the lean past your ambitious is the question: he swords:\n", "Thanks, Rosencrantz and gentlemen: they are comes here.\n", "Then, let grass grow dim. Fare you married Juliet?\n", "\n", "LUCIANA:\n", "Say, I would pardon. So it must be annoyance that thou shall fair praise\n", "Have I something the lowest plant;\n", "Whose in our army and unyoke.\n", "\n", "BEDFORD:\n", "'Twas very much content,\n", "Clarence, be assure you will.\n", "\n", "COMINIUS:\n", "One word more: I am out--\n", "That canst thou speak of Naples; 'twixt\n", "The flight, never\n", "Have writ to his ca\n" ] } ], "source": [ "gen(train_LM(data, order=7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 Shakespeare" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Citizen:\n", "Why answer not?\n", "\n", "ADRIANA:\n", "By thee; and put a tongue,\n", "That in civility and patience waiteth on true sorrow.\n", "And see where comes another man, if I live.\n", "\n", "TITUS ANDRONICUS:\n", "Titus, prepare yourself in Egypt?\n", "\n", "Soothsayer:\n", "Beware the foul fiend! 
five\n", "fiends have brought to be commander of the night gone by.\n", "\n", "OCTAVIUS CAESAR:\n", "O Antony, beg not your offence:\n", "If you accept of grace: and from thy bed, fresh lily,\n", "And when she was gone, but his judgment pattern of mine eyes;\n", "Examine other be, some Florentine,\n", "Deliver'd, both in one key,\n", "As if our hands, our lives--\n", "\n", "MACBETH:\n", "I'll call them all encircle him about the streets?\n", "\n", "First Gentleman:\n", "That is intended drift\n", "Than, by concealing it, heap on your way, which\n", "is the way to Chester; and I would I wear them o' the city.\n", "\n", "Citizens:\n", "No, no, no; not so; I did not think,--\n", "My wife is dead.\n", "Your majesty is\n", "returned with some few bands of chosen shot I had,\n", "That man may question? You seem to have said, my noble and valiant;\n", "For thou has\n" ] } ], "source": [ "gen(train_LM(data, order=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Probabilities and Smoothing\n", "\n", "Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P*(*t* | *h*) can be computed as follows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def P(t, h, LM=LM): \n", " \"The probability that token t follows history h.\"\"\"\n", " return LM[h][t] / sum(LM[h].values())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09286165508528112" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'the ')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6902654867256637" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('n', 'chai')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.30973451327433627" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('r', 'chai')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P('s', 'chai')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P(' ', 'chai')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shakespeare never wrote about \"chaise longues,\" or \"chai tea\" so the probability of an `'s'` or `' '` following `'chai'` is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters `'chais'` or `'chai '` to appear in a gebnerated text, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. In this notebook we stick to the basic unsmoothed model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start the same. Why is that? 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Starting Text\n", "\n", "One thing you may have noticed: all the generated passages start the same. Why is that? Because the training data happens to start with the line \"First Citizen:\", and so when we call `generate_tokens`, we start with an empty history, and the only thing that follows the empty history in the training data is the letter \"F\", the only thing that follows \"F\" is \"i\", and so on, until we get to a point where there are multiple choices. We could get more variety in the start of the generated text by breaking the training text up into multiple sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of tokens/characters.\n", "\n", "We can give a starting text to `generate_tokens` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference. For example, the following won't make the model generate a story about Romeo:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ROMEO:\n", "Ay, strong;\n", "A little. Prince of such matten a virtue's states:\n", "Now country's quent again,\n", "A most unacceptre her no leave-thou true shot, and, to\n", "things of mine hath done alehousand pride\n", "For be, whithee thus with of damm'st fly and came which he rank,\n", "Put not wages woman! Harform\n", "So long agon's side,\n", "Singer's me her my slinged! He wit:' quoth Dorse! art than manner\n", "straine.\n", "\n", "AGAMEMNON:\n", "I would says.\n", "\n", "BEATRICE:\n", "Between my secreason ragined our uncrowned justing in thither naturer?\n", "\n", "GLOUCESTER:\n", "The king.\n", "\n", "DOGBERRY:\n", "Do now nothink you\n", "The tired.\n", "\n", "HORATIAN:\n", "It stray\n", "To see, peers, eo me, while; this days this man carbona-mi's, and for\n", "in this late ther's loath,\n", "Diminutes, shepher, sir.\n", "\n", "BARDOLPH:\n", "So.\n", "\n", "PETRUCHIO:\n", "'Tis Angels upon me?\n", "Hath no fare on,\n", "what's the\n", "moods.\n", "\n", "KING RICHARLES:\n", "Thou odourse first Servantas, given in the life' thee, he same world I have spirit,\n", "Or lord Phoebus' Corne.\n", "\n", "KENT:\n", "Go you shafts that satisfied; leaven blession what charge of a this: for.\n", "\n", "ACHIMO:\n", "'Zou\n" ] } ], "source": [ "gen(LM, start='ROMEO')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Linux Kernel C Code\n", "\n", "Goldberg's point is that the simple character-level *n*-gram model performs about as well as the more complex RNN model on Shakespearean text. \n", "\n", "But Karpathy also trained an RNN on Linux-kernel C code. Let's see what we can do with a 6-megabyte sample of kernel source as training data." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 241465 759639 6206997 linux_input.txt\n" ] } ], "source": [ "! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt\n", "! wc linux_input.txt" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "linux = open(\"linux_input.txt\").read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Order 10 C\n", "\n", "We'll start with an order-10 character model, and compare that to an order-20 model. 
We'll generate a longer text, because sometimes 1000 characters ends up being just one long comment." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/*\n", " * linux/kernel.h>\n", "#include