<div style="text-align: right" align="right"><i>Peter Norvig<br> 2019, revised Jan 2024<br>Based on <a href="https://nbviewer.org/gist/yoavg/d76121dfde2618422139https://nbviewer.org/gist/yoavg/d76121dfde2618422139">Yoav Goldberg's 2015 notebook</a></i></div> 

# Generative Character-Level Language Models

This is a variant of [**Yoav Goldberg's 2015 notebook**](https://nbviewer.org/gist/yoavg/d76121dfde2618422139) on character-level language models, which in turn was a response to  [**Andrej Karpathy's 2015 blog post**](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on recursive neural network (RNN) language models. The term [generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligencehttps://en.wikipedia.org/wiki/Generative_artificial_intelligence) is all the rage these days; it refers to computer programs that can *generate* something new (such as an image or a piece of text) based on a model learned from training data. Back in 2015 generative AI was just starting to take off, and Karpathy's point was that the RNNs were unreasonably effective at generating good text, even though they are at heart quite simple. Goldberg's point was that, yes, that's true, but actually most of the magic is not in the RNNs, it is  in the training data itself, and an even simpler model (with no neural nets) does just as well at generating English text. Goldberg did agree with Karpathy that the RNN captures some aspects of C++ code that the character-level model does not.

My implementation is similar to Goldberg's, but I updated his code to use Python 3 instead of Python 2, and made some additional changes for simplicity and clarity. (This makes the code less efficient than it  could be, but plenty fast enough.) 

## Definition

What do we mean by a **generative character-level language model**? It means a model that, when given a sequence of characters, can predict what character comes next; it can generate a continuation of a partial text. (And when the partial text is empty, it can generate the whole text.) In terms of probabilities, the model represents *P*(*c* | *h*), the probability distribution that the next character will be *c*, given a history of previous characters *h*. For example, given the previous characters `'chai'`, a character-level model should learn to predict that the next character is probably `'r'` or `'n'` (to form the word `'chair'` or `'chain'`). Goldberg calls this a model of order 4 (because it considers histories of length 4) while other authors call it an *n*-gram model with *n* = 5 (because it represents the probabilities of sequences of 5 characters).

## Training Data

How does the language model learn these probabilities? By observing a sequence of characters that we call the **training data**. Both Karpathy and Goldberg use the complete works of Shakespeare as their initial training data:

In [1]:
! [ -f shakespeare_input.txt ] || curl -O https://norvig.com/ngrams/shakespeare_input.txt
! wc shakespeare_input.txt # Print the number of lines, words, and characters

  167204  832301 4573338 shakespeare_input.txt


In [2]:
! head shakespeare_input.txt # First 10 lines

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:


## Python Code

There are four main parts to the code:

- `LanguageModel` is a `defaultdict` that maps a history *h* to a `Counter` of the number of times each character *c* appears immediately following *h* in the training data. 
- `train_LM` takes a string of training `data` and an `order`, and builds a language model, formed by counting the times each character *c* occurs and storing that under the entry for the history *h* of characters that precede *c*. 
- `generate_text` generates a random text, given a language model, a desired length, and an optional start of the text. At each step it looks at the previous `order` characters and chooses a new character at random from the language model's counter for those previous characters.
- `random_sample` randomly chooses a single character from a counter, with each possibility chosen in proportion to the character's count.

In [3]:
import random
from collections import defaultdict, Counter

class LanguageModel(defaultdict): """A mapping of {history: Counter(characters)}."""

def train_LM(data: str, order: int) -> LanguageModel:
    """Train a character-level language model of given `order` on the training `data`."""
    LM = LanguageModel(Counter)
    LM.order = order
    history = ''
    for c in data:
        LM[history][c] += 1
        history = (history + c)[-order:] # add c to history; truncate history to length `order`
    return LM

def generate_text(LM: LanguageModel, length=1000, text='') -> str:
    """Generate a random text of `length` characters, with an optional start, from `LM`."""
    while len(text) < length:
        history = text[-LM.order:]
        text = text + random_sample(LM[history])
    return text

def random_sample(counter: Counter) -> str:
    """Randomly sample from the counter, proportional to each entry's count."""
    i = random.randint(1, sum(counter.values()))
    cumulative = 0
    for c in counter:
        cumulative += counter[c]
        if cumulative >= i: 
            return c

Let's train a model of order 4 on the Shakespeare data. We'll call the model `LM`, and we'll do some queries of it:

In [4]:
data = open("shakespeare_input.txt").read()

LM = train_LM(data, order=4)

In [5]:
LM["chai"]

Counter({'n': 78, 'r': 35})

In [6]:
LM["the "]

Counter({'p': 1360,
         's': 2058,
         'l': 1006,
         'o': 530,
         'g': 1037,
         'c': 1561,
         'a': 554,
         'C': 81,
         'r': 804,
         'h': 1029,
         'R': 45,
         'd': 1170,
         'w': 1759,
         'b': 1217,
         'm': 1392,
         'v': 388,
         't': 1109,
         'f': 1258,
         'i': 298,
         'n': 616,
         'V': 18,
         'e': 704,
         'u': 105,
         'L': 105,
         'y': 120,
         'A': 29,
         'H': 20,
         'k': 713,
         'M': 54,
         'T': 102,
         'j': 99,
         'q': 171,
         'K': 22,
         'D': 146,
         'P': 54,
         'S': 40,
         'G': 75,
         'I': 14,
         'B': 31,
         'W': 14,
         'E': 77,
         'F': 103,
         'O': 3,
         "'": 10,
         'z': 6,
         'J': 30,
         'N': 18,
         'Q': 7})

So `"chai"` is followed by either `'n'` or `'r'`, and almost any letter can follow `"the "`.

## Generating Shakespeare

Let's try to generate random text based on character language models of various orders, starting with order 4.

In [7]:
print(generate_text(LM))

First
five men crown tribunes an aunt, holden a suddenly daught.

HECTOR CAIUS:
And drawn by your confess, the jangle such a conferritoriest an make a cost, as you were the world,--

BENVOLIO:
Where shorter:
A stom old been;
Get you may parts food;
I serve memory her. He is come fire to the
skirted great knowledges,
monster Ajax, thou do thy heart to spend theat--unhappy in and so! There shall spectar? Goodman! we are mine an heart in then
The stomach times bear too: the emperformane
And least,
And then you are my grate and as
A woman, I cannon down!' 'Course of my
love,
The tillo, away heard
You soul issue us comes hand,
To Julius, that pattering teach thither
that for come in
a fathere growned from far
Crying, from yoursed into this.

SILVIA:
Wilt before.

PAULINA:
Might him: but Marshall be my fail age in fat, remember than arms? calls and the compulsion liar came him. If thy is bled this ever; or your tempts,
Open an of think it from him our changentleman, more ther titless them th

Order 4 captures the structure of plays, mentions some characters, and generates mostly English words, although they don't always go together to form grammatical sentences, and there is certainly no coherence or plot. 

## Generating Order 7 Shakespeare

What if we increase it to order 7? Or more? We find that it gets a bit better, roughly as good as the RNN models that Karpathy shows, and all from a much simpler model.

In [8]:
print(generate_text(train_LM(data, order=7)))

First Clown:
Like an ass of France to kill horns;
And Brutus, and in such a one as he weariness does any strange fish! Were I adore.' When we arrives him not the time, whither argument?

MARIA:
Get thee on.

SIR TOBY BELCH:
Why, 'tis well.

CLOTEN:
Sayest trusts to your royal graces,
I will draw his heinous and holiness
Than are to breath.

ISABELLA:
Madam, pardon me: teach you, sirs, be it lying so, yet but the 'ever' last?

EDWARD:
An oath in it to bid you. You a lover dearly to our roses;
For intercepted pardon him,
And even now
In any branches, wherefore let us go seek him:
There's a good master; thyself a wise men,
Let him when you are
going to his entering
into so quickly.
Which all bosom as a bell,
Remember thee who I am. Good Paulina more.' And in an hour?

ORLANDO:
As I wear
In the eastern gate, horse!
Do but he hath astonish thee apt;
And this pardon me, I conjure them:
To show more offering in saying them, whose beauty starves the night
Did Jessica:
Besides, Antony. But art 

## Generating Order 10 Shakespeare

In [9]:
print(generate_text(train_LM(data, order=10)))

First Citizen:
That cannot go but thirty miles to ride yet ere day.

PUCK:
Now the pleasure of the realm in farm.

LORD WILLOUGHBY:
And daily graced by an inkhorn mate,
We and our power,
Let us see:
Write, 'Lord have mercies
More than all his creature in her, you may
say they be not take my plight shall lie
His old betrothed lord.

URSULA:
She's limed, I warrant;
speciously on him;
Lose not so near:
I had rather be at a breakfast to the abject rear,
O'er-run and trampled on: then what is this law?

First Murderer:
What speech, my lord
For certain, and is gone aboard a
new ship to purge him of the affected.

PRINCE:
Give me a copy of the forlorn French!
Him I forgive thee,
Unnatural though the very life
Of my dear friend Leonato hath
invited you all. I tell him we shall have 'em
Talk us to silence.

ANNE:
You can do better yet
And show the increasing in love?

LUCETTA:
That they travail for, if it were not virtue, not
For such proceeding by the way
Should have both the parties of suspic

## Aside: Probabilities

Sometimes we'd rather see probabilities, not raw counts. Given a language model `LM`, the probability *P(c* | *h)* can be computed as follows:

In [10]:
def P(c, h, LM=LM): 
    "The probability that character c follows history h."""
    return LM[h][c] / sum(LM[h].values())

In [11]:
P('s', 'the ')

0.09286165508528112

In [12]:
P('n', 'chai')

0.6902654867256637

In [13]:
P('r', 'chai')

0.30973451327433627

In [14]:
P('s', 'chai')

0.0

Shakespeare never wrote about "chaise longues," so the probability of an `'s'` following `'chai'` is zero, according to our language model. But do we really want to say it is absolutely impossible for the sequence of letters `'chais'`  to appear, just because we didn't happen to see it in our training data? More sophisticated language models use [**smoothing**](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing) to assign non-zero (but small) probabilities to previously-unseen sequences. But in this notebook we stick to the simple unsmoothed model.

## Aside: Starting Text

One thing you may have noticed: all the generated passages start with "F". Why is that? Because the training data happens to start with the line "First Citizen:", and so when we call `generate_text`, we start with an empty history, and the only thing that follows the empty history is the letter "F". We could get more variety in the generated text by breaking the training text up into separate sections, so that each section would contribute a different possible starting point. But that would require some knowledge of the structure of the training text; right now the only assumption is that it is a sequence of characters.

We can give a starting text to `generate_text` and it will continue from there. But since the models only look at a few characters of history (just 4 for `LM`), this won't make much difference.

In [15]:
print(generate_text(LM, text='ROMEO:'))

ROMEO:
The kill not my come.

FALSTAFF:
No, good, sister, whereforeson, and strive merry Blance; and by the like a heart of mind;
And him sough they bodied;
The could thy made know are corona's and proved,
To fresh are did call'd Messenge you into
termity.
On they found
they.

Firstling our such a score you a
touch'd,
I make you them compossessed in dead,
And when the hadst be, thy lament:
Your to doth due that ring, and quiet not be fetch hear: but stop. All
the map o'erween attery seat most wonder of beat imprison want me hear to the general we wick outward's he gentre, doth receive doom; and forth
Do you know'd cond,
And root us madam, yours of with her of it is.

SHALLOW:
'Swound and cry a bravel your land, crystally carry with than and present they display
Is no remembrass, each and monkey, thrive does wife countain,
We will we marry, I
shall have I never ask, the reason thes:
He's good for me: thou and me good to bedded:
Again, thee, death made best.

MARK ANTONIO:
Amen, stays to

# Linux Kernel C++ Code

Goldberg's point is that the simple character-level model performs about as well as the much more complex RNN model on Shakespearean text. But Karpathy also trained an RNN on 6 megabytes of Linux-kernel C++ code. Let's see what we can do with that training data.

In [16]:
! [ -f linux_input.txt ] || curl -O https://norvig.com/ngrams/linux_input.txt
linux = open("linux_input.txt").read()
! wc linux_input.txt

  241465  759639 6206997 linux_input.txt


## Generating Order 10 C++

We'll start with an order-10 model, and compare that to an order-20 model. WEe'll generate a longer text, because sometimes a 1000-character text ends up being just one long comment.

In [17]:
print(generate_text(train_LM(linux, order=10), length=3000))

/*
 * linux/kernel.h>
#include <linux/vmalloc.h>
#include <linux/seq_file.h>
#include <linux/hrtimer.h>
#include <linux/string.h>
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/kallsyms.h>
#include <linux/splice.h>
#include <linux/list.h>

#include <linux/mm.h>
#include <linux/rmap.h>		/* try_to_freeze();

		if (hlock->references) {
		hlock_curr;
	int cpu;

	/* initiate RCU priority unchanged. Otherwise just see
	 * if we get it wrong the load-balancer moves */
	update_sched_clock_stable()) {
		*(char **)kp->arg));			\
	}								\
	for ((cmd) = kdb_base_commands, list) {
		for (i = 0; i < length; i++)
		seq_printf(m, "Per CPU device: %d\n", ret);
		return error;
	}

	if (skip_equal && f->op != Audit_equal)
			return 0;
}

static bool migrated to a second choice node will lead to deadlock detection for find_existing_css_set(struct gcov_iterator *iter)
{
	if (iter->idx > i)
		return;

	if (graph)
		ret = rb_head_page_activate(struct tick_devi

## Order 20 C++

In [18]:
print(generate_text(train_LM(linux, order=20), length=3000))

/*
 * linux/kernel/irq/manage.c
 *
 * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar
 * Copyright(C) 2007, Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *   Guillaume Chazarain <guichaz@gmail.com>
 *
 *
 * What:
 *
 * cpu_clock(i)       -- can be used from any context, including NMI.
 * local_clock()      -- is cpu_clock() on the current cpu.
 *
 * sched_clock_cpu(i)
 *
 * How:
 *
 * The implementation either uses sched_clock() when
 * !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
static struct static_key __sched_clock_stable);
}

static void __maybe_unused rcu_try_advance_all_cbs())
		invoke_rcu_core(); /* force nohz to see update. */
		rdtp->tick_nohz_enabled_snap = tne;
		return;
	}
	if (!needreport)
		return;
	if (*firstreport) {
		pr_err("INFO: rcu_tasks detected stalls on CPUs/tasks:",
	       rsp->name);
	print_cpu_stall_info(struct rcu_state *rsp, struct rcu_node *rnp_leaf)
{
	long mask;
	struct rcu_node *rnp);
#ifdef CONFIG_HOTPLUG_CPU
	buffer->cpu_notify.notifier_call = rb_cp

## Analysis

As Goldberg says, "Order 10 is pretty much junk." But order 20 is much better. Most of the comments have a start and an end; most of the open parentheses are balanced with a close parentheses; but the braces are not as well balanced. That shouldn't be surprising. If the span of an open/close parenthesis pair is less than 20 characters then it is represented within the model, but if the span of an open/close brace is more than 20 characters, then it cannot be represented by the model. Goldberg notes that Karpathy's RRN seems to have learned to devote some of its long short-term memory (LSTM) to representing nesting level, as well as things like whether we are currently within a string or a comment. It is indeed impressive, as Karpathy says, that the model learned to do this on its own, without any input from the human engineer.

## Token Models versus Character Models

Karpathy and Goldberg both used character models, because the exact formatting of characters (especially indentation and line breaks) is important in the format of plays and C++ programs. But if you are just interested in running paragraphs of text, it is more common to use a **word** model, which represents the probability of the next word given the previous words, or a **token** model, where a token is something similar to a word. Sometimes a word is broken into several tokens; the word "dogcatcher" might become two tokens, "dog" and "catcher." One or more characters of punctuation can also form a token. In the implementation below, `train_token_LM` and `generate_token_text` are almost the same as their charac ter-model counterparts, but they deal with a list of tokens rather than a string of characters (however, in the Counters that make up the model, the keys are formed by concatenating the tokens together, in part because lists can't be keys of dicts).

One simple way of tokenizing a text is to break it up into alternating word and non-word characters; the function `tokenize` does that.But other tokenizers could be used if desired.

In [19]:
import re

TokenLanguageModel = LanguageModel # e.g. {'wherefore art thou ': Counter({'Romeo': 1})

cat = ''.join

def train_token_LM(tokens, order: int) -> TokenLanguageModel:
    """Train a character-level token language model of given order on the given tokens."""
    LM = TokenLanguageModel(Counter)
    LM.order = order
    history = []
    for token in tokens:
        LM[cat(history)][token] += 1
        history = (history + [token])[-order:] 
    return LM

def generate_token_text(LM: TokenLanguageModel, length=1000, tokens=()) -> str:
    """Generate a random text of `length` tokens, with an optional start, from `LM`."""
    tokens = list(tokens)
    while len(tokens) < length:
        history = cat(tokens[-LM.order:])
        tokens.append(random_sample(LM[history]))
    return cat(tokens)

def tokenize(text: str) -> list: 
    """Break text up into alternating word-character and non-word-character strings."""
    return re.findall(r'\w+|\W+', text)

In [20]:
assert tokenize('wherefore art thou Romeo?') == ['wherefore', ' ', 'art', ' ', 'thou', ' ', 'Romeo', '?']
assert tokenize(''' */
int probe_irq_off(unsigned long val)
{''') == [' */\n', 'int', ' ', 'probe_irq_off', '(', 'unsigned', ' ', 'long', ' ', 'val', ')\n{']

We can train a token model on the Shakespeare data. A model of order 6 keeps a history of three word and 3 non-word tokens (all concatenated together):

In [21]:
TLM = train_token_LM(tokenize(data), 6)

In [22]:
TLM['wherefore art thou ']

Counter({'Romeo': 1})

In [23]:
TLM['not in our ']

Counter({'stars': 1, 'Grecian': 1})

In [24]:
TLM['end of my ']

Counter({'life': 1, 'business': 1, 'dinner': 1, 'time': 1})

We see that the quality of the token models is similar to character models, and improves from 6 tokens to 8:

In [25]:
print(generate_token_text(TLM))

First Citizen:
Before we proceed any further, hear me speak.

CORIOLANUS:
Cut me to pieces, Volsces; men and lads,
Stain all your edges on me. Boy! false hound!
If you have told Diana's altar to protest
For aye austerity and single life.

DEMETRIUS:
Relent, sweet Hermia: and, Lysander, yield
Thy crazed title to my certain right.

LYSANDER:
You have her father's eyes up close as oak-
He thought 'twas witchcraft--but I am heart-burned an hour after.

HERO:
He is the half part of a blessed man,
Left to be finished by such as she;
And she again wants nothing, to name want,
If want it be not gone already,
Even at that news he dies; and then the hearts
Of all the world was of my counsel
In my whole course of love, the tidings of her death:
And here he comes in the habit of a light wench: and thereof
comes that the wenches say 'God damn me;' that's as
much to say 'God make me a light. Know we this face or no?
Alas my friend and my dear hap to tell.

FRIAR LAURENCE:
The grey-eyed morn smiles o

In [26]:
print(generate_token_text(train_token_LM(data, 8)))

First Citizen:
Give me the Lord preserved me long:
To build me the fortunes, beyond his heart,
To stay the villain!

Second Messenger
That no manner was I crept out of my place beneath.

HAMLET:
Sir, here that be?

Clown:
I would you have worn a visor! what costs they have engaol'd my tongue,
That more respective lenity,
To seem to under-bear. O, that's dragon-like, awhile.

Hostess:
A pair so famous college of
wit-crackers of
manners, as you do assistance be only mean
For power,
Bending the ancient trade than you do this?

BORACHIO:
Yea, every idle, nice custom 'gainst it:
We are to me a stool and dead men's son, sir.

TITUS ANDRONICUS:
Follow thee again,
And make thee after. When shall thinking too liberal arts
With thought? I have but pinn'd with some mischance he's hurt i' the half-achieved,
As to be cut, and as leaky as an
unstanched thirst
York and Tartar's bower,
Whose wanton lust the senate.
But I shall they
yet look thee, my boy! thy face,
While you perceive
no truth and heath

## C++ Token Model

Similar remarks hold for token models trained on C++ data:

In [27]:
print(generate_token_text(train_token_LM(linux, 8), length=3000))

/*
 * linux/kernel.h>
#include <linux/module.h>
#include <linux/kdb.h>
#include <asm/irq_regs()));
	profile_lock held.
 */
LIST_HEAD(ftrace_start_proxy_lock(&probe_list.next, type);

		data->chip->irq_cpu_online(cpu)) {
		/* created separates the end of the sym_name != '_')
			return;
}

#ifdef COMPAT_RLIM_INFINITY) {
		rt_clear_integral.
 *
 * This gets called when:
 * - an unknown object\n");
		return;

 again:
	add_time_stats(class);
	printk("\nstack backtrace_seq_puts(s, "\ttype_len = 0;
	int dead_cpu: Callback list
 *  - we are done.
	 */
	if (!(torture_cleanup has been done or not.
 *
 * This is used to put_user(r.rlim_max;
}

/* Clean it up and exit (not in
 * case.
	 */
	down_write operations = {
	.func			= stack_trace(&trace_ops *op;
	int len1;
	int i;
	int i;

	for (thr = 0; thr < nr_threads; thr++) {
		*q++ = ':';
	*q++ = ':';
	*q++ = hex_asc[(error % 10)];
	pkt[3] = '\0';
			parse_error(ps, FILT_ERR_FIELD_NOT_FOUND:
		if (DST != SRC) {
			iter++;
	}
	err2 = hib_wait_on_bio_

## Character Models *are* Token Models

Although it was pedagogically simpler to present the character models first; we could have skipped that and just shown the code for the token models. Then a character model is just a token model where the data has been "tokenized" so that each character is a separate token. We can show that the resulting models are exactly equal:

In [28]:
train_LM(data, 4) == train_token_LM(data, 4) 

True