{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
pairs: | Ø, wird w, ird | wi, rd | wir, d | wird, Ø | Notes: (a, b) pair\n",
" | deletions: | Ø+ird | w+rd | wi+d | wir+Ø | Delete first char of b\n",
" | transpositions: | Ø+iwrd | w+rid | wi+dr | Swap first two chars of b\n",
" | replacements: | Ø+?ird | w+?rd | wi+?d | wir+? | Replace char at start of b\n",
" | insertions: | Ø+?+wird | w+?+ird | wi+?+rd | wir+?+d | wird+?+Ø | Insert char between a and b\n",
" | |
\n",
"\n",
"\n",
"Below we define `Pwords` to compute the product of the individual word probabilities:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"def Pwords(words: List[Word]) -> float:\n",
" \"Probability of a sequence of words, assuming each word is independent of others.\"\n",
" return Π(Pword(w) for w in words)\n",
"\n",
"def Π(nums) -> float:\n",
" \"Multiply the numbers together. (Like `sum`, but with multiplication.)\"\n",
" result = 1\n",
" for num in nums:\n",
" result *= num\n",
" return result"
]
},
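{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (and an aside: on Python 3.8+ the standard library's `math.prod` computes the same product; `Π` keeps the notebook working on older versions such as the 3.7 kernel used here):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"assert Π([0.5, 0.5, 2.0]) == 0.5\n",
"if sys.version_info >= (3, 8):   # math.prod was added in Python 3.8\n",
"    from math import prod\n",
"    assert Π([0.5, 0.5, 2.0]) == prod([0.5, 0.5, 2.0])"
]
},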
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.1925826058314966e-20"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Pwords(tokens('this is a test of the good words'))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.3963205369077944e-30"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Pwords(tokens('here lies another sentence with occasionally unusual terms'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although both sentences have eight words, the second is judged to be 10 billion times less probable."
]
},
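{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick check of that claim, dividing the two values above:)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(Pwords(tokens('this is a test of the good words')) /\n",
" Pwords(tokens('here lies another sentence with occasionally unusual terms')))  # ≈ 9.2e9, roughly 10 billion"
]
},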
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task: Word Segmentation\n",
"\n",
"**Task**: *given a sequence of characters with no spaces separating words, recover the sequence of words.*\n",
" \n",
"Why? Some languages have no word delimiters: [不带空格的词](http://translate.google.com/#auto/en/%E4%B8%8D%E5%B8%A6%E7%A9%BA%E6%A0%BC%E7%9A%84%E8%AF%8D)\n",
"\n",
"In hastily-written text there might be spelling errors that include [wordsruntogether](https://www.google.com/search?q=wordsruntogether). \n",
"\n",
"Some specialized sub-genres have no word delimiters; an example is domain names in URLs. Sometimes this can go [poorly](https://www.boredpanda.com/worst-domain-names/?utm_source=google&utm_medium=organic&utm_campaign=organic), as in the website `choosespain.com` which once was a tourist information site encouraging you to \"choose Spain\" for your travel, but then they noticed that the domain name could also be parsed as \"chooses pain\".\n",
"\n",
"To find the best segmentation, we can again take advantage of the bag-of-words assumption that words are independent of each other. If we segment the text into a first word and remaining characters, then the best segmentation with that first word is arrived at by finding the best segmentation of the remaining characters (which does not depend on the first word), where \"best\" means having maximum probability according to `Pwords`. Thus if we try all possible splits, the split with the maximum probability will be the overall best.\n",
" \n",
" segment('choosespain') ==\n",
" max((['c'] + segment('hoosespain'),\n",
" ['ch'] + segment('oosespain'),\n",
" ...\n",
" ['choose'] + segment('spain'),\n",
" ...\n",
" ['choosespain'] + segment('')),\n",
" key=Pwords)\n",
"\n",
"To make this somewhat efficient, we need to avoid re-computing the segmentations of the remaining characters. This can be done explicitly by *dynamic programming* or implicitly with *memoization* or *caching* of function results, as is done by `functools.lru_cache`. Also, we needn't consider all possible lengths for the first word; we can impose a maximum length. What should it be? A little more than the longest word seen so far."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"18"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max(len(w) for w in BAG)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"def splits(text, start=0, end=20) -> Tuple[str, str]:\n",
" \"\"\"Return a list of all (first, rest) pairs; start <= len(first) <= L.\"\"\"\n",
" return [(text[:i], text[i:]) \n",
" for i in range(start, min(len(text), end)+1)]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('', 'word'), ('w', 'ord'), ('wo', 'rd'), ('wor', 'd'), ('word', '')]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"splits('word')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('c', 'hoosespain'),\n",
" ('ch', 'oosespain'),\n",
" ('cho', 'osespain'),\n",
" ('choo', 'sespain'),\n",
" ('choos', 'espain'),\n",
" ('choose', 'spain'),\n",
" ('chooses', 'pain')]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"splits('choosespain', 1, 7)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"@lru_cache(None)\n",
"def segment(text) -> List[Word]:\n",
" \"\"\"Return a list of words that is the most probable segmentation of text.\"\"\"\n",
" if not text: \n",
" return []\n",
" else:\n",
" candidates = ([first] + segment(rest)\n",
" for (first, rest) in splits(text, 1))\n",
" return max(candidates, key=Pwords)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['choose', 'spain']"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segment('choosespain')"
]
},
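{
"cell_type": "markdown",
"metadata": {},
"source": [
"Memoization is doing real work here: each sub-segmentation is computed once and then reused. We can peek at the cache with the standard `cache_info()` accessor that `lru_cache` attaches to the decorated function:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"segment.cache_info()"
]
},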
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['speed', 'of', 'art']"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segment('speedofart')"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"decl = ('wheninthecourseofhumaneventsitbecomesnecessaryforonepeoplet' +\n",
" 'odissolvethepoliticalbandswhichhaveconnectedthemwithanother' +\n",
" 'andtoassumeamongthepowersoftheearththeseparateandequalstati' +\n",
" 'ontowhichthelawsofnatureandofnaturesgodentitlethem')"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'when in the course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth the separate and equal station to which the laws of nature and of natures god entitle them'"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentence(segment(decl))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That looks good! It should be \"nature's\" not \"natures,\" but our approach does not allow apostrophes. Some more examples:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['small', 'and', 'insignificant']"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segment('smallandinsignificant')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That looks good. What about:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['large', 'and', 'insignificant', 'numbers']"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segment('largeandinsignificantnumbers')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's not how I would segment it. Let's look at the probabilities:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4.1121373608995186e-10, 1.0663880482050684e-11)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(Pwords(['large', 'and', 'insignificant']),\n",
" Pwords(['large', 'and', 'in', 'significant']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probabilities are close, but the segmentation with fewer words is preferred. The bag of words model does not know that \"small\" goes better with \"insignificant\" and \"large\" goes better with \"significant,\" because the model assumes all words are independent.\n",
"\n",
"Summary:\n",
" \n",
"- Overall, looks pretty good!\n",
"- The bag-of-words assumption is a limitation. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data: Billions of Words\n",
"\n",
"Let's move up from millions to [*billions and billions*](https://en.wikipedia.org/wiki/Billions_and_Billions) of words. I happen to have a word count data file available in the format `\"word \\t count\"`. Let's arrange to read it in:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"def load_counts(lines, sep='\\t', fn=str.lower) -> Bag:\n",
" \"\"\"Return a Bag initialized from key/count pairs, one on each line.\"\"\"\n",
" bag = Bag()\n",
" for line in lines:\n",
" word, count = line.split(sep)\n",
" bag[fn(word)] += int(count)\n",
" return bag"
]
},
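{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick illustration on two made-up lines; `fn=str.lower` merges counts that differ only in case:)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"load_counts(['The\\t2', 'the\\t3']).most_common()  # => [('the', 5)]"
]
},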
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll make `P1w` be a bag of billions of words (and, as a `Bag`, also a probability function):"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(333333, 588.124220187)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P1w = load_counts(open('count_1w.txt'))\n",
"\n",
"len(P1w), sum(P1w.values())/1e9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A third of a million distinct words with a total count of 588 billion tokens."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 23135851162),\n",
" ('of', 13151942776),\n",
" ('and', 12997637966),\n",
" ('to', 12136980858),\n",
" ('a', 9081174698),\n",
" ('in', 8469404971),\n",
" ('for', 5933321709),\n",
" ('is', 4705743816),\n",
" ('on', 3750423199),\n",
" ('that', 3400031103),\n",
" ('by', 3350048871),\n",
" ('this', 3228469771),\n",
" ('with', 3183110675),\n",
" ('i', 3086225277),\n",
" ('you', 2996181025),\n",
" ('it', 2813163874),\n",
" ('not', 2633487141),\n",
" ('or', 2590739907),\n",
" ('be', 2398724162),\n",
" ('are', 2393614870)]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P1w.most_common(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Zipf plot for 588 billion word tokens is similar to the one for a million word tokens:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEaCAYAAADkL6tQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXhU5dn48e+djQBhX4IYIGyKoGyJoAgYcEMFca2itmKtWN9qa+2vdaktsWpr69b6aou0KtoqiFZflVLcA4hRBDdQEGWTsK+BACHb/fvjnIQhzCRnJjOTmcn9ua65MufMWZ5n5mTueZbzPKKqGGOMMf4kNXYCjDHGxC4LEsYYYwKyIGGMMSYgCxLGGGMCsiBhjDEmIAsSxhhjArIgYTwTkUwRWSAi+0TkoQidY52InBmJY0eSiNwrIjtEZEtjpyVeiUi+iPwrSudqlOvMN48i0l1ESkQkOdrpCIYFCWoumIPuB1b96NrY6YpBU4AdQGtV/UVDDyYiM0Tk3oYnK7JEREWkTx2vdwN+AfRX1S7RS1nDRfOLOV6IyBgReU9EikVkXYBtRojIBw05j6p+p6oZqlrZkONEmgWJwya4H1j1Y1PtDUQkpTESFkN6AF+p3YFZWw9gp6pu8/eiXTdHEkcsf/fsB54CflnHNucBc6OTnEamqk3+AawDzvSzPhtQ4DrgO2CBu/4U4ANgD/A5kOezT09gPrAPeAt4DPiX+1oeUBTo3DhB+3ZgNbATmA20r5WWa9y07AB+7XOcZOBOd999wFKgG/A48FCtc74O3BLgvRgBfAwUu39HuOtnAOVAGVAS4P2a4Z7vP24aPgJ6BzjPlFrHe93n/fh/wBduGl4A0gMcYzKwCHjE/SzWuOmfDGwAtgHX+GzfBngW2A6sB+4CktzX+rifW7H73r7grl/gvu/73XReXisNZwIHgSr39Rnxct0A49z3v9xN++d+3uNrqz8bd/lbYLbP8gZgcF3XjvtaAXCf+3kddN/vgHn2k452wBz3s9vtPs+qdfx73OPvA94EOvq8/n33M98J/JoA//N+Ptt1AV77BBjqPlfgx8A3btoeByTAfvk+n2v1Z5PiMQ8Br5+Ifj9G4ySx/gh0wfh8iM8CLYHmwLHuhXae+895lrvcyd2nEHgYaAaMdj9sr//stwAfAlnu/k8AM2ul5e9uOgYBh4AT3Nd/CSwDjgfEfb0DMAzYxOEvw47AASDTT37buxf594EUYJK73MF9fQZwbx3v4wxgl3vOFOA5YFY929/r5/1YDHR107MC+HGA/ScDFThfZMnAvThfhI+779/Z7vuf4W7/LPAq0Mp9P1cB17mvzcT58kgC0oGRPudRoE8d+Tjic42z6yafAF/M7uu9cL6UkoBjcL5oN/q8ttt9rb5rp8D9bAa4r6fWlWc/6egAXAK0cD+/F4H/83m9ACdIHufmswC4332tP04QHO2e62Gc6yakIOG+DxtxA4H7/s4B2gLdcQLZuADHrHm/8R8kAuWhzusnot+PkT5BPDzcf7gS959hT/XF5/Mh9vLZ9jbgn7X2fwPnl1p39+Jr6fPa83j/Z18BnFHrYix3/6mq0+L762kxcIX7/GtgYoD8rQDOcp/fBMwNsN33gcW11hUCk93nM6g/SPzDZ/k8YGU92/sLElf7LP8JmBZg/8nANz7LJ7nvUabPup3AYJwgcgin3aD6tRuAAvf5s8B03/fXZ7tQg0Q8XDf51BEk3G02AEOBK9z3aDHQDyc4v+bx2ikAfufzWp159vA/OxjY7bNcANzls/w/wDz3+W/x+bGCE7jLCD1IXAc8Wev68P1RMRu4PcAxa95v/AeJQHkIeP14eb8a8ojlesFou1BV27qPC2u9tsHneQ/gMhHZU/0ARuL8Y3bFuXD3+2y/Pog09ABe8TnuCqASyPTZxrf3zAEgw33eDedXiD/PAFe7z68G/hlgu65+0rse51eMV37TJyJ3+nQKmBbKMQLY6vP8IICq1l6XgVOCSuPI/Pnm7Vc4JbDFIvKliPywnjR6EQ/XjRfzcQLVaPd5AXC6+5jvbuPl2vF9P4LKs4i0EJEnRGS9iOzFqQZsW6tnUKA8dvU9t3vOnYHO5YG/9oiGvL9ejlPX9RNRFiS8UZ/nG3AielufR0tVvR/YDLQTkZY+23f3eb4fp7gMgHuBd6p17HNrHTtdVTd6SOMGoHeA1/4FTBSRQcAJwP8F2G4TzsXoqztO0bpBVPX3erhTwI+rVzf0uEHYgfPr2jd/NXlT1S2qer2qdsUpYfy1rh5NHsXDdePlM6gOEqPc5/M5Okh4uXZ8z1Vfnmv7BU5V6nBVbY0TsMAJ7PXZjPMjytlBpAVO9VXQRCQVJ99vhbJ/A9R1/USUBYng/QuYICLniEiyiKSLSJ6IZKnqemAJcLeIpInISGCCz76rgHQROd+92O7CqSOtNg24T0R6AIhIJxGZ6DFd/wDuEZG+bu+RgSLSAUBVi3AaEv8J/FtVDwY4xlzgOBG5UkRSRORynPrcOR7TEKytOPXaEadON8PZOO9vK/c9vhXn80RELhORLHfz3ThfaNVdE8ORzli9brYC2fX0NpoPjAGau9fSQpxG7w7Ap+42QV07HvJcWyucUuEeEWkPTPWYP4CXgPEiMlJE0oDfUcd3n4gkiUg6TruJuJ9VmvvyKOALVd0bxPnDIeD1E+kTW5AIkqpuACbi9CTajhPhf8nh9/JKYDhOA+5UnLru6n2LceoZ/4HzC2s/UORz+L8ArwFvisg+nMbI4R6T9jDOl+CbwF7gSZzGr2rP4NTZB6pqQlV3AuNxfrXtxKmCGa+qOzymIVhPAv3d4nOg0k043Yzznq8B3sepA3/Kfe1k4CMRKcH5DH6mqmvd1/KBZ9x0fi+UE8fwdfOi+3eniHwSIO2rcNrsFrrLe3Hew0Vu8A312gmYZz/+jHM978DJ3zwvmXPT9iXwE5zPezPOj4CiOnYZjROQ5uKUbg7i/F9BI3V9re/6EZFpHqpxQ1LdOm8iRETycRo9r65v2winYzTOr5FsVa1qzLSY+sXKdWOOJCJfAZeq6leNnZZosZJEE+BWUfwMp+eRBQhjQuBWOT3blAIEWJBIeCJyAk633mNwiuzGmBCoalk0GopjjVU3GWOMCchKEsYYYwKyIGGMMSaghBqdsmPHjpqdnR3y/vv376dly5b1bxiHLG/xyfIWn+Itb0uXLt2hqp38vZZQQSI7O5slS5aEvH9BQQF5eXnhS1AMsbzFJ8tbfIq3vIlIwCFREqK6SUQmiMj04uLixk6KMcYklIQIEqr6uqpOadOmTWMnxRhjEkpCBAljjDGRkVBtEsYkovLycoqKiigtLW3spIRVmzZtWLFiRWMnIyJiNW/p6elkZWWRmprqeZ+ECBIiMgGY0KdPQ0d2Nib2FBUV0apVK7KzsxHxMjJ2fNi3bx+tWrVq7GRERCzmTVXZuXMnRUVF9OzZ0/N+CVHdFJY2iQ2L6b7+JdiwOHwJMyYMSktL6dChQ0IFCBN9IkKHDh2CL
pEmRJBosA2L0WcuIHvtc/DMBRYoTMyxAGHCIZTryIIEwLqFaMUhkqhCK8tg3cLGTpExMeWHP/whnTt35sQTTzxifWFhIddff30jpcpEgwUJgOxRkJxGJUmQnAbZoyg+WN7YqTImZkyePJl5846e52fevHmMGzeuEVJkoiUhgkSDb6brNoykya+zvudVyDWvUdolh/H/u5A/zlsZ3oQaE6dGjx5N+/btj1r/zjvvcOaZZzJjxgwuvvhixo0bR9++ffnVr37VCKk0kZAQQSIsDdfdhvFdj0uh2zAAvpfTjVF9OwJwqKKSkkMV4UiqMQ12+ROFvLhkAwDllVVc/kQhr3zqzMZ5sKySy58o5PXPNwGwt7Scy58oZN7yzQDs2l/G5U8U8vZXWwHYti/0brU7duwgNTWV6v+7zz77jBdeeIFly5bxwgsvsGHDhpCPbWJHQgSJcEtPTebmM/oyorcTJJ56fx15DxSwfd+hRk6ZMbHjzTff5Oyzz65ZPuOMM2jTpg3p6en079+f9esDDgdk4khC3CcRaSN6d6DkUDmdWjUDYPu+QzXPjYm2F244teZ5anLSEcvN05KPWG6dnnrEcvuWaUcsd26VHnI6/vvf/3LrrbfWLDdrdvh/Ijk5mYoKK30nAitJeDCoW1t+eU4/AHaUHGLsgwVMX7D6yI02LIaFD9XffdbrdsbEMFXliy++YPDgwY2dFBNhVpIIUsu0FK4b1ZMzTsgEoPhgOWmbltB85kVQWeb0jrrmtZq2jSNsWOzchxFouw2Lne632aP8729MI5k0aRIFBQXs2LGDrKwsbr75ZoYMGWL3bzQBMRMkRKQX8Gugjape6q5rCfwVKAMKVPW5Rkwi4BTnbznzuJrlP81bSbcv/8UNlWWIVjoBYN1C/1/y6xY6r/vbrr4AYkwjmjlz5hHL99577xFdXydPnszkyZNrlufMmROtpJkIi2h1k4g8JSLbRGR5rfXjRORrEflWRG4HUNU1qnpdrUNcDLykqtcDF0QyraG6JCeLbkPOQpLTQJKpSk51SgL+uPdjIMk192PU8BdAjIlRd911F1dccUVjJ8NEQaRLEjOAx4Bnq1eISDLwOHAWUAR8LCKvqepXfvbPApa5zysjm9TQDO3eDrpfCAO7sn3ZO9y4qDmXbz2Gy7r52bjbMKeE4K9KqTqAVJckAgUaY4yJoogGCVVdICLZtVYPA75V1TUAIjILmAj4CxJFOIHiM2K9kb3bMFpmDmV0s7Wc6bZXbN1bSqv0FFqkpRyxnd9qpEABxNopjDGNqDHaJI4FfO+yKQKGi0gH4D5giIjcoap/AF4GHhOR84HX/R1MRKYAUwAyMzMpKCgIOWElJSUN2h9gYDJ8/vFGAB5cUsqe0ip+d1pzkjw38OXA6gOwuoDWxSsZ9PlvSKqqoCophc8H3cPeNv1oXbyStnuWs6ftiext08/TUcORt1iV6Hlr06YN+/bta+ykhF1lZWVC5gtiO2+lpaVB/b80RpDw922pqroT+HGtlfuBa+s6mKpOF5HNwIRWrVrlNGTy8XBPXp6RvYste0sZO7ArAN9s3UffzCDGmF+41GmjoIpkrWRo+/2Q3QKeyQ+6gTveJmYPRqLnLT09PebmJgiHWJxzIVxiOW/p6ekMGTLE8/aNUYVTBPjW2GcBmxpywFid4zo3uz3j3QBR8PU2znpkAe+u3Or9AP4auq2B2xgTRY0RJD4G+opITxFJA64AXmvIARs8wF8UnJzdnjvO7cfIPp0AWLO9hINl9bTFV7dTjP314RJDXT2kjIkBBQUFfPDBBw06RkZGRphSU7+CggLGjx/v97VJkyYxcOBAHnnkkailJ9ZEtLpJRGYCeUBHESkCpqrqkyJyE/AGkAw8papfNuQ8qvo68Hpubm7MDmzfslkKN5zeG4CqKmXKP5fSKaMZM6ecUveOtRu66+ohZUwMKCgoICMjgxEjRjR2UvyqrKwkOTm53u22bNnCBx984HcMqoqKClJSYuY2s4iKaElCVSep6jGqmqqqWar6pLt+rqoep6q9VfW+hp4nHkoSvpKShN9fdBI3n+HMyV1RWcXyjUGkvdswGPUL568N82H8CfN1ceGFF5KTk8OAAQOYPn16zfp58+YxdOhQBg0axBlnnMG6deuYNm0ajzzyCIMHD2bhwoVMnjyZl156qWaf6lJCSUkJZ5xxBkOHDuWkk07i1VdfrTMNf/rTn3j00UcB+PnPf87YsWMBZ7jyq6++GnBu+jvppJM48cQTue222444529/+1uGDx9OYWEh8+bNo1+/fowcOZKXX37Z7/nOPvtstm3bVpOPvLw87rzzTk4//XT+8pe/sH37di655BJOPvlkTj75ZBYtWgTAzp07mThxIkOGDOGGG26gR48e7Nixg3Xr1h0xadODDz5Ifn4+AKtXr2bcuHHk5OQwatQoVq50pimYPHkyP/3pTxkxYgS9evU64n3805/+xEknncSgQYO4/fbbWb16NUOHDq15/ZtvviEnJ6fO99QTVU2YR05OjjbEe++916D9Q/Xch+u1x21z9PMNu4Pb8buPVO/JVM1v5/z97qOAmzZW3qIh0fP21VdfBbdTENeFVzt37lRV1QMHDuiAAQN0x44dum3bNs3KytI1a9Ycsc3UqVP1gQceqNn3mmuu0RdffLFmuWXLlqqqumvXLi0uLlZV1e3bt2vv3r21qqrqiG18FRYW6qWXXqqqqiNHjtSTTz5Zy8rKND8/X6dNm6YbN27Ubt266bZt27S8vFzHjBmjr7zyiqqqAvrCCy+oqurBgwc1KytLV61apVVVVXrZZZfp+eeff9T51q5dqwMGDKhZPv300/XGG2+sWZ40aZIuXLhQVVXXr1+v/fr1U1XVm2++We+8805VVZ0zZ44Cun379qOO98ADD+jUqVNVVXXs2LG6atUqVVX98MMPdcyYMTXv3aWXXqqVlZX65Zdfau/evVVVde7cuXrqqafq/v37j3jv8/Ly9NNPP1VV1TvuuEMfffTRo/Ll73oClmiA79WEKC+JyARgQp8+fRo7KSG5YHBXFOWkY52G92VFxfTNzCA9tZ4icV3DfJimKwLXxaOPPsorr7wCwIYNG/jmm2/Yvn07o0ePpmfPngB+JyWqi6py5513smDBApKSkti4cSNbt26lS5cufrfPyclh6dKl7Nu3j2bNmjF06FCWLFnCwoULefTRR/n444/Jy8ujUyen3e+qq65iwYIFXHjhhSQnJ3PJJZcAsHLlSnr27Enfvn0BuPrqq48oHdXl8ssvr3n+9ttv89VXh2/v2rt3L/v27WPBggU8+6xz//D5559Pu3bt6jxmSUkJH3zwAZdddlnNukOHDk9LcOGFF5KUlET//v3ZunVrzbmvvfZaWrRoARx+73/0ox/x9NNP8/DDD/PCCy+weHHDS5IJESQ0Dtok6pLRLIWrhvcA4EBZBdc8vZiRfTry6KR6uqnVvku7eQenisHaKpq2MN+9X1BQwNtvv01hYSEtWrQgLy+P0tJSVNXTAH8pKSlUVVUBTmAoKysDYPbs2Wzfvp2lS5eS
mppKdnY2paWBJ0Gq3ubpp59mxIgRDBw4kPfee4/Vq1dzwgknsGrVqoD7pqenH9EOEerAhC1btqx5XlVVRWFhIc2bNz9qO3/H930fgJq8VlVV0bZtWz777DO/5/Qdgt350U/A9/6SSy7h7rvvZuzYseTk5NChQwePOQsstu9i9ije2iTq0iIthcevHMr/jHEaufeWlrOsKEC+fHs/jbsf5t0O797nDBRobRRNl79ecQ1QXFxMu3btaNGiBStXruTDDz8E4NRTT2X+/PmsXbsWgF27dgHQqlWrI24ky87OZunSpQC8+uqrlJeX1xy3c+fOpKam8t5773mapGj06NE8+OCDjB49mlGjRjFt2jQGDx6MiDB8+HDmz5/Pjh07qKysZObMmZx++ulHHaNfv36sXbuW1aud4f5rD17o1dlnn81jjz1Ws1z9JT969Ghmz54NOHNu7N69G3Bu9t22bRs7d+7k0KFDNYMgtm7dmp49e/Liiy8CTgD4/PPP6z33U089xYEDB4DD7316ejrnnHMON954I9deW+ctZp4lRJDQGL1PIlSn9u5Avy6tAfj7gjVc+NdFbNpz0P/G1Y3YB3fa/RPmMN/ODQ00btw4KioqGDhwIL/5zW845RSnR16nTp2YPn06F198MYMGDaqpipkwYQKvvPJKTYPv9ddfz/z58xk2bBgfffRRza/xyy+/nCVLlpCbm8tzzz1Hv371jx4watQoNm/ezKmnnkpmZibp6emMGuWUlI455hj+8Ic/MGbMGAYNGsTQoUOZOHHiUcdIT09n+vTpnH/++YwcOZIePXqE9L48+uijLFmyhIEDB9K/f3+mTZsGwNSpU1m0aBFDhw7lzTffpHv37oBTEqpuPB8/fvwR+X3uued48sknGTRoEAMGDKi3EX/cuHFccMEF5ObmMnjwYB588MGa16666ipE5IhZAxtCqosviSA3N1eXLFkS8v6xeOfu3tJyFq7awfkDjwHgg9U7GNq93dHtFbWHGh93vxM43KqnWMxbuCR63jIzMznhhBMaOylhF8t3JTeUb96ys7NZsmQJHTt2jMq5H3zwQYqLi7nnnnv8vr5ixYqjricRWaqquf62T4g2iUTWOj21JkBs21vK5Kc+5poRPfj1+f2P3ND3/onmHZyqJ9+hO4wxCe+iiy5i9erVvPvuu2E7ZkIEiXjv3eRV59bpPH3tyfTp7PQzL9p9gD0HyjnR7RVVc+PdwoeOrHr6/Hm676yAJeuc0kXzDkeUMowxkbNu3bqonau6B1o4JUSQiPfeTcE4rc/hIuuf3/6Gecu3UHjHWFqlpx7eyLd3S1IyfPo8PSvLYO0/ccZXVJAkSEqBIVdDl0EWPIwxfiVEkGiqfjO+PxMHd60JEG98uYW84zvRzLfqqbgIlj6DUN325P7VKieILHnKXe8GD8QJLOc9BLmTo5shE5DX7qbG1CWUNuiE6N2USF1gg9GmeSqj+jo3Dq3cspcb/rmUfxa63Qire7cMmgTJaVTVjNBe+281nyBSVQH/uRVmXQVzfm7daRtZeno6O3fuDOkf3JhqqsrOnTtJT08Par+EKEk0peqmQPp1ac2/rhtOTg/n7s7lG4tJThJOcEsV6959ll4Dcg9XK235DD59HirLgSoOlyRcWgkr3cnsP30OJs+xKqhGkpWVRVFREdu3b2/spIRVaWlp0F9Y8SJW85aenk5WVlZQ+yREkDCOkX0Pt1f8cd5K1mzfz/xf5pHSbRjf9ThAr9y8I3cYdOXh3lAHd0LpXih8DKoqOSJgVB5yeksN+YETXEq2Q0Znp5RigSPiUlNTa4a+SCQFBQVBTX4TTxIpbxYkEtRjk4aybud+UpKTqKpSFhaVM6KiirQUnxpGf/Nt9zsfPn8ePvkXVJUfXr9xqfPwZSUMYxJeQrRJmKO1aZHKoG5tAXj/2x08ubyMN7/aUv+O3YbB+D/DtXPh2HqGGa4uYVibhTEJy4JEEzD6uE7cMSyd8050bsp7/5sdrNyyt+6dug1z7tpOTqt7u41LYcZ4CxTGJKiEqG5qKjfTNcTx7ZNJShJUlXv/8xXN05J5+cYRdXer7DYMJv/HqX5CnPsptnwG6wth+8rD29kw5cYkrIQIEta7yTsRYdaUU9i1vwwR4WBZJS8t3cDlJ3c/sr2imr92iw2LYcb5TnAA58a85g0fktgYE3usuqkJatsijV6dnKE95i7bzG9e/ZJlwU6fOvk/TiO3JDvdZefcAtNG2n0VxiQYCxJN3MVDj2XOzSNr7q947fNNfLN1Xz174QSKY3OcO7cBUNiyzLmD+8lznBvxLFgYE/csSDRxIlIzQGBZRRW//88K/vzON952zh4Ffts0qpwb8Z4+zwKFMXHOgoSpkZaSxNyfjWLqeGcY8q17S3l60VrKK6v879BtGIz4aeADVpXDrEmwZEb4E2uMiYqYDhIi0l9EZovI30Tk0sZOT1PQvmUanVs7wwn836cb+cPclWwpDjzvMGfdDeP/4lQ9dTmJo8aE2r8D5vwMnj7XShXGxKGoBwkReUpEtonI8lrrx4nI1yLyrYjc7q4+F/hfVb0R+EG009rUTRndi//eMopu7VsAMGPRWr7d5qe9IncyXP8u/Ph950a8owYPBNZ/AE+eBY8Pt5KFMXGkMUoSM4BxvitEJBl4HCco9AcmiUh/4J/AFSLyAGB9LKNMROjt9oLac6CMR97+hheXFNW9U+5kN1AEuLS2r3RKFr/Pgj8PtAZuY2Jc1IOEqi4AdtVaPQz4VlXXqGoZMAuYqKrbVPUnwO3Ajign1fho2yKNd39xOjeNdW5YXL6xmGc+WEeFv/aK3Mlw3RvQY0TgA5btgz3rnQbuJ8+20oUxMUoaY4x6EckG5qjqie7ypcA4Vf2Ru/x9YDjwIHAn0BL4m6q+7+dYU4ApAJmZmTmzZs0KOV0lJSVkZGSEvH8sC3feZq0sY9HGcu4f3YKWqYHv2j5m0xv0XfU3n0mP/M9koUBlcks2dT2Htb2vCSot9rnFJ8tb7BgzZsxSVc3191qs3HHt71tGVXUdbgAIRFWni8hmYEKrVq1y8vLyQk5EQUEBDdk/loU7b6efrmwqLuXYts1RVR5+axUXDTm25ia9w/Jgw0XO0B5r5sOuNUcdS9xHUuV+emx4mR6pe+AH3ufqtc8tPlne4kOs9G4qArr5LGcBm7zurKqvq+qUNm3ahD1hxj8R4di2zQH4btcBnl60jkWrd/rfuHpk2Z9+6vSE6ng8NG8f+OBr3oW/DLG2CmNiQKwEiY+BviLSU0TSgCuA17zu3FSnL40VPTq0pOCXeUw62Ynz7329jX99uJ7KKj9VmbmT4abFcNtaJ2D4LUQCu9c4vaH+NtKChTGNqDG6wM4ECoHjRaRIRK5T1QrgJuANYAUwW1W/jHbaTOg6ZjQjJdm5nP7zxWae+WBd/XMy506G6950xoBKa+V/m63LnGDxuw7w+652v4UxURb1NglVnRRg/VxgbojHtFFgY8gDlw5k5/4yUpKTKKuo4ndzvuRHI3uR3bHl0Rt3GwZXPO88f/Yip6rJn6oKKKs
4fL9FUgp0OgHGPxy5jBhjYqa6ySQQEaFjRjMAvtxUzCufbGTdzv317/iDV+Ck73k7SVVFTSmj5+pnGpBaY0xdEiJIWJtE7BrSvR3v3zaWvOM7AzBr8Xc8/9F3gauiLvk7XPcWtO/l+RzdN7zs3Jxn91oYE3YJESSsd1Nsa9fy8BSob321lTe+3FL/jHjVPaHadIeU5tR7qZbtc+7k/lNva7MwJoxi5T6JBrHpS+PHP67JZd+hCgB27S/jnjlf8YuzjyOrXYujN86d7DyqvTUVljwNh/ZCoJvzDuxw2iwQ6DUmqPstjDFHs5KEiSoRoXV6KgDLNhbz9oqtHCir9LbzWXfDHd9B/h7oNbZmtf+KK3UawfPbwD2dnABjjAlaQgQJE59OP64TH95xBsdlOt1f//z2Kl5cssHbzj94xamOCtR11ldlGSz6swUMY0KQEEHCGq7jV8tmTo1nRWUVhbr/7yUAACAASURBVKt38nnRHu87506GO4tYddz/QFKqt31qAkZb+Lf1mDamPgkRJKy6Kf6lJCcxa8op3HW+Myvet9tKuOn5T9i6t44Jj1ybu54Dv93hdJ+VZI9nVFg22yldPDasASk3JrElRMO1SQwiQnqq8yW/YvNePl63i5SkOnpB1XbJ350H1H1jXm07vnaCBUBSGpz3wJEN5sY0YQlRkjCJZ8Kgriz41Rg6uDfl3fbSF7z2uecxH502i/xiOO0W54vfq6oypyvt3e3svgtj8BAkROTfInK+iMRsQLE2icTULMUpVZQcqmDVtn1s3nMw+IOcdTf8drsTMLzezQ2gVU6wyG8D92cHf15jEoSXL/6/AVcC34jI/SLSL8JpCpq1SSS2jGYp/PvHI7huZE8A3v9mBzc9/wm795cFd6BL/n64dIHXtgugdLcTLH6fFdz5jEkA9QYJVX1bVa8ChgLrgLdE5AMRuVZEPHYpMaZhkpKkZpTZDbsP8M3WElo0C+KL3tdZd0P+Lmf4j2ZB/LAo2+cEi/w28LuOoZ3bmDjjqeFaRDoAVwPfBz4FngNGAtcAeZFKnDH+TBrWnctyskhJTqKisoqHlpRyqNMWzhnQJbgDdRvm3JxX7bFhTiO2F1Xlhxu709vB7euCO7cxccJLm8TLwEKgBTBBVS9Q1RdU9WYgfiZxNQmlulSxa38ZJeXqf4KjYN202KmOatM9uP2qq6OsK61JQF5KEo+pqt++hIEmzjYmWjq3Tuc3p6Qz5kSnFDFr8XcUrtnJHy4+iRZpIfbw/vky5++GxfDkOMDjsCE1XWnFGTrEmATgpeH6BBFpW70gIu1E5H8imKagWe+mpi1JpGZU2X2lFezaX0Zz936LemfHq0u3YYfbLiSYgKPuECCdQz+3MTHCS5C4XlVrfhap6m4gpsYzsN5Nptr1o3vx7A+HISLsKy1n4uOLWPjN9oYdtNswmLrTqYrKLw5iCJBDbkN3u4ad35hG5CVIJInP4P8ikgwEcXeSMdFVfbnuLHG6yFaPOtugUoWv3+5wgkXXHI87VDnB4t5jwnN+Y6LIS5B4A5gtImeIyFhgJjAvsskypuGyO7bk1Z+cxqBuTm3pQ2+u4tbZn4WnkRtgyrtOsGiZ6W37igNOsLCBBU0c8VLRehtwA3AjzvwubwL/iGSijAkX3xnwUpKFZilJJLvjQVVVKUnBjA0VyC9XOX83LHYnPKrHstnOo9dYmxTJxDwvN9NVqerfVPVSVb1EVZ9QVY/dPYyJHbeceRy/v+gkADbsOsCZD89n6fpd4TtBt2HBVUNVT4pk062aGOblPonTROQtEVklImtEZK2IrIlG4kSku4i8JiJPicjt0TinSWzVJYuSQxW0b5nGMW2aA4SvCgoOV0P5zJ5XpyfPOnxjnjExxkubxJPAwzh3WJ8M5Lp/Q+J+4W8TkeW11o8Tka9F5FufgHAc8B9V/SHQP9RzGlPbCce05qUbR9C1rRMkfv7CZ/z6lWXhPUn1SLQpfubv9qd6yA9jYoiXIFGsqv9V1W2qurP60YBzzgDG+a5we0w9DpyLEwwmiUh/nCFArhCRd4H3GnBOYwKqqlK6tm1eEzDAmSkvbO7a7AQLryxYmBgi9XULFJH7cYbMfBk4VL1eVT8J+aQi2cAcVT3RXT4VyFfVc9zlO9xNy4HFqrpARF5S1Uv9HGsKMAUgMzMzZ9asWaEmi5KSEjIyEnOkEcubd9/urmTaF4f42dB0urUK/wj5IwsmkoTTC6Ra9XPf/0YFqoD3814NexpigV2TsWPMmDFLA42g4aV303D3r+8BFPBY4erJscAGn+Ui97zTgHwRuRJnBNqjqOp0YDpAbm6u5uXlhZyIgoICGrJ/LLO8edd2wx4G7F7FxWcPpWWzFMoqqkhLCWOwyHNLFX5KC7UDhwB5BROBJMjfHb40xAC7JuNDvUFCVcdEIR3++iGqqi4Hjio9HLWzyARgQp8+fcKeMNP0DO7Wlmd+6AzWV1WlXDG9kOG9OnDbuDBPpZIfOFgczb0hL5hqK2PCwEvvpkwReVJE/usu9xeR68KcjiKgm89yFhDEXJXGREZ5VRW52e05LtOpOlBVDlWEuQd4frEzPpQfR/16svYKE2VeytAzcO667uourwJuCXM6Pgb6ikhPEUkDrgBe87qzjd1kIqVZSjJ3nncCFw1xZqV77fNNnPnwfDbsOhDeE1XfY5HW6ojVAVsMLViYKPESJDqq6mycNjRUtQLPYycfTURmAoXA8SJSJCLXuce8CScYrQBmq+qXQRzTRoE1UdGldTo53dvV9IQ6UFYR3hPcWWQ9oUxM8dJwvd+dmU4BROQUIORvY1WdFGD9XGBuiMd8HXg9NzfXBsUxETW8VweG9+oAQGl5Jef8eQFXDuvBjXm9w3siN1BU5rfxNn1kdaCwNgsTZl5KErfiVP30FpFFwLPAzRFNVZCsJGEaQ0WVcnb/Lgzp7gwgeKiiMuztFe/nvWolC9OovIzd9AlwOjACZ6C/Aar6RaQTFgxrkzCNIaNZCr8Z359T3JLF3xes4ZxHFlB8sDz8Jwu2hGCjzZowqbckKyI/qLVqqIigqs9GKE3GxKXB3dqxt7SCNs2d+SuKD5bXPA+LoLrMcni0WauCMg3gpbrpZJ/HKCAfuCCCaQqaVTeZWDCyb0fuPO8EALbtLWXkH99l5uLvwn+i6hnyPG/fBu5uH/50mCbBS3XTzT6P64EhxNjMdFbdZGJNs9RkLs3J4lS3KmpfaTllFWEcDwqCCxZaaW0VJiShjDVwAOgb7oQYk0jaNE9l6oQBZHdsCcC9c1Yw4X/fpzycAwdWCyZYWMO2CZKXNonXOXxPTxLOKK2zI5moYNmwHCbWjTupC30zM0hNdn6Xbd93iE6tmoX3JMG0WdgQH8YjLyWJB4GH3McfgNGqGlMTAFl1k4l1Y47vzI9G9QLgq017Oe3+d/nvss2ROZmVKkwYeRngb340EmJMU3FMm3SuGdGDEb07Ak6pom2L1JpSRlgEW6rw3ccYH14G+NsnInv9PPaJyN5oJNKYRNKuZRq/Pr8/bVqkoqrc8sKnXDH9Q+qb2yUk+c
UgyR63tVKFOZqXny6PALfjzPmQBdwG3KuqrVS1dSQT55V1gTXx7Ien9eQHp/bAvf8o/IMHTt0VXBXUkhnhPb+Ja16CxDmq+ldV3aeqe1X1b8AlkU5YMKxNwsQrEeGMEzKZOPhYAN5ZsY28Bwv4cE1DZggOIL8YOh5f/3ZzfmalClPDS5CoFJGrRCRZRJJE5CoaMAqsMSawId3b8pMxfcjp0Q6A7Qeqwttt9qbFwZUqTJPnJUhcCXwP2Oo+LnPXGWPCrENGM2496zhSk5Mor6zioSWl/M9zIU8nH5jXeyusB1ST5+WO63WqOlFVO6pqJ1W9UFXXRSFtxjRpKUnC5f3SuHZENgDllVWs2V4S3pNYqcLUw0vvpuNE5B0RWe4uDxSRuyKfNO+s4dokIhFhSOcURvRxuso+9+F6zn5kAd9s3RfeEwVTqjBNjpfqpr8DdwDlAO4w4VdEMlHBsoZr0xSMH9SVO847gT6dnfm2v96yj4pwtldY9ZPxw0uQaKGqi2utC/OcjcaY+nTMaMZ1I3siIuwtLefy6YX85lXPs/x6k18MLTM9bGeBoqnwEiR2iEhvDk9feikQofEEjDFetGqWwv0XD2Sy215RfLCctTv2h+fgv1xl1U+mhpcg8RPgCaCfiGwEbgF+HNFUGWPqJCKMO7ELx3dpBcDj733LuD8vYPu+Q+E7iddAcX92+M5pYk6dQUJEkoBcVT0T6AT0U9WRqro+Kqkzxnjyo1E9+cPFJ9WMLLt0/W4qq8IwzIeXQFG620oVCazOIKGqVcBN7vP9qhrmbhV1E5FRIjJNRP4hIh9E89zGxJPOrdK5eGgWABt2HeDyJwr5yzvfhOfg1vupSfNS3fSWiPw/EekmIu2rH6GeUESeEpFt1V1qfdaPE5GvReRbEbkdQFUXquqPgTnAM6Ge05imJKtdcx6dNITvn9IDgKLdB1gXjvYKCxRNkpcg8UOcdokFwFL3saQB55wBjPNdISLJwOPAuTiTGk0Skf4+m1wJzGzAOY1pMkSE8046pqbq6f7/ruSivy6itDwMo+lYN9kmx8t8Ej3DeUJVXSAi2bVWDwO+VdU1ACIyC5gIfCUi3YFiVbVhyY0JwW/H9+fLTXtJT3WGDF+wajun9elIcpKEdkCvc1XY7HcJQQKNYS8iv1fVO93nZ6nqW2E7qRMk5qjqie7ypcA4Vf2Ru/x9YLiq3iQidwNvqKrfNgkRmQJMAcjMzMyZNWtWyOkqKSkhIyMj5P1jmeUtPoU7b2v2VPK7D0v5Qf80xnZPbfDxRhZMpHq2Ct+QU/2tUgm8n/eq333tc4sdY8aMWaqquf5eq6skMQ64033+RyBsQcIPfz9pFEBVp9a1o6pOF5HNwIRWrVrl5OXlhZyIgoICGrJ/LLO8xadw5+10Vbr22ULe8Z1JT03mq017aZWeQrf2LUI7YF6x3xJF9T90CpBXMBHadIefLztiG/vc4kMY50tskCKgm89yFrDJ6842LIcx3jj3VxxTU/X021eXc81Ti6lqSHdZL1VKxd9ZO0Wcqqsk0VlEbsX5UVD9vIaqPhzGdHwM9BWRnsBGnLGhPA9HLiITgAl9+vQJY5KMSXyPXTmUjXsOkpQkVFYp85ZvYdyJXYJvr7B2ioRVV0ni70ArIMPnue8jJCIyEygEjheRIhG5TlUrcO7HeANYAcxWVc+D0lhJwpjQdGmTXjPB0VtfbeEnz3/Cuyu3hX5A6yabcAKWJFT17kicUFUnBVg/F5gbyjGtJGFMw50zoAtPTc5lzPGdAShcvZOsds2Db6/I999OceQ2bSBAg7aJLbHSJtEgVpIwpuFEhLH9MhFxqp5uf/kLfvnS56EdzMNd2iMLJoZ2bBNVCREkbNIhY8IrOUmYef0p3HvhiQDsP1TB/326MfgG7joCRTLYjXdxICGChJUkjAm/rm2b06ez0/z470+KuOWFz/hqcwj3tFo7RVwL2CZRuzdTbWHu3dQg1iZhTGRdPbwH/bq05sRjnS/z/y7bzMBubTm2bXNvB/DTTnFU/ynr+RST6ipJVPdiygVuBI51Hz/GGV8pZlhJwpjISkoShvV0xvU8UFbBHa8s46E3vg7uILUCgN+KKytRxJyAQUJV73Z7OHUEhqrqL1T1F0AOzs1uxpgmqEVaCv/56ShuO7cfAJv2HOSVT4u8tVfkF0PXnHq2sUARS7y0SXQHynyWy4DsiKQmRNZwbUx0Hdu2OZmt0wGYufg7bvv3MrbuK/W285R3Ib+YOsektUARM7wEiX8Ci0UkX0SmAh8RY3M7WHWTMY3n52cex8s3juCYNk77xPMffcfm4oP17hdo4L8aFihiQr1BQlXvA64FdgN7gGtV9Q+RTpgxJj4kJUlNg/a2vaX8bs6XPPfhd952rq+h2gJFo6tzPgl3jusv3CG9P4lOkowx8apz63Te+vnptGuZBsDyjcWs3l7CBYO6IhJgPKj67tC2Xk+Nyssc15+7E//ELGuTMCZ2dGvfgoxmzu/P5z5azz1zVrC/rJ5Z8axEEbO8tEkcA3wpIu+IyGvVj0gnLBjWJmFMbLr3wpOYfcMpZDRLQVV57N1v2Lo3QAO3BYqYVO/0pUBEBvozxiS+5CShVydnhrZVW0t49J1v6ZDRjEnDAlROWNVTzPHScD0fWMnhm+tWuOuMMcaz47u04p1fnM73cp35xb7YXsGcLzZx1BTKVqKIKfUGCRH5HrAYuAz4HvCROye1McYEpVv7FjUTGr23oYLH3v2WSn834VmgiBle2iR+DZysqteo6g+AYcBvIpssY0yiu3lIM2ZcO4yU5CQOVVTy+7kr2ObbXmGBIiZ4CRJJquo7VdVOj/tFjfVuMib+JInQpY1z1/Yn6/fw9KK1fL1135EbWaBodF6+7OeJyBsiMllEJgP/IcQZ5CLFejcZE99O7d2B928by6i+nQCY/fEG5i7b7LRXWKBoVF4arn8JTAcGAoOA6ap6W6QTZoxpWqrHglJVXly6gX8vLTp8A54FikZT13wStwCLgE9V9d/Av6OWKmNMkyXizIq3r7QCgJ0lh/jz29/w019so9NDnQPvaN1jI6KukkQW8Bdgm4gUiMjvReR8EWkfpbQZY5qolOSkmqE9Fq/dxUtLiyg+WG4likZQ13wS/09VRwBdgDuBXcAPgeUi8lWU0meMaeLOPekYCu8YS5/Ozk15j41eQkVdO1igCCsvDdfNgdZAG/exCWe48IgTkSQRuU9E/ldEronGOY0xsadtC6dUUVZRxX+WbeHeoR/UvUN+GwsWYRIwSIjIdBFZBLwAnAp8AFymqrmqem2oJxSRp0Rkm4gsr7V+nIh8LSLfisjt7uqJOFOmlgNFoZ7TGJMY0lKSeP2m0/jVuOMhv5gKnGlQA86JZ4GiweoqSXQHmgFbgI04X9J7wnDOGcA43xUikgw8DpyLM3/2JBHpDxwPFKrqrTjzbBtjmriU5CRapDl9bl46b1ndM9yBBYoGkqPGTfF90el/NgAY4T5OxGmbKFTVqSGfVCQbmOPOU4GInArkq+o57vId7qYbgDJVnS0iL6jq5X6ONQWYA
pCZmZkza9asUJNFSUkJGRkZIe8fyyxv8cnyVr/SCuXM9y8mmaqaddUzV1R/u1XiYSa8MIq3z23MmDFLVTXX32t1BomajUSygNNwAsV4oIOqtg01QX6CxKXAOFX9kbv8fWA48Cvgf4EDwEpVfbyu4+bm5uqSJUtCTRYFBQXk5eWFvH8ss7zFJ8ubdxX3ZZFUvg/hcJA4SpS6yMbb5yYiAYNEXfdJ/BQnKJyG0yawCCgEngKWhTuNftapqh4Arqt3Z5EJwIQ+ffqEOVnGmHiR8usiyiurkHvakYJTijjqi8XupQhaXW0S2cBLwDBV7aWq31fVv6rq5+6MdeFUBHTzWc7C6UVljDGepSYnkeITBPzWk1gbRVDquk/iVlV9SVU3RyEdHwN9RaSniKQBVwCeZ7+zsZuMMUfIL/ZfPVHzun1XeBX10VxFZCZOtdXxIlIkItepagVwE/AGsAKYrapfBnFMGwXWGHMkn0BRHRyOCBwWKDyJepBQ1Umqeoyqpqpqlqo+6a6fq6rHqWpvVb0vyGNaScIYczS36kk4suqp+t6KCgsU9YqpeSFCZSUJY0xAPoGiWvXzZLASRT0SIkhYScIYUyc/PZqqu8pWlyiWrt8d7VTFhYQIEsYYU686ur4mA4Oezo5aUuJJQgQJq24yxngSoEQB7k1j+W144I2V3DPnK7zcaNwUJESQsOomY4xndQQKgJ8XDmf/oYqaWfGaerBIiCBhjDFBqaPqKQX4w7JRAKzfuZ8Jj73Pl5uabi1FQgQJq24yxgStjkAhAPlt2FFSRmUVdMxoBjTNUkVCBAmrbjLGhKSecZxyns5m7k9Hktk6HYCfzfqMB9/4OhopixkJESSMMSZk9QQKudsZ8LqisooWacmkpx7+2qyqSvyShQUJY4ypb2TY/DakJCdx/yUD+ckYZ7TpxWt3cd6jC1m7Y38UEth4EiJIWJuEMabBPAQKoKbXU3llFa2bp9LFrYqqTNBSRUIECWuTMMaEhcdAAXBan47MvuFUmqclU1WlfO+JQqbNXx3hBEZfQgQJY4wJmyACRbXSikp6d2pZU6qoUk2YkoUFCWOMqS3IQNEiLYU/XTqIC4ccC0DhpgrG/+/7bNtXGqkURo0FCWOM8SeEEkW1lqlCz44t6NjSub+irCLck3lGT0IECWu4NsZERIiBYnDnFP56VQ5JScKBsgrOemQ+//pwfQQSGHkJESSs4doYEzENKFGAU4rI7dGe47u0qlmOp/aKhAgSxhgTUQ0IFG1bpPHQ9wZxcnZ7AP5WsJoLHnufA2UV4UxhxFiQMMYYLxpYoqh2XGYGJ2e3p0VaCgAHyyobmrKIsiBhjDFehSFQnHvSMeRfMACArXtLGXH/O7z2+aZwpC4iLEgYY0wwwlSiABCBsf0yGZzljA+1/1BFzI0HFdNBQkTyRGShiEwTkbzGTo8xxgD1BoqRBRM9HaZzq3Qe+t4gundoAcDU177ksicKY6phO+pBQkSeEpFtIrK81vpxIvK1iHwrIre7qxUoAdKBomin1RhjAqpnzuxgShTVRvXtyNn9M0lOcsaHKj5YHmLiwqcxShIzgHG+K0QkGXgcOBfoD0wSkf7AQlU9F7gNuDvK6TTGmLqFseoJYOLgY7nh9N4ALN9YzCm/f4f5q7aHmrqwiHqQUNUFwK5aq4cB36rqGlUtA2YBE1W1+jbF3UCzKCbTGGO8qWfO7FBKFAAdMtK4cMixDOnutFfs3l/WKO0V0hjT8YlINjBHVU90ly8Fxqnqj9zl7wPDgXeBc4C2wN9UtcDPsaYAUwAyMzNzZs2aFXK6SkpKyMjICHn/WGZ5i0+Wt/gxsmCiU83E4SBR/e1aCbyf92rIx1ZV7l9cSrMU4dac9Aak0r8xY8YsVdVcf6+lhP1soRE/61RVXwZermtHVZ0uIpuBCa1atcrJy8sLOREFBQU0ZP9YZnmLT5a3OJJXXFNqUJwvteovthQgr2Bi/dVTAagqu1pvJDlJyBtyLKrK9pJDdG4V/oBRW6z0bioCuvksZwGeOw7bsBzGmJgQ5jaKaiLCJTlZNaPMvvHlVkb/6T0+37AnpOMFI1aCxMdAXxHpKSJpwBXAa153tgH+jDExI7+YOu+hDjFQ+Drx2NZcPbwHA7q2BmBLcSmRajpojC6wM4FC4HgRKRKR61S1ArgJeANYAcxW1S+9HtNKEsaYWFJv+0MDA0VWuxbcNb4/KclJHKqo5IrphXwaoVJFY/RumqSqx6hqqqpmqeqT7vq5qnqcqvZW1fuCOaaVJIwxMSdCVU+1pSYlcc+FJzK0e7uwHK+2WKluahArSRhjYlIUAkVSkjCqb6cGHyfg8SN25CiykoQxJmZFqUQRKQkRJKwkYYyJaXEcKBIiSFhJwhgT8+I0UCREkLCShDEmLsRhoEiIIGGMMXEjzgJFQgQJq24yxsSVOAoUCREkrLrJGBN34iRQJESQMMaYuBQHgcKChDHGNKYYDxQJESSsTcIYE9e8BIpGChYJESSsTcIYE/e8zDXRCIEiIYKEMcYkhBgMFBYkjDEmlsRYoLAgYYwxsSaGAoUFCWOMiUX5xTHR8ykhgoT1bjLGJCwvpYoISoggYb2bjDEJra5AEeHSREIECWOMSXiNFCgsSBhjTCKIUKCwIGGMMfGiERqyLUgYY0w8iXKgiPkgISItRWSpiIxv7LQYY0xMiGKPp6gHCRF5SkS2icjyWuvHicjXIvKtiNzu89JtwOzoptIYY2JclAJFY5QkZgDjfFeISDLwOHAu0B+YJCL9ReRM4Ctga7QTaYwxMc9foAhz8BBVDesBPZ1UJBuYo6onusunAvmqeo67fIe7aQbQEidwHAQuUtWqWseaAkwByMzMzJk1a1bI6SopKSEjIyPk/WOZ5S0+Wd7iU7zlbcyYMUtVNdffaynRTkwAxwIbfJaLgOGqehOAiEwGdtQOEACqOh2YDpCbm6t5eXkhJ6KgoICG7B/LLG/xyfIWnxIpb7ESJMTPupoijqrOqHNnkQnAhD59+oQ5WcYY07TFSu+mIqCbz3IWsKmR0mKMMcYVK0HiY6CviPQUkTTgCuA1rzvb2E3GGBMZjdEFdiZQCBwvIkUicp2qVgA3AW8AK4DZqvplEMe0UWCNMSYCot4moaqTAqyfC8wN8ZivA6/n5uZe35C0GWOMOVKjdIGNFBHZDqx3F9sAvkUL3+VAzzsCOxqQhNrnDGU7f6/Vt66p5M13OZx5C5SOYLYJZ958n8dq3vytD/Z/zvJWt4b+zwWTt7aq2snv0VU1IR/A9EDLdTxfEs5zhrKdv9fqW9dU8ua7HM68ec1ftPJWK58xmbf60l9XXi1v4ctbMHnwmrfaj1hpuI6E1+tYDvQ83OcMZTt/r9W3rqnkzXc5nHnzerxo5c1reryKRN78rY/V/7lEzltd24WatyMkVHVTQ4nIEg1w12G8s7zFJ8tbfEqkvCVySSIU0xs7ARFkeYtPlrf4lDB5s5KEMcaYgKwkYYwxJiALEsYYYwKyIGGMMSYgCxIBuNOmPiMifxeRqxo7PeEkIr1E5EkReamx0xIJInKh
+7m9KiJnN3Z6wklEThCRaSLykojc2NjpCbdEna5YRPJEZKH72eU1dnqC0aSCRJBTp14MvKSq1wMXRD2xQQomb6q6RlWva5yUhibI/P2f+7lNBi5vhOQGJci8rVDVHwPfA2K+i2UiT1ccZN4UKAHScUa9jh8NvSswnh7AaGAosNxnXTKwGugFpAGf48yEdwcw2N3m+cZOezjz5vP6S42d7gjn7yFgaGOnPdx5w/nR8gFwZWOnPZx5A87EGQF6MjC+sdMe5rwlua9nAs81dtqDeTSpkoSqLgB21Vo9DPhWnV/XZcAsYCJOtM9yt4n59ynIvMWdYPInjj8C/1XVT6Kd1mAF+9mp6muqOgKI+WrQIPM2BjgFuBK4XkRi+v8umLzp4Vk1dwPNopjMBouVmekak9+pU4FHgcdE5HzCPwREtPjNm4h0AO4DhojIHar6h0ZJXcMF+uxuxvlV2kZE+qjqtMZIXAMF+uzycKpCmxHiqMkxIOTpiuNAoM/tYuAcoC3wWGMkLFQWJAJMnaqq+4Fro52YMAuUt53Aj6OdmAgIlL9HcYJ8PAuUtwKgILpJCbsGTVcc4wJ9bi8DL0c7MeEQ08W5KEnkqVMTOW+Q2PmzvMWnhMubBYkGTp0a4xI5b5DY+bO8xaeEy1uTChKRmDo1ViRy3iCx82d5s7zFMhvgzxhjTEBNqiRhjDEmOBYkjDHGBGRBwhhjTEAWJIwxxgRkQcIYY0xAFiSMMcYEXmguFgAABBRJREFUZEHCxDwReUREbvFZfkNE/uGz/JCI3NqA4+eLyP/zur6eYxWISIOH8HaPs8RnOVdEChp6XPdYk0UkrsYPMo3HgoSJBx8AIwDckUE7AgN8Xh8BLPJyIBFJDnvqIqeziJzb2ImoLc7eQ9NAFiRMPFiEGyRwgsNyYJ+ItBORZsAJwKfuEOEPiMhyEVkmIpdDzaxg74nI88Ayd92v3Ylh3gaOry8B7i/7P4rIYhFZJSKj3PXNRWSWiHwhIi8AzX32OVtECkXkExF5UUQyRKSNe97j3W1misj1AU77AHCXn7QcURIQkTnu6LCISImbzqUi8raIDHPTvkZEfCfP6iYi89y0TPU51tVuHj8TkSeqA4J73N+JyEfAqfW9XyZxWJAwMU9VNwEVItIdJ1gUAtVfVrnAF+7Y/RcDg4FBOEOFPyAix7iHGQb8WlX7i0gOzpg6Q9x9TvaYlBRVHQbcAlR/sd4IHFDVgTjDr+cAiEhHnC/4M1V1KLAEuFVVi3GGbZghIlcA7VT17wHOVwgcEpExHtMH0BIoUNUcYB9wL3AWcBHwO5/thuHMRzEYuMytzjoBZya/01R1MFDJ4TkrWuJMrjNcVd8PIj0mztlQ4SZeVJcmRgAP44zbPwIoxqmOAhgJzFTVSmCriMzHCQB7gcWqutbdbhTwiqoeABARrwOwVQ/1vBTIdp+Pxh2WXFW/EJEv3PWn4MxItkhEwJmlrNDd7i0RuQx4HCeg1eVenGBzm8c0lgHz3OfLgEOqWi4iy3zSDPCWO2Q8IvIyzntXgRPkPnbT3BzY5m5fCfzbYxpMArEgYeJFdbvESTjVTRuAX+AEgKfcbfyN5V9tf63lUAYtO+T+reTI/x1/xxKcL+JJR73gtKucABwE2lPHnMeq+q6I3IMTdKpVcGQtQLrP83I9PCBbVXWaVbVKROpKs7ppfkZV7/CTlFI3+JomxqqbTLxYBIwHdqlqparuwpnl61TcX+jAAuByEUkWkU44v/IX+znWAuAitz2hFTChAelagFslIyInAgPd9R8Cp4lIH/e1FiJynPvaz3FGCJ0EPCUiqfWc4z7gVz7L64DBIpIkIt1wqo6CdZaItBeR5sCFOO/vO8ClItLZTXN7EekRwrFNArGShIkXy3B6NT1fa12Gqu5wl1/BCRqf4/wy/pWqbhGRfr4HUtVP3Ebmz4D1wMIGpOtvwNNuNdNnuEFJVbeLMw3nTLdxHeAutxrnR8AwVd0nIgtwqpOmHnXkw+mdKyLbfVYtAtbi5H85EMo83u8D/wT6AM+r6hIAEbkLeNMt7ZQDP8F5j0wTZUOFG2OMCciqm4wxxgRkQcIYY0xAFiSMMcYEZEHCGGNMQBYkjDHGBGRBwhhjTEAWJIwxxgRkQcIYY0xA/x+78F4ngMSLEQAAAABJRU5ErkJggg==\n",
"text/plain": [
" \n",
" \n",
"\n",
"A generalization of Laplace's approach is called [additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing): we add a **pseudocount** (which does not have to be 1) to each of the outcomes we have seen so far, plus one more for the unseen outcomes. We will redefine `Bag` and `P1w` to use additive smoothing. This time, I can't have the `ProbabilityFunction` as a mixin, because I need to supply the psuedocount as an argument when the bag is constructed, and then use it within the probability function. Thus, I will redefine `Bag` as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"class Bag(Counter): \n",
" \"\"\"A bag of words with a probability function using additive smoothing.\"\"\"\n",
" def __init__(self, elements=(), pseudocount=1):\n",
" self.update(elements)\n",
" self.pseudocount = pseudocount\n",
"\n",
" def __call__(self, outcome):\n",
" \"\"\"The probability of `outcome`, smoothed by adding pseudocount to each outcome.\"\"\"\n",
" if not hasattr(self, 'denominator'):\n",
" self.denominator = sum(self.values()) + self.pseudocount * (len(self) + 1)\n",
" return (self[outcome] + self.pseudocount) / self.denominator"
]
},
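{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols: with pseudocount $k$, $V$ distinct words seen, and $N$ total tokens, this `Bag` computes\n",
"\n",
"$P(w) = \\frac{\\mathrm{count}(w) + k}{N + k\\,(V + 1)}$\n",
"\n",
"where the $+1$ in the denominator treats all unseen words as one extra outcome.\n"
]
},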
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"P1w = Bag(P1w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now an unknown word has a non-zero probability, as do sequences that contain unknown words, but known words still have higher probabilities:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.7003201005861308e-12"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P1w('neverbeforeseen')"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'this is the oligonucleotide test': 4.287512523438411e-14,\n",
" 'this is the neverbeforeseen test': 8.033021050553851e-20,\n",
" 'this is the zqbhjhsyefvvjqc test': 8.033021050553851e-20}"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{test: Pwords2(tokens(test)) for test in tests}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many more advanced ways to do smoothing, but we won't cover them here.\n",
"\n",
"There is one more issue to contend with that we will mention but not fix: the smallest positive 64-bit floating point number is about $10^{-323}$. That means that if we are trying to compute the probability of a long sequence of words, we will eventually reach a point where we get **floating point underflow** and the probability is incorrectly reported as zero. We see that this happens somewhere around a 100 word sequence, depending on the exact words:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{70: 1.1733110491887828e-219,\n",
" 80: 2.279383032662303e-258,\n",
" 90: 1.5643367062438602e-271,\n",
" 100: 7.120951572374323e-301,\n",
" 110: 1.67177e-319,\n",
" 120: 0.0,\n",
" 130: 0.0,\n",
" 140: 0.0,\n",
" 150: 0.0}"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{n: Pwords(sample(WORDS, n)) for n in range(70, 151, 10)}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This problem can be addressed by adding logarithms rather than multiplying probabilities, or by re-scaling numbers when they get too small."
]
},
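{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a log-space version of `Pwords` could look like the sketch below (the name `logPwords` is mine; it assumes `Pword` as defined earlier). Since log is monotonic, `max(candidates, key=logPwords)` picks the same winner as `Pwords`, without underflow:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import log, inf\n",
"\n",
"def logPwords(words) -> float:\n",
"    \"Log-probability of a sequence of words: add logs instead of multiplying probabilities.\"\n",
"    return sum((log(Pword(w)) if Pword(w) > 0 else -inf)   # log(0) is undefined; use -inf\n",
"               for w in words)"
]
},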
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task: Secret Codes\n",
"\n",
"Let's tackle one more task: decoding secret messages. We'll start with the simplest of codes, a rotation cipher, sometimes called a shift cipher or a Caesar cipher (because this was state-of-the-art crypotgraphy in 100 BC). First, a method to encode:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"def rot(msg, n=13): \n",
" \"Encode a message with a rotation (Caesar) cipher.\" \n",
" return encode(msg, alphabet[n:]+alphabet[:n])\n",
"\n",
"def encode(msg, key): \n",
" \"Encode a message with a substitution cipher.\" \n",
" table = str.maketrans(upperlower(alphabet), upperlower(key))\n",
" return msg.translate(table) \n",
"\n",
"def upperlower(text): return text.upper() + text.lower() "
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Uijt jt b tfdsfu nfttbhf.'"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"msg = 'This is a secret message.'\n",
"\n",
"rot(msg, 1)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Guvf vf n frperg zrffntr.'"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rot(msg, 13)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'This is a secret message.'"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rot(rot(msg, 13), 13)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Decoding a Caesar cipher message is easy: try all 26 candidates, and find the one with the maximum `Pwords`:"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"def decode_rot(secret):\n",
" \"Decode a secret message that has been encoded with a rotation cipher.\"\n",
" candidates = [tokens(rot(secret, i)) for i in range(26)]\n",
" return sentence(max(candidates, key=Pwords))"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'this is a secret message'"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"secret = rot(msg, 17)\n",
"\n",
"decode_rot(secret)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make it a bit harder. When the secret message contains separate words, it is too easy to decode by guessing that the one-letter words are most likely \"I\" or \"a\". So what if the encode routine squished all the letters together:"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"def squish_rot(msg, n=13):\n",
" \"Encode a message with a rotation (Caesar) cipher, keeping letters only.\" \n",
" return encode(cat(tokens(msg)), alphabet[n:]+alphabet[:n])"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'pahdghplmaxtglpxkmablmbfxtgrhgxtgrhgxunxeexk'"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"secret = squish_rot('Who knows the answer this time? Anyone? Anyone? Bueller?', 19)\n",
"secret"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That looks harder to decode. A decoder will have to try all rotations, then segment each rotation, and find the candidate with the best `Pwords`:"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"def decode_rot(secret):\n",
" \"\"\"Decode a secret message that has been encoded with a rotation cipher,\n",
" and which has had all the non-letters squeezed out.\"\"\"\n",
" candidates = [segment(rot(secret, i)) for i in range(26)]\n",
" return max(candidates, key=lambda msg: Pwords(msg))"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'who knows the answer this time anyone anyone bu e ll er'"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentence(decode_rot(secret))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost right, but didn't quite get the proper name.\n",
"\n",
"What about a general substitution cipher? The problem is that there are $26! \\approx 10^{26}$ substitution ciphers, and we can't enumerate all of them. We would need to search through the space: initially make some guess at a substitution, then swap two letters; if that looks better keep going, if not try something else. This approach solves most substitution cipher problems, although it can take a few minutes on a message of length 100 words or so.\n",
"\n",
"# What's Next?\n",
"\n",
"What to do next? Here are some options:\n",
" \n",
"- **Spelling correction**: Use bigram or trigram context; make a model of spelling errors/edit distance; go beyond edit distance 2; make it more efficient\n",
"- **Evaluation**: Make a serious test suite; search for best parameters (e.g. $c_1, c_2, c_3$)\n",
"- **Smoothing**: Implement Kneser-Ney and/or Interpolation; do letter *n*-gram-based smoothing\n",
"- **Secret Codes**: Implement a search over substitution ciphers\n",
"- **Classification**: Given a corpus of texts, each with a classification label, write a classifier that will take a new text and return a label. Examples: spam/no-spam; favorable/unfavorable; what author am I most like; reading level.\n",
"- **Clustering**: Group data by similarity. Find synonyms/related words.\n",
"- **Parsing**: Representing nested structures rather than linear sequences of words. relations between parts of the structure. Implicit missing bits. Inducing a grammar.\n",
"- **Meaning**: What semantic relations are meant by the syntactic relations?\n",
"- **Translation**: Using examples to transform one language into another.\n",
"- **Question Answering**: Using examples to transfer a question into an answer, either by retrieving a passage, or by synthesizing one.\n",
"- **Speech**: Dealing with analog audio signals rather than discrete sequences of characters."
]
}
,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Theory: Bigram Word Sequence Probabilities\n",
"\n",
"We can do better than the bag of words model with a **bigram** model, in which each word's probability is conditioned on the previous word (using a special start-of-sentence token `<S>` as the \"previous word\" for the first word of a sentence):\n",
" \n",
"$P(w_1 \\ldots w_n) = P(w_1 \\mid \\mbox{}) \\times P(w_2 \\mid w_1) \\times \\ldots P(w_n \\mid w_{n-1}) = \\Pi_{i=1}^{n} P(w_i \\mid w_{i-1})$\n",
"\n",
"The conditional probability of each word given the previous word is given by:\n",
"\n",
"$P(w_i \\mid w_{i-1}) = \\frac{P(w_{i-1}w_i)}{P(w_{i-1})} $\n",
"\n",
"That is, the bigram probability of the word and the previous word, divided by the unigram probability of the previous word. \n",
"\n",
"The bigram model can be thought of as a \"multiple bags of words\" model, where there are multiple bags, one for each distinct word, and a special bag marked ``. (We assume that each sentence boundary in the text is also annotated with a ``.) To build a model, take a text,\n",
"cut it up into slips of paper with two words on them, put each slip into the bag labelled with the first word on the slip. To generate language from the model, first pick a slip from the bag labeled `` and output the second word on the slip. Now pick a slip from the bag labeled with that second word, and output the second word on *that* slip. Continue in this manner.\n",
"\n",
"\n",
"The file `count_2w.txt` has bigram data in the form `\"word1 word2 \\t count\"`. We'll load this into the bag `P2w`."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"P2w = load_counts(open('count_2w.txt'))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(258437, 225.955251755)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(P2w), sum(P2w.values())/1e9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Over a quarter million distinct two-word pairs, with a total count of 225 billion. Let's see the most common two-word sequences:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('of the', 2772205934),\n",
" ('in the', 1735111785),\n",
" ('to the', 1147345124),\n",
" ('on the', 841095584),\n",
" ('for the', 726714015),\n",
" ('and the', 644282998),\n",
" ('to be', 513603140),\n",
" ('with the', 480181610),\n",
" ('is a', 479118206),\n",
" ('from the', 454051070),\n",
" ('at the', 449390032),\n",
" ('by the', 428337874),\n",
" ('do not', 400755693),\n",
" ('it is', 391172862),\n",
" ('of a', 387456600),\n",
" ('in a', 387077847),\n",
" ('will be', 357181303),\n",
" ('if you', 350999373),\n",
" ('that the', 337117243),\n",
" ('is the', 313123377)]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P2w.most_common(20)"
]
},
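{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"multiple bags\" story is easy to simulate. Here is a sketch (my code, not from this notebook; it scans every bigram key on each step, so it is slow but simple): repeatedly draw the next word from the bigrams that start with the current word, weighted by count:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"def generate(first='the', n=10):\n",
"    \"Random walk over bigrams: repeatedly sample a next word given the current one.\"\n",
"    words = [first]\n",
"    for _ in range(n - 1):\n",
"        nexts = [b.split()[1] for b in P2w if b.startswith(words[-1] + ' ')]\n",
"        if not nexts:\n",
"            break\n",
"        weights = [P2w[words[-1] + ' ' + w] for w in nexts]\n",
"        words.append(random.choices(nexts, weights)[0])\n",
"    return ' '.join(words)"
]
},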
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use `Pwords2` for the probability of a word sequence in the bigram model, and `cPword` for the conditional probability of a word given the previous word. \n",
"\n",
"But we run into a problem: in `cPword` it is natural to divide by the probability of the previous word, but we don't want to divide by zero in the case where we haven't seen that word before. So we will use a **backoff model** that says if we don't have counts for both words, we'll fall back to the unigram probability of the word (ignoring the previous word)."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"def Pwords2(words, prev='') -> float:\n",
" \"\"\"The probability of a sequence of words, using bigram data, given previous word.\"\"\"\n",
" return Π(cPword(w, (prev if (i == 0) else words[i-1]) )\n",
" for (i, w) in enumerate(words))\n",
"\n",
"def cPword(word, prev) -> float:\n",
" \"\"\"Conditional probability of word, given previous word.\"\"\"\n",
" bigram = prev + ' ' + word\n",
" if P2w(bigram) > 0 and P1w(prev) > 0:\n",
" return P2w(bigram) / P1w(prev)\n",
" else: # Average the back-off value and zero.\n",
" return P1w(word) "
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.012268827179134844"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P2w('of the')"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.575887217554441e-06"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P2w('the of')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll look at all 24 permutations of the words in \"this is a test\" and compute the probability for each under the bigram model. We see that \"this is a test\" is the most probable permutation, and \"this a is test\" is the least probable, almost a million times less probable: "
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(3.472808344777707e-07, 'this is a test'),\n",
" (1.127597078828602e-07, 'test this is a'),\n",
" (6.965779368966287e-08, 'this test is a'),\n",
" (3.156249379346401e-08, 'a test this is'),\n",
" (2.2987584450049622e-08, 'is a test this'),\n",
" (1.5938698734133065e-08, 'is a this test'),\n",
" (1.340683369945385e-08, 'test is a this'),\n",
" (1.306996404175386e-08, 'a test is this'),\n",
" (4.2119901013695514e-09, 'a this is test'),\n",
" (4.0586505617016795e-09, 'a this test is'),\n",
" (2.225067906668548e-09, 'test a this is'),\n",
" (2.2250679066685475e-09, 'this is test a'),\n",
" (1.7086438151095857e-09, 'is this test a'),\n",
" (9.36681886482603e-10, 'this a test is'),\n",
" (7.464591940646502e-10, 'is this a test'),\n",
" (6.790746260498466e-10, 'test is this a'),\n",
" (9.442288029367219e-11, 'is test a this'),\n",
" (7.210056993667689e-11, 'a is this test'),\n",
" (6.959024970571533e-11, 'is test this a'),\n",
" (1.0936134003445812e-11, 'this test a is'),\n",
" (7.330795983439265e-12, 'test a is this'),\n",
" (6.215026001209826e-12, 'a is test this'),\n",
" (1.551284601334948e-12, 'test this a is'),\n",
" (9.945186564405085e-13, 'this a is test')]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted((Pwords2(t), sentence(t)) for t in permutations(tokens('this is a test')))[::-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the bag of words model, all permutations have the same probability (except for roundoff error in the least-significant digit):"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(2.9841954327098435e-11, ('this', 'test', 'is', 'a')),\n",
" (2.9841954327098435e-11, ('this', 'test', 'a', 'is')),\n",
" (2.9841954327098435e-11, ('this', 'is', 'test', 'a')),\n",
" (2.9841954327098435e-11, ('test', 'this', 'is', 'a')),\n",
" (2.9841954327098435e-11, ('test', 'this', 'a', 'is')),\n",
" (2.9841954327098435e-11, ('test', 'is', 'this', 'a')),\n",
" (2.9841954327098435e-11, ('test', 'a', 'this', 'is')),\n",
" (2.9841954327098435e-11, ('is', 'this', 'test', 'a')),\n",
" (2.9841954327098435e-11, ('is', 'test', 'this', 'a')),\n",
" (2.9841954327098435e-11, ('a', 'test', 'this', 'is')),\n",
" (2.984195432709843e-11, ('this', 'is', 'a', 'test')),\n",
" (2.984195432709843e-11, ('this', 'a', 'test', 'is')),\n",
" (2.984195432709843e-11, ('this', 'a', 'is', 'test')),\n",
" (2.984195432709843e-11, ('test', 'is', 'a', 'this')),\n",
" (2.984195432709843e-11, ('test', 'a', 'is', 'this')),\n",
" (2.984195432709843e-11, ('is', 'this', 'a', 'test')),\n",
" (2.984195432709843e-11, ('is', 'test', 'a', 'this')),\n",
" (2.984195432709843e-11, ('is', 'a', 'this', 'test')),\n",
" (2.984195432709843e-11, ('is', 'a', 'test', 'this')),\n",
" (2.984195432709843e-11, ('a', 'this', 'test', 'is')),\n",
" (2.984195432709843e-11, ('a', 'this', 'is', 'test')),\n",
" (2.984195432709843e-11, ('a', 'test', 'is', 'this')),\n",
" (2.984195432709843e-11, ('a', 'is', 'this', 'test')),\n",
" (2.984195432709843e-11, ('a', 'is', 'test', 'this'))]"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted((Pwords(t), t) for t in permutations(tokens('this is a test')))[::-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task: Segmentation with a Bigram Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's re-do segmentation using a bigram model. To make `segment2`, we copy `segment`, and make sure to pass around the previous token, and to evaluate probabilities with `Pwords2` instead of `Pwords`."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"@lru_cache(None)\n",
"def segment2(text, prev=''): \n",
" \"Return best segmentation of text; use bigram data.\" \n",
" if not text: \n",
" return []\n",
" else:\n",
" candidates = ([first] + segment2(rest, first) \n",
" for (first, rest) in splits(text, 1))\n",
" return max(candidates, key=lambda words: Pwords2(words, prev))\n",
" \n",
"def splits(text, start=0, end=20) -> Tuple[str, str]:\n",
" \"\"\"Return a list of all (first, rest) pairs; start <= len(first) <= L.\"\"\"\n",
" return [(text[:i], text[i:]) \n",
" for i in range(start, min(len(text), end)+1)]"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['choose', 'spain']"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segment2('choosespain')"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['speed', 'of', 'art']"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segment2('speedofart')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the following example, both `segment` and `segment2` perform perfectly:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('a dry bare sandy hole with nothing in it to sit down on or to eat',\n",
" 'a dry bare sandy hole with nothing in it to sit down on or to eat')"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tolkein = 'adrybaresandyholewithnothinginittositdownonortoeat'\n",
"sentence(segment(tolkein)), sentence(segment2(tolkein))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But for the next example, `segment` gets three words wrong, while `segment2` only misses one word, `unregarded`, which is not in `P1w`."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'far out in the un chart ed back waters of the un fashionable end of the western spiral arm of the galaxy lies a small un regarded yellow sun'"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adams = ('faroutintheunchartedbackwatersoftheunfashionableendofthewesternspiral' +\n",
" 'armofthegalaxyliesasmallunregardedyellowsun')\n",
"sentence(segment(adams))"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the galaxy lies a small un regarded yellow sun'"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentence(segment2(adams))"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'unregarded' in P1w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conclusion? The bigram model is a little better, but these examples haven't demonstrated a large difference. Even with hundreds of billions of words as data, we still don't have a complete model of English words."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Theory: Evaluation\n",
"\n",
"So far, we've got an intuitive feel for how this all works. But we don't have any solid metrics that quantify the results. Without metrics, we can't say if we are doing well, nor if a change is an improvement. In general,\n",
"when developing a program that relies on data to help make\n",
"predictions, it is good practice to divide your data into three sets:\n",
"\n",
"
\n",
"\n",
"For this program, the training data is the word frequency BAG, the development set is the examples like `\"choosespain\"` that we have been playing with, and now we need a test set."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"def test_segmenter(segmenter, tests):\n",
" \"Try segmenter on tests; report failures; return fraction correct.\"\n",
" return sum([test_one_segment(segmenter, test) \n",
" for test in tests]), len(tests)\n",
"\n",
"def test_one_segment(segmenter, test):\n",
" words = tokens(test)\n",
" result = segmenter(cat(words))\n",
" correct = (result == words)\n",
" if not correct:\n",
" print('expected:', sentence(words))\n",
" print(' got:', sentence(result))\n",
" return correct\n",
"\n",
"cat = ''.join\n",
"\n",
"proverbs = (\"\"\"A little knowledge is a dangerous thing\n",
" A man who is his own lawyer has a fool for his client\n",
" All work and no play makes Jack a dull boy\n",
" Better to remain silent and be thought a fool that to speak and remove all doubt;\n",
" Do unto others as you would have them do to you\n",
" Early to bed and early to rise, makes a man healthy, wealthy and wise\n",
" Fools rush in where angels fear to tread\n",
" Genius is one percent inspiration, ninety-nine percent perspiration\n",
" If you lie down with dogs, you will get up with fleas\n",
" Lightning never strikes twice in the same place\n",
" Power corrupts; absolute power corrupts absolutely\n",
" Here today, gone tomorrow\n",
" See no evil, hear no evil, speak no evil\n",
" Sticks and stones may break my bones, but words will never hurt me\n",
" Take care of the pence and the pounds will take care of themselves\n",
" Take care of the sense and the sounds will take care of themselves\n",
" The bigger they are, the harder they fall\n",
" The grass is always greener on the other side of the fence\n",
" The more things change, the more they stay the same\n",
" Those who do not learn from history are doomed to repeat it\"\"\"\n",
" .splitlines())"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"expected: power corrupts absolute power corrupts absolutely\n",
" got: power corrupt s absolute power corrupt s absolutely\n",
"expected: the grass is always greener on the other side of the fence\n",
" got: the grass is always green er on the other side of the fence\n"
]
},
{
"data": {
"text/plain": [
"(18, 20)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_segmenter(segment, proverbs)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(20, 20)"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_segmenter(segment2, proverbs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This confirms that both segmenters are good, and that `segment2` is slightly better, getting 20 out of 20 examples right, compared to 18 out of 20 for `segment`. There is much more that can be done in terms of the variety of tests, and in measuring statistical significance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Theory and Practice: Smoothing\n",
"\n",
"Here are some more test cases:\n"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'this is the oligonucleotide test': 4.287506893716667e-14,\n",
" 'this is the neverbeforeseen test': 0.0,\n",
" 'this is the zqbhjhsyefvvjqc test': 0.0}"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tests = ['this is the oligonucleotide test',\n",
" 'this is the neverbeforeseen test',\n",
" 'this is the zqbhjhsyefvvjqc test']\n",
"\n",
"{test: Pwords2(tokens(test)) for test in tests}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The issue here is the finality of a probability of zero. Out of the three 15-letter words, it turns out that \"oligonucleotide\" is in the dictionary, but if it hadn't been, if somehow our corpus of words had missed it, then the probability of that whole phrase would have been zero. It seems that is too strict; there must be some \"real\" words that are not in our dictionary, so we shouldn't give them probability zero. There is also a question of likelyhood of being a \"real\" word. It does seem that \"neverbeforeseen\" is more English-like than \"zqbhjhsyefvvjqc\", and so perhaps should have a higher probability.\n",
"\n",
"We can address this by assigning a non-zero probability to words that are not in the dictionary. This is even more important when it comes to multi-word phrases (such as bigrams), because many legitimate bigrams will not appear in the training data.\n",
"\n",
"We can think of our probability model as being overly spiky; it has a spike of probability mass wherever a word or phrase occurs in the corpus. What we would like to do is *smooth* over those spikes so that we get a model that does not depend so much on the details of our corpus. \n",
"\n",
"For example, Laplace was asked for an estimate of the probability of the sun rising tomorrow. From past data, he knows that the sun has risen $n/n$ times for the last *n* days (with some uncertainty about the value of *n*), so the maximum liklihood estimator is probability 1. But Laplace wanted to balance the observed data with the possibility that tomorrow, either the sun will rise or it won't, so he came up with an estimate of $(n + 1) / (n + 2)$.\n",
"\n",
" \n",
"
What we know is little, and what we are ignorant of is immense.
— Pierre Simon Laplace (1749-1827)\n",
"