{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
Peter Norvig
2014
Python 3, 2020
\n", "\n", "
\n", "

How To Do Things With Words (And Counters)\n", "
or\n", "
Statistical Natural Language Processing in Python\n", "
or\n", "
Everything I Needed to Know I learned From Sesame Street

\n", "
\n", "
\n", "
One, two, three, ah, ah, ah! — The Count\n", "
\n", "
\n", " \n", "This notebook will look at data, models, and tasks involving stastical natural language processing.\n", " \n", "First some boring preliminaries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "import math\n", "import random\n", "import matplotlib.pyplot as plt\n", "from collections import Counter\n", "from itertools import permutations\n", "from functools import lru_cache\n", "from typing import List, Tuple, Set, Dict, Callable\n", "\n", "Word = str # We implement words as strings\n", "cat = ''.join # Function to concatenate strings together" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data: Text and Words\n", "\n", "Before we can do things with words, we need some words. I happen to have a big text called [big.txt](big.txt). We can read it, and see how big it is (in characters):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6488409" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TEXT = open('big.txt').read()\n", "len(TEXT)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Over six million characters.\n", "\n", "Now break the characters into words (or more formal-sounding: *tokens*). We'll ignore capitalization and punctuation and anything but the 26 letters (other tokenizers *do* deal with such things). The function `tokens` turns text into a list of words, and `sentence` puts tokens back together into a string:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def tokens(text) -> List[Word]:\n", " \"\"\"List all the word tokens (consecutive letters) in a text. Normalize to lowercase.\"\"\"\n", " return re.findall('[a-z]+', text.lower()) \n", "\n", "sentence = ' '.join # Function to join words with spaces" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['this', 'is', 'a', 'test']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens('This is: \"A test\". 😋')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'this is a test'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence(tokens('This is: \"A test\". 😋'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many words are in the text?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1105211" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "WORDS = tokens(TEXT)\n", "len(WORDS)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Over a million words. Here are the first 10:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'the project gutenberg ebook of the adventures of sherlock holmes'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence(WORDS[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model: The Bag of Words Model\n", "\n", "`WORDS` is a list of the words in `TEXT`, and with a little work it can also serve as a *generative model* of text. We know that language is very complicated, but we can create a simplified model of language that captures part of the complexity. \n", "\n", "In the so-called [**bag of words** model](https://en.wikipedia.org/wiki/Bag-of-words_model), we ignore the order of words, but maintain their frequency. Think of it this way: take all the words from the text, and throw them into a bag. Shake the bag, and then generate a sentence by pulling words out of the bag one at a time. Chances are the \"sentence\" won't be grammatical or sensible, but it will have words in roughly the right proportions. Here we generate a \"sentence\" with this approach:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def sample(words, n=10) -> str:\n", " \"\"\"Sample n random words from a list of words.\"\"\"\n", " return [random.choice(words) for _ in range(n)]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sore the toes debt my after interference struggle the moment'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence(sample(WORDS))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`WORDS` is a million elements list, and has a lot of repetition. For example:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "80029" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "WORDS.count('the')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more compact representation for a bag of words is a `collections.Counter`: a dictionary of `{'word': count}` entries (with some handy extra [methods](https://docs.python.org/3/library/collections.html#collections.Counter)). We could use `Counter` directly, but I will define `Bag` as a subclass of Counter (note that [bag](https://en.wikipedia.org/wiki/Multiset) is considered a synonym of [multiset](https://en.wikipedia.org/wiki/Multiset)):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "class Bag(Counter): \"\"\"A bag of words.\"\"\"" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Bag({'is': 3, 'this': 1, 'a': 3, 'test': 3, 'it': 2})" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Bag(tokens('Is this a test? It is a test! A test it is!'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a `Bag` of `WORDS` and get a feel for what's there:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "29152" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "BAG = Bag(WORDS)\n", "\n", "len(BAG) # Number of different words" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 80029),\n", " ('of', 40025),\n", " ('and', 38312),\n", " ('to', 28766),\n", " ('in', 22047),\n", " ('a', 21155),\n", " ('that', 12512),\n", " ('he', 12401),\n", " ('was', 11410),\n", " ('it', 10681),\n", " ('his', 10034),\n", " ('is', 9774),\n", " ('with', 9740),\n", " ('as', 8064),\n", " ('i', 7682),\n", " ('had', 7383),\n", " ('for', 6938),\n", " ('at', 6791),\n", " ('by', 6738),\n", " ('on', 6643)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "BAG.most_common(20) # Most common words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model: Zipf's Law\n", "\n", "In 1935, linguist George Zipf observed that in any big text, the *n*th most frequent word appears with a frequency of roughly 1/*n* of the most frequent word. He get's credit for [*Zipf's Law*](https://en.wikipedia.org/wiki/Zipf%27s_law), even though [Felix Auerbach](https://en.wikipedia.org/wiki/Felix_Auerbach) made a similar observation in 1913. If we plot the frequency of words, most common first, on a log-log plot, they should come out as a straight line if Zipf's Law holds. Here we see that it is a fairly close fit:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEaCAYAAAD+E0veAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXhU1fnA8e+bjQCJ7EQkkbBvsggBlAIGFIsV3K0iLliXaqt16eJSrbHW1l+lWq22iqK4IIJWqyDiHkBAWdxAUWQVEIGgBgKELPP+/rh3hkmYmUyWycwk7+d55mHunXvvvHNymXfOOfeeI6qKMcYYA5AQ7QCMMcbEDksKxhhjfCwpGGOM8bGkYIwxxseSgjHGGB9LCsYYY3wsKZigRCRDRBaKyF4R+UeE3mOTiJwUiWNHkoj8RUQKROS7aMcSr0QkT0Seraf3isp55v8ZReRoESkSkcT6jqM6GmVScE+QA+4fyPs4KtpxxaArgQLgCFX9bW0PJiLTReQvtQ8rskRERaRbiNezgN8CfVT1yPqLrPbq84s4XojIaBF5T0QKRWRTkG2Gi8iS2ryPqn6jqmmqWl6b40Rao0wKrgnuH8j7+LbyBiKSFI3AYkgn4Au1Oxwr6wTsVtWdgV6086YiccTyd80+4Ang9yG2+Rkwr37CiTJVbXQPYBNwUoD12YAClwHfAAvd9ccBS4AfgU+BXL99OgMLgL3AW8BDwLPua7nA1mDvjZOUbwbWA7uB2UDrSrFc4sZSAPzR7ziJwK3uvnuBlUAW8DDwj0rvOQe4PkhZDAeWA4Xuv8Pd9dOBUqAEKApSXtPd93vNjeFDoGuQ97my0vHm+JXH74DP3BhmAalBjjEZWAzc7/4tNrjxTwa2ADuBS/y2bwE8DewCNgO3AQnua93cv1uhW7az3PUL3XLf58Z5XqUYTgIOAB739enxct4A49zyL3Vj/zRAGV/q/du4y+uA2X7LW4CBoc4d97V84G7373XALe+gnzlAHK2Aue7f7gf3eWal49/lHn8v8CbQ1u/1i9y/+W7gjwT5Px/gb7spyGsfAYPc5wpcBXztxvYwIEH2y/P7u3r/Nklhfoag509Evx/r401i7RHsBPH7oz0NNAeaAh3dE+tn7n/Gse5yO3efpcB9QBNglPvHDfc/9/XAB0Cmu/+jwMxKsTzmxjEAOAj0dl//PbAK6AmI+3obYCjwLYe+/NoC+4GMAJ+3tXtSXwQkARPd5Tbu69OBv4Qox+nA9+57JgEzgOer2P4vAcpjGXCUG88a4Kog+08GynC+uBKBv+B88T3slt/Jbvmnuds/DbwCpLvluRa4zH1tJs6XRQKQCozwex8FuoX4HBX+rnF23uQR5IvYfb0LzpdQAtAB54t1m99rP7ivVXXu5Lt/m77u68mhPnOAONoAZwPN3L/fC8D//F7Px0mKPdzPmQ/c477WByfpjXLf6z6c86ZGScEth224X/xu+c4FWgJH4ySucUGO6StvAieFYJ8h5PkT0e/HSL9BLD7c/2BF7sn/o/dk8/ujdfHb9ibgmUr7v4HzS+xo92Rr7vfac4T/n3sNcGKlk6/U/U/kjcX/19Ey4Hz3+VfA6UE+3xpgrPv8GmBekO0uApZVWrcUmOw+n07VSeFxv+WfAV9WsX2gpHCh3/LfgUeC7D8Z+NpvuZ9bRhl+63YDA3GSxkGcdn/va78E8t3nTwNT/cvXb7uaJoV4OG/yCJEU3G22AIOA890yWgb0wknGr4Z57uQDf/Z7LeRnDuP/7EDgB7/lfOA2v+VfAfPd53/C78cJTqIuoeZJ4TJgWqXzw/9HxGzg5iDH9JU3gZNCsM8Q9PwJp7xq84jldr5IO0NVW7qPMyq9tsXveSfgXBH50fsARuD8RzwK50Td57f95mrE0Al42e+4a4ByIMNvG/+rW/YDae7zLJxfGYE8BVzoPr8QeCbIdkcFiHczzq+UcAWMT0Ru9evEf6Qmxwhih9/zAwCqWnldGk4NKYWKn8//s/0Bp4a1TEQ+F5FfVBFjOOLhvAnHApzENMp9ng+c4D4WuNuEc+74l0e1PrOINBORR0Vks4jswWnWa1npyp1gn/Eo//d233N3sPcKQ6D+hNqUbzjHCXX+RFRjTgqhqN/zLTgZu6Xfo7mq3gNsB1qJSHO/7Y/2e74Pp/oLgHtCt6t07FMqHTtVVbeFEeMWoGuQ154FTheRAUBv4H9BtvsW5+TzdzROVblWVPWveqgT/yrv6toetxoKcH49+38+32dT1e9U9QpVPQqnBvHvUFcchSkezptw/gbepDDSfb6Aw5NCOOeO/3tV9Zkr+y1O0+gwVT0CJ0GBk8irsh3nR5Ozg0gznOaoahORZJzP/VZN9q+FUOdPRFlSqNqzwAQR+amIJIpIqojkikimqm4GVgB3ikiKiIwAJvjtuxZIFZFT3ZPrNpw2Tq9HgLtFpBOAiLQTkdPDjOtx4C4R6e5e3dFfRNoAqOpWnI6/Z4D/quqBIMeYB/QQkQtEJElEzsNpj50bZgzVtQOnXTri1LnsbzZO+aa7ZXwjzt8TETlXRDLdzX/A+QLzXipYF3HG6nmzA8iu4mqgBcBooKl7Li3C6aRuA3zsblOtcyeMz1xZOk6t70cRaQ3cEebnA3gRGC8iI0QkBfgzIb7rRCRBRFJx+j3E/VuluC+PBD5T1T3VeP+6EPT8ifQbW1KogqpuAU7HudJnF04G/z2Hyu4CYBhOh+sdOG3V3n0LcdoJH8f5BbUP2Op3+AeAV4E3RWQvTufhsDBDuw/nS+9NYA8wDaezyuspnDb3YE1HqOpuYDzOr7LdOE0q41W1IMwYqmsa0MetDgervdSla3HKfAPwPk4b9hPua0OAD0WkCOdvcJ2qbnRfywOecuP8eU3eOIbPmxfcf3eLyEdBYl+L0+e2yF3eg1OGi91kW9NzJ+hnDuCfOOdzAc7nmx/Oh3Nj+xz4Nc7feztO0t8aYpdROAloHk7t5QDO/yuI0qWoVZ0/IvJIGM2yNeLtTTd1RETycDopL6xq2wjHMQrn10a2qnqiGYupWqycN6YiEfkCOEdVv4h2LPXFagoNkNvkcB3OlUGWEIypAbcJ6enGlBDAkkKDIyK9cS6z7YBTBTfG1ICqltRHx26sseYjY4wxPlZTMMYY42NJwRhjjE9cj+bYtm1bzc7OrtG++/bto3nz5lVv2EhYeRzOyqQiK4+K4rk8Vq5cWaCq7QK9FtdJITs7mxUrVtRo3/z8fHJzc+s2oDhm5XE4K5OKrDwqiufyEJGgQ4zETPORe7feIvemjNxox2OMMY1RRJOCiDwhIjtFZHWl9eNE5CsRWSciN7urFecuylRC331ojDEmQiJdU5iOM2aKjzu418PAKThjpUwUkT7AIlU9BWfI2DsjHJcxxpgAItqnoKoLRSS70uqhwDpV3QAgIs/jzAvgvWvwByoO/mVMo1JaWsrWrVspLi6OdigVtGjRgjVr1kQ7jJgRD+WRmppKZmYmycnJYe8TjY7mjlQcZ30rMExEzgJ+ijOb0UPBdhaRK3GmdiQjI4P8/PwaBVFUVFTjfRsiK4/DRatM0tLSyMjIoGPHjoiEM1J0/SgvLycxMbHqDRuJWC8PVaWwsJBPP/2UoqKisPeLRlIIdJarqr4EvFTVzqo6VUS2AxPS09MH16j3f8syNrz7Il0GXAxZQ6u/fwMUz1dSREq0ymTNmjVkZmbGVEIA2Lt3L+np6dEOI2bEQ3mkp6dTVFRETk5O2PtE4+qjrfhNgIEzz+y31TmAqs5R1StbtGhR/XffsgyeOo3sjTPgqdOcZWNiTKwlBBOfanIeRSMpLAe6i0hndxTC83HGhg+biEwQkamFhYXVf/dNi9DyEhLwUF52EDYtqv4xjGngfvGLX9C+fXuOOeaYCuuXLl3KFVdcEaWoTH2I9CWpM3Em8+4pIltF5DJVLcOZTP4NnLllZ7uTYoStVjWF7JGQmEI5CWhCCmSPRFU5UFJe9b7GNBKTJ09m/vzD57WZP38+48aNC7CHaSgimhRUdaKqdlDVZFXNVNVp7vp5qtpDVbuq6t2RjOEwWUORS15lc+dJJF06B7KG8tqq7Zz4j3w2795X9f7GNAKjRo2idevWh61/5513OOmkk5g+fTpnnXUW48aNo3v37vzhD3+IQpQmEmLmjubqqFXzEUDWUL7pdI6vk/molk05rksbMls5c6UXl1qtwcSO8x5dygsrnAv2Sss9nPfoUl7+2Lm/80BJOec9upQ5nzrdcnuKSznv0aXMX70dgO/3lXDeo0t5+4sdAOzcW/PLXHfv3k1ycjLeGvonn3zCrFmzWLVqFbNmzWLLli1VHMHEg7hMCrVqPgpg0NGtuO+8gSQmCMWl5ZzywCIeX7ShTo5tTEPxzjvvcPLJJ/uWTzzxRFq0aEFqaip9+vRh8+agw+mYOBKXA+KJyARgQrdu3er82KXlHo7v2oY+Rx0BQFm5hwQREhLsahATHbN+ebzveXJiQoXlpimJFZaPSE2usNy6eUqF5fbpqTWO46233uKmm27yLTdpcuge08TERMrKymp8bBM7rKZQSXpqMn89sx/Du7YFYOqiDZz1nyXsO2gnvGm8VJXPP/+cgQMHRjsUE2FxWVOoTx1bNqV3hyNo3sQpqrJyD0mJcZlLjQnbxIkTyc/Pp6CggMzMTK699lr69+9v9080AvbtVoXTB3bkb2f1A5xOuhPuzeedNTuiHJUxkTVz5ky2b9/uG4eptLSUk046yff65MmTeeihQ6PRzJ071+6IbyDisqYQyT6FUA6WeuiekUbnts5sS+UeJdH6GkwjcNttt7F3795oh2HqQVzWFCLZpxBKVutmTL90KF3apQFw+yuruXH2J6hqvcZhjDGREpc1hVigqrRPb8LBMo/TzrplGZ6Ni0joPDK8Qfa2LHOG2MgOc3tjjKkHlhRqSES4/qQezsKWZXimT0DLS/AkppAweU7oL3p3UD7KSyAxBS551RKDMSYmxGXzUa3vaK5rmxYhnhIS8SCeUmfQvVBNSpsWOQlBy51/bVA+Y0yMiMukEK0+haCyRyKJTUASkcQUtNMIrnh6JY8uWB90exJTQBKdf7NH1m+8xhgTRFwmhZiTNdRpAhrzR7jkVQ52yKFZSiJNkg4Vb4WaQ6XtrenIxLP8/HyWLFlSq2OkpaXVUTRVy8/PZ/z48QFfmzhxIv379+f++++vt3hijfUp1JWsob4v91TgwYnH+hLBu1/u4In3N3H/eQNpl97ksO0Dso5oEyfy8/NJS0tj+PDh0Q4loHCnzfzuu+9YsmRJwDGcysrKSEpqHF+XVlOIIO/dn3uLy9hfUkaLpmFOnu3tiH73bpsdzoRnyzJY9I86O1fOOOMMBg8eTN++fZk6dapv/fz58xk0aBADBgzgxBNPZNOmTTzyyCPcf//9DBw4kEWLFjF58mRefPFF3z7eWkBRUREnnngigwYNol+/frzyyishY/j73//Ogw8+CMANN9zAmDFjAGdgvgsvvBBwbrLr168fxxxzTIVxmdLS0vjTn/7EsGHDWLp0KfPnz6dXr16MGDGCl14KPOvvySefzM6dO32fIzc3l1tvvZUTTjiBBx54gF27dnH22WczZMgQhgwZwgcffAA4o8eefPLJHHvssfzyl7+kU6dOFBQUsGnTpgqTFE2ZMoW8vDwA1q9fz7hx4xg8eDAjR47kyy+/BJybAn/zm98wfPhwunTpUqEc//73v9OvXz8GDBjAzTffzPr16xk0aJDv9a+//prBgweHLNOwqGrcPYAJwNRu3bppTb333ns13rcmPB6PqqqWlJXrxKlL9Y3V24NvvHCKal4r1TuOcP5dOCXi8dV3ecSDaJXJF198Ub0dvvlQ9a4M51y5K8NZrqXdu3erqur+/fu1b9++WlBQoBs2bNDMzEzdsGFDhW3uuOMOvffee337XnLJJfrCCy/4lps3b66qqqWlpVpYWKiqqrt27dKuXbv6/l94t/G3dOlSPeecc1RVdcSIETpkyBAtKSnRvLw8feSRR3Tbtm2alZWlO3fu1NLSUh09erS+/PLLqqoK6KxZs1RV9cCBA5qZmalr165Vj8ej5557rp566qmHvd/GjRu1b9++vuUTTjhBr776at/yxIkTddGiRaqqunnzZu3Ro4eqql577bV65513qqrq3LlzFdBdu3Yddrx7771X77jjDlVVHTNmjK5du1ZVVT/44AMdPXq0r+zOOeccLS8v188//1y7du2qqqrz5s3T448/Xvft21eh7HNzc/Xjjz9WVdVbbrlFH3zwwcM+V6DzCVihQb5f47I+pKpzgDk5OTlxMy+gt9awu6iEfSXloe+E9nZEey9ZtY5oE0qgq9lq2eT44IMP8vLLLwOwZcsWvv76a7755htGjRpF586dAQJOwhOKqnLrrbeycOFCEhIS2LZtGzt27ODII48MuP3gwYNZuXIle/fupUmTJgwaNIgVK1awaNEiHnzwQZYvX05ubi7t2rUDYNKkSSxcuJAzzjiDxMREzj77bAC+/PJLOnfuTPfu3QG48MILK9R+QjnvvPN8z99++22++OIL3/LevXvZu3cvCxcu9NU+Tj31VFq1ahXymEVFRSxZsoRzzz3Xt+7gwYO+52eccQYJCQn06dOHHTt2+N770ksvpVkzZ84Xb9lffvnlPPnkk9x3333MmjWLZctqX1OMy6QQz45skcrLVw/3DcX9zAebWbN9D3dM6EOTJLfd09sRbX0KJhx1/CMiPz+ft99+m6VLl9KsWTNyc3MpLi5GVcMaEC8pKQmPxwM4iaCkpASAGTNmsGvXLlauXElycjLZ2dkUFwef9Me7zZNPPsnw4cPp378/7733HuvXr6d3796sXbs26L6pqakV+hFqOpBf8+bNfc89Hg9Lly6ladOmgJMU0tPTgx7fvxwA32f1eDy0bNmSTz75JOB7+g9Jrm6/ZLCyP/vss7nzzjsZM2YMgwcPpk2bNtX9iIexPoUo8J+bYdfeg2z94QAplUdezRoKI38bOCHUcfuxiXN1fDVbYWEhrVq1olmzZnz55Ze+tvOhQ4eyYMECNm7cCMD3338PQHp6eoVxkbKzs1m5ciUAr7zyCqWlpb7jtm/fnuTkZN57772wJuUZNWoUU6ZMYdSoUYwcOZJHHnmEgQMHIiIMGzaMBQsWUFBQQHl5OTNnzuSEE0447Bi9evVi48aNrF/vXCI+c+bMGpXLySefXGEQwM8++8wX44wZMwB4/fXX+eGHHwDIyMhg586d7N69m4MHDzJ37lwAjjjiCDp37swLL7wAOF/4n376aZXv/cQTT7B//37gUNmnpqby05/+lKuvvppLL720Rp+rMksKUXbj2B48OXkIIkLh/lIufPxDVm8LcVNesE5oSxSNW6gfEdU0btw4ysrK6N+/P7fffjvHHXccAG3btmXq1KmcddZZDBgwwNe0MmHCBF5++WVfB+0VV1zBggULGDp0KB9++KHv1/akSZNYsWIFOTk5zJgxg169elUZy8iRI9m+fTvHH388GRkZpKamMnKkUxPq0KEDf/vb3xg9ejQDBgxg0KBBnH766YcdIzU1lalTp3LqqacyYsQIOnXqVKNyefDBB1mxYgX9+/enT58+PPHEEwDccccdLFy4kEGDBvHmm29y9NFHA05Nx9vZPX78+Aqfd8aMGUybNo0BAwbQt2/fKjvdx40bx2mnnUZOTg4DBw5kypQpvtcmTZqEiFSYFa82xFs9iUc5OTm6YsWKGu2bn58fc0P9frb1R3414yOmXpTjm/ntMIv+4SQELXdufhvzR6e5oJbDZsRieURbtMpkzZo19O7du97ftyr+zSUmeHlkZ2ezYsUK2rZtWy9xTJkyhcLCQu66666Arwc6n0RkparmBNre+hRiSP/MluT/Ltc3ic/9b60lPTWJy0d2ObRRoPbjCHQ0GmNi35lnnsn69et599136+yYcZkUojWfQn3wJgRV5cvv9tCqWUrFDYJ1QtvVSsbEhE2bNtXbe3mvEKtLcZkU4vGS1OoSER69KIeSMufqhY0F+5jy5lf8aXwfMirfDW1XKxlj6khcJoXGJMUdP2nN9j0s2/g9Qa+sq2rYDBNXwr3805hQatJnbFcfxYmf9evAoj+Mpn16KgB5r35uc0U3UKmpqezevdtm9DO1oqrs3r2b1NTUau1nNYU4kprs3Iyzp7iUJesLaJfehBN7Z0Q5KlPXMjMz2bp1K7t27Yp2KBUUFxdX+wumIYuH8khNTSUzM7Na+1hSiENHpCYz7zcj8bg/JD/YsJv3vtzJdSd1p1mK/UnjXXJysm8oiViSn5/PscceG+0wYkZDLQ9rPopTSYkJvv6G5Ru/57VV2xEqtUFXvqHNbnAzxlTBflY2ANee2J1LR3SmaUoi5R7l9ldWc0V2AZ1fm3joMtVx98D8m21eaGNMSDFVUxCR5iKyUkQCT4tkgkpr4uT3Tbv3MX/1dxz4ekHFG9rWvFJx+dPnrNZgjDlMRJOCiDwhIjtFZHWl9eNE5CsRWSciN/u9dBMwO5IxNXRd26WR//tceh93CiSm4CGRMknG0+u0Q/NCJyTCx88dGj9pxXSO3vyiJQhjTMSbj6YDDwFPe1eISCLwMDAW2AosF5FXgaOAL3BmszS1cERqMhw9DC55lTn/m8VST2/+OngyHNnXucGtcCusfMqpNZQdhHm/pbPHA0+96DQzHdh96K7oTYugaZtD66zJyZgGLaJJQVUXikh2pdVDgXWqugFARJ4HTgfSgOZAH+CAiMxTVQ+m5rKGcto1QzjhQCkJCcK+9oOY8klzru3xPa0/mek0I4mAehA8vgSBqlObQKC8FPCAJEBiE+uLMKaBi0ZHc0dgi9/yVmCYql4DICKTgYJgCUFErgSuBGe88vz8/BoFUVRUVON949Vnu8p45uODdCxLZVC/PFr+uJrS5HS6rXsc0TIQQTzlCIqn3IOgCKCAqAdP2UE2vfs033TaH+2PUi8a4zkSipVHRQ21PKKRFALdu++7dVNVp4faWVWnAlPBGTq7pkMbN8ahonOBC045SNu0JsCJPPfhN3Rq04yeI89kw7tP06Vvju8KpQS/moK4NYWExCZ0GXMxXRpJTaExniOhWHlU1FDLIxpJYSuQ5becCXxbnQM05FFSI81JCFBW7mH6ko30PPIIfjJxKN902k+XnFzI6HNoYD2o2KfQtI2zvOML62MwpoGKRlJYDnQXkc7ANuB84ILqHKAxjJIaaUmJCbx6zQiKS8sB2H3Aw0Pvfs3lIweTWnkEVjg041vZQXx9DAlJcOyFMGCiJQdjGohIX5I6E1gK9BSRrSJymaqWAdcAbwBrgNmq+nk1jztBRKYWFoaYttJUKTU5kZbufA0rd5Tzr3fXsWvvwcAbeyfywe3qUY+zvOJJmH4qzL3BLmk1pgGI9NVHE4OsnwfMq8VxraZQx07OTubXZ4ygY8umADzx/kZye7ajS7s0ZwPvjG/emoKvC1oPJYePn4XuJ0NaezhyAHz3ibOd1SSMiRtxOcyF9SlEhjch7C46yP1vr6Wg6CB/GOdONu4/kU/TNs4X/sfPubUHv+Tw5dzDD/zRMzDoIksOxsSBuEwKVlOIrDZpTXjntyf4hs5Yva2Q9buKOG3AEMT/S33ABc5wGRWSQwCeUqcm8clMu8/BmBgXl0nBRJ53Mh+AZz/YzNtrdnJi7wxfogAOzfbmTQ4fPeskgIAUyg44l7wee7E1LRkTo+IyKVjzUf26+8x+/PKE/aQ1SUJVmfb+Rs4dnEWLZsnOBpWTA3KoT6FoF6x941Cy2LbSeXitmA6djod2PS1BGBMD4jIpWPNR/UpMEDq3bQ7AZ1sL+eu8NRyRmszPh2RV3DDYPNFzb4AVTwQ5ugc2L3YeH8+AyXMtMRgTRTE1dLaJfQOyWjL/+lGcM9iZ4m/xugI+2fJjFTtNdK5cqkr5QbemYYyJlrhMCnafQnT1yEgnIUFQVaa8+RW3vrQq9CTzWUNh8muQcynk/ALGPwC9TiXg6ffxc3a/gzFRZM1HpsZEhGcuG8bOPcWICMWl5bz88TbOHZxJUmKlL/zKTUs5k50v/0+fg81LYddXOJe1HnQ6o8fdY81IxkRBXCYFEzvSmiSR5t7g9vrq7dzy0iq6tktjaOfWVe/sTRRbljl3RZeXOOu3rYRpP7UOaGOiIC6bj0xsOmNgR1761XBfQnhnzQ6+/fFA1TtmDXXGUKrA7YBe8QRMH29NSsbUk7hMCtanEJtEhEFHtwKguLScP7z4GX957Yvwdg7VGe1tUrLEYEzExWVSUNU5qnplixYtoh2KCSI1OZH//fon3Pqz3gAUFB3k3S93BN/B2xkdrAN620qYdjK8dUdkAjbGAHGaFEx8yGrdjMxWzQB4cvFGrnx6JdtCNSdlDYXzn4PL3oCOgwNsoLD4n84Nb8aYiLCkYOrFdSf24JnLhvkG3Xvrix3sLykLvHHWUOfqo2DNSW/9CR4aCk/+zIbsNqaOWVIw9SIlKYHju7YBYNuPB7jq2ZX8+731wXfwv7fhyH4VXztYCAVfWUe0MREQl5ek2thH8a1jy6bMuvI4umekA7BhVxFlHqWHu+zjf2/D85MCD8sNTkf0pkV22aoxdSAuawrW0Rz/crJb06KpM6De/83/kgse+8A3NWhAP7ku9FAZxXvqOEJjGqcqk4KI/FdEThWRuEwgJvbdc1Z//jVxEKnJiagq7321E4+n0rAZ/s1JnX4CzdpWfH3xA9YBbUwdCOeL/j/ABcDXInKPiPSKcEymkWnVPMXX37B0w24ufXI5//tk2+EbZg2F8f+ES+fBxJkgiX4vKsy9Dp48xfoXjKmFKpOCqr6tqpOAQcAm4C0RWSIil4pIcqQDNI3LcZ3b8PAFg5gw4CgAvvh2D7uLDh6+YdZQ6HnK4es3L4EnxlliMKaGwmoSEpE2wGTgcuBj4AGcJPFWxCIzjVJCgnBq/w4kJyagqlw/62N+8dSKwKOw/uQ6SAjwu0TLYcY51pxkTA2E06fwErAIaAZMUNXTVHWWql4LpEU6QNN4iQj/njSIP43vg4hQWu5h5eYfDm2QNdRpSuo0/PCdiwud5qT/y3auXLKagzFhCaem8KS+mScAACAASURBVJCq9lHVv6nqdv8XVDUnQnGFZGMfNR7d2qczuJMzntLsFVs4+z9L+NR/Up+soXDp6/CT6wMf4MAPzqWsT/7MEoMxYQgnKfQWkZbeBRFpJSK/imBMVbJLUhuns47N5N5z+tM/0/m7r95WyMEy9zLWsXcGTwzgzBG9+J/1EKUx8S2cpHCFqvp+mqnqD4BNbmPqXdOURM7NyUJE2FtcyoXTPuSPL68+tMHYO51Z3ZpnBD7Al6/BlB7W12BMCOEkhQQREe+CiCQCYUy4a0zkpKcm88/zBvLLUV0A2FNcyje79zszuv1+rZMcKt/LAFC0w+lr+GtHG3HVmADCSQpvALNF5EQRGQPMBOZHNixjqpbbs71vqIx/vfM1P/3nwkOXr+ZMdu5lSAgykktJkdOc9F+r9BrjL5yxj24CfglcDQjwJvB4JIMyprouG9GFHhnptElrAsDaHXvpnjkEufR1eO1G+G5V4B1XzXb+PfuxeorUmNgWzs1rHlX9j6qeo6pnq+qjqhpikBpj6t+RLVI5NycLgI0F+zj1wUU8tmiDc3XSVe+H7oReNRumjqmnSI2JbVXWFETkJ0Ae0MndXgBV1S6RDc2Ymslq1ZQ/je/DuGM6ALBzbzFNR95Geq9TnSajde9AWXHFnb5dCXmtIKUpdBgAJ90ZhciNib5wmo+mATcAKwGrIZiYl5SYwEXHZ/uW//jyar7esZe3bzyBpPOfc+5XmPZTwFNpTw+U7HOGypg2luOSW0FantM/YUwjEU5Hc6Gqvq6qO1V1t/dR14GISG8ReUREXhSRq+v6+KbxunZMN24Y24OkROd0/6bZMc6Un+lHhdyvSekPzpVKDxxrN76ZRiOcpPCeiNwrIseLyCDvI5yDi8gTIrJTRFZXWj9ORL4SkXUicjOAqq5R1auAnwNRuVPaNEz9M1ty+sCOACxZV8AJU97j3X2d4Ldr4KhAc0E7fNdh/7ABpo21S1hNoxBOUhiG8yX9V+Af7mNKmMefDozzX+He5/AwcArQB5goIn3c104D3gfeCfP4xlTLMZkt+M2Y7gzv6tzD8N158ygffh00aVFpKO4AFv8T/tzGLmM1DZoEHH2yLt9AJBuYq6rHuMvHA3mq+lN3+RYAVf2b3z6vqeqpQY53JXAlQEZGxuDnn3++RnEVFRWRlmbj+Xk1xvLwqHLHkmLaNhWuG5QKQIdv36Dz+qdJLi/ybeetMfj/T1GEHe1H8VWfG+sv4ChrjOdIKPFcHqNHj14ZbOy6cK4+ysCpJRylqqe4v+qPV9VpNYynI7DFb3krMExEcoGzgCbAvGA7q+pUYCpATk6O5ubm1iiI/Px8arpvQ9QYy0NVOdDmO5qmJJDbKwOPR9m973hS0v8GK6ZTOu9mkj0HfNuL376C0mHnAjoULIWf3dsoOqMb4zkSSkMtj3Caj6bj3NXs7ZVbC4S46LtKEmCdqmq+qv5GVX+pqg+HPICNkmrqgIgzd8OYXs5YSS+s3ELuve+xbudeyJnM4lHPQ5cq7l/wlDid0U+fWQ8RGxN54SSFtqo6G/f6PVUto3aXpm4FsvyWM4Fvq3MAGyXVRMKwzm24eHg2Xds5TQJ7ShQufjn0IHteG96FvBZwT7YNuGfiWjhJYZ8785oCiMhxQG1+oi8HuotIZxFJAc4HXq3OAaymYCIhu21zbhrXyzcK6+2LDzDlja8ODbJ32VtwZL/QByl2L2N9aGi9xGxMXQsnKdyI86XdVUQWA08D14ZzcBGZCSwFeorIVhG5zK1pXIPTJLUGmK2qn1cnaKspmEhLSUpg7NFJnNTHqSEUl5ZTfORgZ8iMvEJo2zP0AQq+cmoO1qxk4kw4Yx99BJwADMcZGK+vqn4WzsFVdaKqdlDVZFXN9HZOq+o8Ve2hql1V9e7qBm01BRNpTZISGd81hYFZzvxSD727jrH3L2BvcamzwTXLoN/Pqz6Qt1nJLmM1cSKcOZovBi4ABgODcO4ruDjSgYViNQVT30Z0b8uZx2aSnpoMOPM3cPZjTn9DaquqD7BqNtzbI8JRGlN74TQfDfF7jMQZHO+0CMZkTMw5rksbbhzrfKlv/WE/P/nbu/zv421Of8PNm5wmpapqDvt2WK3BxLxwmo+u9XtcARxLlGdes+YjE03NUpKYMPAohnRuDcD+kjJU1ak5XPZWlWMqsWo23NXOxlMyMSmcmkJl+4HudR1IdVjzkYmm1s1T+OuZ/ejYsikAt7y0ioumLXMSQ9ZQZ0ylvMLQl7GWlzjjKdk8DibGhNOnMEdEXnUfc4GvgFciH5ox8eEnXduS27Md3qnMi0vd23h+vzb05D7gzONwV/sIR2hM+MKZT8F/8LsyYLOqbo1QPGERkQnAhG7dukUzDGMA+PmQQ/difvTND1z+1AoevySHQUe3grF3Qq9TYfoEKC8OfIDyg05fAwL9zrWpQU1UhdOnsMDvsTjaCcGNyZqPTExKb5LE8V3a0CMjHYCDZeVOk9LtO6quNaBOf0NeS7sr2kRNOM1He0VkT4DHXhHZUx9BGhMvumek8/CkQaQ1ScLjUSY99iF3v/aF8+LYO6vuawBAnbui7RJWEwXhdDTfD9yMM7ppJnAT8BdVTVfVIyIZnDHxrMyjDOncmt4dnP8mqkppucfpaxj/QNUH2LfDqTUYU4/CSQo/VdV/q+peVd2jqv8Bzo50YKHYJakmHqQkJXDTuF6cNSgTgLmfbeeUBxbx7Y8HnPsbwq015LWwsZRMvQknKZSLyCQRSRSRBBGZRO1GSa0161Mw8ahls2S6t08j4whnQp8yb60hr7DqIboLvoK8MO6cNqaWwkkKF+DMm7zDfZzrrjPGVMPI7u34z4WDSUwQikvLOeWBRcxa/o3z4sUvO8kh5JAZHqfWkNfS5os2ERPO1UebVPV0VW2rqu1U9QxV3VQPsRnTYO0vKadHRjpZrZsB4PGoc/PbzZvCu0pp8T/hz20jHqdpfMK5+qiHiLwjIqvd5f4iclvkQzOm4WrdPIWHJw1ieFfni/0/C9Yz+cnlzo1v3quUEpJDH8RT6tYcbIhuU3fCaT56DLgFKAVwh80+P5JBVcU6mk1D06JpMm3SUkhNTgScK5X4UwEcNTi8A3iH6LbxlEwthZMUmqlq5TOtLBLBhMs6mk1Dc+Fxnbjv5wMB2Lm3mLH3L2TJ+gK48t0wmpP8TBtrVyqZWgknKRSISFcOTcd5DrA9olEZ04jtOVBKi6bJHOlepaQn5TnNSXmFkJJe9QG8s74ZUwPhjH30a2Aq0EtEtgEbgUkRjcqYRqxb+3T+e/Vw3/Ltr6wmJTGR28f3Rm7d6gyBMfe6qg+U1wJIgLwfIharaXhCJgURSQByVPUkEWkOJKjq3voJzRijqiQlJJCUKL5RWHXwJUjOZGeDe3s4dz4H5TlUa2jb05lG1JgQQjYfqaoHuMZ9vs8SgjH1S0TIO60vt5zSC4A12/dw1n+WsLFgn7OB9+a3cFizkglDOH0Kb4nI70QkS0Raex8Rj8wY4+OtJRQUHeRASTmtmlW6XDWvEBKbhHcwGzbDhBBOUvgFTr/CQmCl+1gRyaCqYpekmsZqZPd2vH7dSFo2S0FV+dWMlcxc5t4VffvO8K9U8tYabIIfU0k4dzR3DvDoUh/BhYjJLkk1jZa31rC/pJyig+UcLPUbisx749tlb4V3MO8EPzauknEFTQoi8le/52PrJxxjTLiaN0niqUuHcPHx2QC8++UOfv3cRxTuL3Um9vFexhoWt0P6v1dELF4TH0LVFMb5Pf+/SAdijKk+ESEhwak5bC8sZvPufTRNSay4UV6hc+VROFbNts7oRi6cPgVjTByYNKwTr/x6BClJCZSUebhs+nKWrCtwXrxmWXjjKXl5x1QyjU6o+xTai8iNgPg991HV+yIamTGm2hLdWsOOPcVs2r2PA6WVpj75k5sk7smG4jBuastrAQjk/VincZrYFSopPAakB3hujIlxWa2b8cb1o0hKdBoDZny4mW9/PMANJ/Vw1t28ydnw/n5Q+E0VR3Nmf+vQ41dAbgSjNrEgaFJQ1TvrMxBjTN3yJgSAr3cUsX5Xka8m4XPDKuffMJqKeqz9N+T9x2oNDZz1KRjTCOSd1pdplwxBRPhxfwlXPbOS9buK/DYohKRmYRzJnTP6Lx0iFquJrphKCiJyhog8JiKviMjJ0Y7HmIYkJcn57752RxErNn/PwVJPxQ1u2+4khyBzOFSoY5Ttt47oBiriSUFEnhCRnd6Z2/zWjxORr0RknYjcDKCq/1PVK4DJwHmRjs2Yxmho59a8f9MY+hx1BAD/eudrXlix5dAGV74b/rAZ3jmjTYMRtE+h8tVGlVXj6qPpwEPA037HTgQeBsYCW4HlIvKqqn7hbnKb+7oxJgK8M7yVe5T31xXQpV1zzs3JqrjR7Tudf90agVKptuDjNik1z3AG6DNxTVQ18Asid7hPewJDgFfd5QnAQlW9POw3EckG5qrqMe7y8UCeqv7UXb7F3fQe9/GWqr4d5FhXAlcCZGRkDH7++efDDaOCoqIi0tLSarRvQ2TlcbjGUiaqSqkHUhKF7/Z5mL+xlHN6pJCWcigFjMg/nQQOJQX/5OD/DVIOvJ/7SuSDjgHxfH6MHj16parmBHotaFLwbSDyJnC2d9hsEUkHXlDVcSF3rHiMbComhXOAcd7EIiIXAcOAtcAlwHLgE1V9JNRxc3JydMWKmo3Nl5+fT25ubo32bYisPA7XGMvkhRVbuHveGt68YRTt01MrvJafn09u/unhHyzsITbiUzyfHyISNCmE06dwNFDit1wCZNc2pgDrVFUfVNXBqnpVqIRgo6QaExnn5mTx/k1jfAnhb6+vYdnG7w9tkFcI4x8I72B5LWDqmAhEaSIpnKTwDLBMRPLcJqUPgadq+b5bAf8GzEzg23B3tlFSjYmctCZOV+MP+0qY++l2lm/6vuIGOZPDrwV8u9KuUoozVc7RrKp3i8jrwEh31aWq+nEt33c50F1EOgPbgPOBC8LdWUQmABO6detWyzCMMcG0ap7C2zeeQIL70/Gr78v5In8dl4/o4lze6k0M4Xzpe7dJauZc+mpiVsiagogkiMhqVf1IVR9wH9VKCCIyE1gK9BSRrSJymaqW4Uzz+QawBpitqp+He0yrKRhTP5qmJNIkyblS6ZNd5Tz34Td4KvdDhn3jG3Z/QxwIZ47mT0Xk6Jq+gapOVNUOqpqsqpmqOs1dP09Ve6hqV1W9u6bHN8bUj/N6pjDnmhGkJidS7lH+MvcLNu9254r23vgWLksMMavK5iOgA/C5iCwD9nlXquppEYuqCtZ8ZEx0tGqeAsBX3+3luWXf0D+rJZ3aND+0QU2alBr4VUrxJpykEHMD46nqHGBOTk6OTRNlTBT0OeoIFvx+NG3TnCTx+qrtKHDKMUc604XmFcJbd8Dif1Z9sLwWlhhiSDhzNC8AvsQZOjsdWOOuixq7JNWY6GuX3sQ3X/Rzy77hsUUbqNDd4J0vOpwvfJvUJ2ZUmRRE5OfAMuBc4OfAh+7NZ1FjHc3GxJYnJw/h0YsGk5Ag7DtYxj/e/Io9xaWHNgi3JpDXAv6aGZkgTVjCuU/hj8AQVb1EVS8GhgK3RzYsY0w8SUpM8N3wtujrXTz83jq+3lFUcaO8Quj386oPVrLXag1RFE5SSFDVnX7Lu8PczxjTCI07pgML/zCawZ1aAc7QGau2ujWFsx+rXq3BkkO9C+fLfb6IvCEik0VkMvAaMC+yYYVmfQrGxLbMVs59CwfLyvnn21/z+PsbKm6QVwgpYc7wa4mhXoXT0fx7YCrQHxgATFXVmyIdWBUxWZ+CMXGgSVIir18/kj+N7wPAtz8e4Jmlmyj3KNy6tXq1BlMvQs2ncD2wGPhYVf8L/LfeojLGNBhHpCb7nv935VYeem8dY3pn0LFlU2dluPc2eF9vcfShuaVNnQtVU8gEHgB2iki+iPxVRE4Vkdb1FJsxpoG5Zkw3XvvNCF9CmPHhZnbsKXZeDLfWUPiN1RwiKGhSUNXfqepw4EjgVuB74BfAahH5Ith+9cH6FIyJTyJCt/ZOX8J3hcXcOecLZnyw+dAG4d7XAJYYIiScjuamwBFAC/fxLc7w2VFjfQrGxL8jW6Ty1g2juCq3KwCrtxWycO0u58XqJIa72kcowsYpaFIQkakishiYBRwPLAHOVdUcVb20vgI0xjRcndo0p1mK07X56MIN/PaFTzlQUu68GG5iKD9oiaEOhRr76GigCfA1zpwHW4Ef6yMoY0zjM+Xc/mws2EfTlERUleeXb+HMP35ParIzdHfI5qLygzaGUh0J1acwDhgCTHFX/RZYLiJvikjMDZJnjIlvTZIS6XXkEQAs3/QDt7y0itc+85uQJ9wxlFZMj0yAjURV8ymoqq7GuVntdZxLVLsC19VDbEFZR7MxDdvQzq156VfDOfPYjgB8uGE363YWhZcY5l5niaEWQvUp/EZEnheRLcBCYDzwFXAWENXLUq2j2ZiGb9DRrUhIEFSVO179nBtnf4Kqhp8Y7OqkGgnVp5ANvAjcoKo2qaoxJipEhGcvH8YP+0oQEYpLy3n7rDWc+lJvpKqdrZ+h2kL1Kdyoqi9aQjDGRFvbtCZ0z3Dub3jpo21c89zHfPyLTeH3M5iw2Winxpi4cv6QLJ69bBiDjnZGYV00aR1lVe1kiSFslhSMMXElIUEY0b0tAPtLyvj1jI/4Xa/8qne0xBCWcOZoNsaYmNQsJYnZVx1P85QkOL+QgqKDtJzSPvgXW14LG1CvCnFZU7BLUo0xXr2OPIKs1s78Df/JX0+f0udDNyfZgHohxWVSsEtSjTGB3Di2B49ePJgktwNa3UdAlhgCisukYIwxgTRvksTons44SJuu+ZbyqnqgLTEcxpKCMaZBym7bnIUXrsUdXi94rcHmgq7AkoIxpsEa0yuDJHeOBv/kEJAlBsCSgjGmkdhx/XeUA0IViaGRJwdLCsaYRqFjy6Yk+d0BrWq1hkAsKRhjGhe3KUmrHDipcYqZpCAiXURkmoi8GO1YjDENW1JeIQl5hb6mpICd0I20KSmiSUFEnhCRnSKyutL6cSLylYisE5GbAVR1g6peFsl4jDGmAr/E4K/CciNLDJGuKUwHxvmvEJFE4GHgFKAPMFFE+kQ4DmOMCSyvkASoMAz3YS1LjSgxRDQpqOpC4PtKq4cC69yaQQnwPHB6JOMwxpiQ/Dugqdic5HveSBKDqAbtf6+bNxDJBuaq6jHu8jnAOFW93F2+CBgG3AHcDYwFHlfVvwU53pXAlQAZGRmDn3/++RrFVVRURFpaWo32bYisPA5nZVJRYyiPEfmn+2oN/rUFb2LwAO/nvgLEd3mMHj16parmBHotGqOkBurzV1XdDVxV1c6qOhWYCpCTk6O5ubk1CiI/P5+a7tsQWXkczsqkokZRHrmFFWoE/j+ZvcliRP7pJOUVNtjyiMbVR1uBLL/lTODb6hzARkk1xkSMX1OScHitIQEoa8BNSdFICsuB7iLSWURSgPOBV6tzABsl1RgTUSGm+RScJpYR+aez5fv99RZSfYn0JakzgaVATxHZKiKXqWoZcA3wBrAGmK2qn1fzuFZTMMZEVoDE4F9rSAA63NeBZz/YXJ9RRVykrz6aqKodVDVZVTNVdZq7fp6q9lDVrqp6dw2OazUFY0zkVVFjSEyC8+f3B2B30UGKS8uDbh8vYuaO5uqwmoIxpt4ESQzeWoP3ap2b/vsZZ/17CR5PZK/ojLS4TApWUzDG1KsAiaHyXc+PbDyJX4zoTEKC08C0c29x/cRWx+IyKRhjTL0L0ZQETo3hnDl9AVi8roAR97zHknUF9RBY3YrLpGDNR8aYqKh0uWow3TPSuPj4Tgzq1AqA7/eVEOkbhetKXCYFaz4yxkRbqLkY2v8jg9vG9yE1OZFyjzLp8Q+5cfan9RlejcVlUjDGmKhxawtV/u73u8HtwuOO5mf9OgDg8Sh7iksjFFztxWVSsOYjY0xU5RWy0B0DqSqJCcKkYZ0Y2ycDgNkrtjD63ny+2R2bN77FZVKw5iNjTLzql9mCCQOOIqt1UwD2xlitIS6TgjHGxIQqrkjyzd7m15TU96gW5J3WFxFhb3EpJ923gEcXrI9woOGzpGCMMbWRV1h1coCA8zEkiHD6wI4c16UNAMWl5ZSVe+o6wmqJy6RgfQrGmIageZMkbv1ZbwZktQTgX+9+zfh/vc+BkugNlxGXScH6FIwxDdGAzJbk9mxP05REgKiMpRSXScEYY2JOlf0LVbdsnNz3SG4+pRcAW77fz/B73uXNz7+ri+jCFo2Z14wxpmGq/MXv348QaGKeEIkiMUEY3rUNx3R09jtYVk6TpMS6iDIkqykYY0wkhDM7W4htjmrZlIcuGMRRLZ1LV3//wmf8esZHER8uIy5rCiIyAZjQrVu3aIdijDERp6oc0/EIyj0gEmrUpdqLy6SgqnOAOTk5OVdEOxZjjIk0EeHKUV3r5b2s+cgYYyIhrHsXYu+y+risKRhjTFyIwS/9qlhNwRhjjI/VFIwxpj6FdVVS9GoYVlMwxpj6Ek5CqM52ERCXScHGPjLGmMiIy6RgYx8ZY0xkxGVSMMaYuBRuX0EU+xSso9kYY+pTjF+majUFY4wxPpYUjDHG+FhSMMYY42NJwRhjjI8lBWOMMT4xc/WRiDQH/g2UAPmqOiPKIRljTKMT0aQgIk8A44GdqnqM3/pxwANAIvC4qt4DnAW8qKpzRGQWYEnBGNN41HRoizq+xDXSzUfTgXH+K0QkEXgYOAXoA0wUkT5AJrDF3aw8wnEZY0zsqM1YR3U8TlJEawqqulBEsiutHgqsU9UNACLyPHA6sBUnMXxCiGQlIlcCVwJkZGSQn59fo9iKiopqvG9DZOVxOCuTiqw8KqrL8hgFiPuoDnUfC+vw7xKNPoWOHKoRgJMMhgEPAg+JyKnAnGA7q+pUYCpATk6O5ubm1iiI/Px8arpvQ2TlcTgrk4qsPCqq0/LIr9lu3kRSl3+XaCSFQMlQVXUfcGlYBxCZAEzo1q1bnQZmjDFRkVcYM30K0UgKW4Esv+VM4NvqHEBV5wBzcnJyrqjLwIwxJmpiZEykaNynsBzoLiKdRSQFOB94tToHsPkUjDEmMiKaFERkJrAU6CkiW0XkMlUtA64B3gDWALNV9fPqHNfmUzDGmMiI9NVHE4OsnwfMq+lxrU/BGGMiIy6HubCagjHGREZcJgVjjDGREZdJwTqajTEmMkRVox1DjYlIIfB1pdUtgMIgy/7P2wIFdRxS5feu7fahXg/0WjjrrDzCKw+o+zKp6/IItU2466uzbOXRcMqjk6q2C7iHqsbtA5ha1Tr/5UrPV9RHPLXZPtTr4Xx2K4+al0ckyqSuyyPUNuGur86ylUfDKY9Qj7hsPvITaDiMyuvmhHitrlX3+FVtH+r1cD57oHVWHsGX4608Qm0T7vrqLtclK4/aHbs25RFUXDcf1YaIrFDVnGjHESusPA5nZVKRlUdFDbU84r2mUBtTox1AjLHyOJyVSUVWHhU1yPJotDUFY4wxh2vMNQVjjDGVWFIwxhjjY0nBGGOMjyUFl4g0F5GnROQxEZkU7XiiTUS6iMg0EXkx2rHEAhE5wz03XhGRk6MdT7SJSG8ReUREXhSRq6MdT6xwv0dWisj4aMdSUw06KYjIEyKyU0RWV1o/TkS+EpF1InKzu/os4EVVvQI4rd6DrQfVKQ9V3aCql0Un0vpRzfL4n3tuTAbOi0K4EVfN8lijqlcBPwca3GWZXtX8DgG4CZhdv1HWrQadFIDpwDj/FSKSCDwMnAL0ASaKSB+cGeC8c0eX12OM9Wk64ZdHYzCd6pfHbe7rDdF0qlEeInIa8D7wTv2GWa+mE2aZiMhJwBfAjvoOsi416KSgqguB7yutHgqsc38JlwDPA6fjTBOa6W7TIMulmuXR4FWnPMTxf8DrqvpRfcdaH6p7fqjqq6o6HGiwza3VLJPRwHHABcAVIhKX3yPRmKM52jpyqEYATjIYBjwIPCQipxL54Q5iScDyEJE2wN3AsSJyi6r+LSrR1b9g58e1wElACxHppqqPRCO4KAh2fuTiNLk2oRYTZsWpgGWiqtcAiMhkoEBVPVGIrdYaY1KQAOtUVfcBl9Z3MDEgWHnsBq6q72BiQLDyeBDnh0NjE6w88oH8+g0lZgQsE98T1en1F0rdi8vqTS1tBbL8ljOBb6MUSyyw8qjIyqMiK4/DNegyaYxJYTnQXUQ6i0gKcD7wapRjiiYrj4qsPCqy8jhcgy6TBp0URGQmsBToKSJbReQyVS0DrgHeANYAs1X182jGWV+sPCqy8qjIyuNwjbFMbEA8Y4wxPg26pmCMMaZ6LCkYY4zxsaRgjDHGx5KCMcYYH0sKxhhjfCwpGGOM8bGkYGKKiNwvItf7Lb8hIo/7Lf9DRG6sxfHzROR34a6v4lj5IlLrYaPd46zwW84RkfzaHtc91mQReagujmUaB0sKJtYsAYYDuKNMtgX6+r0+HFgczoHcIY7jRXsROSXaQVQWZ2Vo6oAlBRNrFuMmBZxksBrYKyKtRKQJ0Bv42B3K+l4RWS0iq0TkPAARyRWR90TkOWCVu+6P7oQobwM9qwrA/eX+fyKyTETWishId31TEXleRD4TkVlAU799ThaRpSLykYi8ICJpItLCfd+e7jYzReSKIG97L85cDZVjqfBLX0TmuiOUIiJFbpwrReRtERnqxr7BnevAK0tE5rux3OF3rAvdz/iJiDzqTQDucf8sIh8Cx1dVXqZhsaRgYoqqfguUicjROMlhKeD9csoBPnPHsD8LGAgMwBnS+l4R6eAeZijwR1XtIyKDccamOdbdZ0iYoSSp6lDgesD7RXo1sF9V++MMKz4YQETa4nyhn6Sqg4AVwI2qWogzHMJ0ETkfaKWqjwV5v6XAQREZHWZ8AM2BfFUdDOwF/gKMBc4E/uy3csr3kwAAAnJJREFU3VCcOQ8GAue6zVO9cWaQ+4mqDsSZWGqS33FXq+owVX2/GvGYBqAxDp1tYp+3tjAcuA9n/PrhQCFO8xLACGCmqpYDO0RkAc4X/h5gmapudLcbCbysqvsBRCTcgctecv9dCWS7z0fhDp+tqp+JyGfu+uNwZuBaLCIAKThf8qjqWyJyLs5MXQOqeM+/4CSXm8KMsQSY7z5fBRxU1VIRWeUXM8Bb7lDoiMhLOGVXhpPUlrsxNwV2utuXA/8NMwbTwFhSMLHI26/QD6f5aAvwW5wv/CfcbQKNae+1r9JyTQb4Ouj+W07F/yeBjiU4X7wTD3vB6RfpDRwAWuMMuxyQqr4rInfhJBmvMirW6FP9npfqocHLPN6YVdUjIqFiVjfmp1T1lgChFLvJ1jRC1nxkYtFiYDzwvaqWq+r3QEucJqSl7jYLgfNEJFFE2uH8il8W4FgLgTPd/oB0YEIt4lqI28QiIscA/d31HwA/EZFu7mvNRKSH+9oNOCNpTgSeEJHkKt7jbuAPfsubgIEikiAiWThNQdU1VkRai0hT4Ayc8n0HOEdE2rsxtxaRTjU4tmlgrKZgYtEqnKuOnqu0Lk1VC9zll3GSxKc4v3z/oKrfiUgv/wOp6kdup/AnwGZgUS3i+g/wpNts9AluElLVXeJMwTjT7QwHuM1tlrkcGKqqe0VkIU7z0B2HHflQvPNEZJffqsXARpzPvxqoyfzQ7wPPAN2A51R1BYCI3Aa86dZmSoFf45SRacRs6GxjjDE+1nxkjDHGx5KCMcYYH0sKxhhjfCwpGGOM8bGkYIwxxseSgjHGGB9LCsYYY3wsKRhjjPH5f96CKoayinwYAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "def zipf_plot(bag):\n", " M = max(bag.values()) # Count for most common word\n", " X = range(1, len(bag) + 1)\n", " plt.yscale('log'); plt.xscale('log'); \n", " plt.title('Frequency of n-th most frequent word and 1/n line.')\n", " plt.xlabel('Word Index Number'); plt.ylabel('Word Frequency')\n", " plt.plot(X, [M/i for i in X], ':', label='1/n')\n", " plt.plot(X, [c for (w, c) in bag.most_common()], '.', label='actual word frequency')\n", " plt.grid(); plt.legend(loc=1)\n", " \n", "zipf_plot(BAG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Task: Spelling Correction\n", "\n", "Given a word *w*, find the most likely correction *c* = `correct(`*w*`)`.\n", "\n", "**Approach:** Try all candidate words *c* that are known words that are near *w*. Choose the most likely one.\n", "\n", "How to balance *near* and *likely*?\n", "\n", "For now, in a trivial way: always prefer nearer, but when there is a tie on nearness, use the word with the highest `WORDS` count. Measure nearness by *edit distance*: the minimum number of deletions, transpositions, insertions, or replacements of characters. By trial and error, we determine that going out to edit distance 2 will give us reasonable results. Then we can define `correct(`*w*`)`:\n", " \n", " \n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def correct(word) -> Word:\n", " \"\"\"Find the best spelling correction for this word.\"\"\"\n", " # Prefer edit distance 0, then 1, then 2; otherwise default to word itself.\n", " candidates = (known({word}) or \n", " known(edits1(word)) or \n", " known(edits2(word)) or \n", " {word})\n", " return max(candidates, key=BAG.get)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The functions `known` and `edits0` are easy; and `edits2` is easy if we assume we have `edits1`:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def known(words) -> Set[Word]:\n", " \"\"\"Return the subset of `words` that are in the known `BAG` of words.\"\"\"\n", " return words.intersection(BAG)\n", "\n", "def edits2(word) -> Set[str]:\n", " \"\"\"Return all strings that are two edits away from this word.\"\"\"\n", " return {e2 for e1 in edits1(word) for e2 in edits1(e1)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for `edits1(word)`: the set of candidate words that are one edit away. For example, given `\"wird\"`, this would include `\"weird\"` (inserting an `e`) and `\"word\"` (replacing a `i` with a `o`), and also `\"iwrd\"` (transposing `w` and `i`; then `known` can be used to filter this out of the set of final candidates). How could we get them? One way is to *split* the original word in all possible places, each split forming a *pair* of words, `(a, b)`, before and after the place, and at each place, either delete, transpose, replace, or insert a letter:\n", "\n", "\n", "
pairs: Ø, wird w, ird wi, rd wir, dwird, ØNotes: (a, b) pair\n", "
deletions: Ø+ird w+rd wi+d wir+ØDelete first char of b\n", "
transpositions: Ø+iwrd w+rid wi+drSwap first two chars of b\n", "
replacements: Ø+?ird w+?rd wi+?d wir+?Replace char at start of b\n", "
insertions: Ø+?+wird w+?+ird wi+?+rd wir+?+d wird+?+ØInsert char between a and b\n", "
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def edits1(word) -> Set[str]:\n", " \"\"\"Return all strings that are one edit away from this word.\"\"\"\n", " edits = set()\n", " for a, b in splits(word):\n", " if b: edits.add(a+b[1:]) # deletion\n", " if len(b) >= 2: edits.add(a+b[1]+b[0]+b[2:]) # transposition\n", " for c in alphabet:\n", " edits.add(a+c+b[1:]) # replacement\n", " edits.add(a+c+b) # insertion\n", " return edits \n", " return set(deletes + transposes + replaces + inserts)\n", "\n", "def splits(word) -> List[Tuple[str, str]]:\n", " \"\"\"Return a list of all possible (first, rest) pairs that comprise word.\"\"\"\n", " return [(word[:i], word[i:]) \n", " for i in range(len(word)+1)]\n", "\n", "alphabet = 'abcdefghijklmnopqrstuvwxyz'" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('', 'wird'), ('w', 'ird'), ('wi', 'rd'), ('wir', 'd'), ('wird', '')]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits('wird')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'aird awird bird bwird cird cwird dird dwird eird ewird fird fwird gird gwird hird hwird iird ird iwird iwrd jird jwird kird kwird lird lwird mird mwird nird nwird oird owird pird pwird qird qwird rird rwird sird swird tird twird uird uwird vird vwird waird ward wbird wbrd wcird wcrd wdird wdrd weird werd wfird wfrd wgird wgrd whird whrd wiad wiard wibd wibrd wicd wicrd wid widd widr widrd wied wierd wifd wifrd wigd wigrd wihd wihrd wiid wiird wijd wijrd wikd wikrd wild wilrd wimd wimrd wind winrd wiod wiord wipd wiprd wiqd wiqrd wir wira wirad wirb wirbd wirc wircd wird wirda wirdb wirdc wirdd wirde wirdf wirdg wirdh wirdi wirdj wirdk wirdl wirdm wirdn wirdo wirdp wirdq wirdr wirds wirdt wirdu wirdv wirdw wirdx wirdy wirdz wire wired wirf wirfd wirg wirgd wirh wirhd wiri wirid wirj wirjd wirk wirkd wirl wirld wirm wirmd wirn wirnd wiro wirod wirp wirpd wirq wirqd wirr wirrd wirs wirsd wirt wirtd wiru wirud wirv wirvd wirw wirwd wirx wirxd wiry wiryd wirz wirzd wisd wisrd witd witrd wiud wiurd wivd wivrd wiwd wiwrd wixd wixrd wiyd wiyrd wizd wizrd wjird wjrd wkird wkrd wlird wlrd wmird wmrd wnird wnrd woird word wpird wprd wqird wqrd wrd wrid wrird wrrd wsird wsrd wtird wtrd wuird wurd wvird wvrd wwird wwrd wxird wxrd wyird wyrd wzird wzrd xird xwird yird ywird zird zwird'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "' '.join(sorted(edits1('wird')))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bird',\n", " 'gird',\n", " 'sird',\n", " 'ward',\n", " 'weird',\n", " 'wid',\n", " 'wild',\n", " 'wind',\n", " 'wire',\n", " 'wired',\n", " 'wirt',\n", " 'wiry',\n", " 'word'}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "known(edits1('wird'))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24254" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(edits2('wird'))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'speling': 'spelling',\n", " 'errurs': 'errors',\n", " 'in': 'in',\n", " 'somethink': 'something',\n", " 'whutever': 'whatever',\n", " 'unusuel': 'unusual',\n", " 'misteakes': 'mistakes',\n", " 'everyware': 'everywhere'}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Speling ERRURS in \"somethink.\" Whutever; unusuel misteakes everyware?'\n", "\n", "{w: correct(w) for w in tokens(s)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we make the output prettier than that? Can we preserve the punctuation and capitalization in a text?" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def correct_text(text) -> str:\n", " \"Correct all the words within a text, leaving the rest untouched.\"\n", " return re.sub('[a-zA-Z]+', correct_match, text)\n", "\n", "def correct_match(match: re.Match) -> Word:\n", " \"Spell-correct word in match, and preserve proper upper/lower/title case.\"\n", " word = match.group()\n", " return case_of(word)(correct(word.lower()))\n", "\n", "def case_of(word) -> Callable:\n", " \"\"\"Guess what function would give the capitalization of `word`.\"\"\"\n", " return next(c for c in (str.upper, str.lower, str.capitalize, str)\n", " if c(word) == word)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Spelling ERRORS in \"something.\" Whatever; unusual mistakes everywhere?'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correct_text(s)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Audience says: TUMBLER ...'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correct_text('Audiance sayzs: TUMBLR ...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far so good. You can probably think of ways to make this better; we'll consider some later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model: Probabilities with the Bag of Words Model\n", "\n", "In the bag of words model, what's the probability of picking a particular word out of the bag? What's the probability of a sequence of words?\n", "\n", "We'll denote that probability of a single word as `Pword(word)` in Python; the probability of a sequence of words will be `Pwords(words)`. I'll redefine `Bag` to make bags be both a Counter and a `ProbabilityFunction`: a callable object, `P`, that will give you the probability of some outcome when invoked with `P(outcome)`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "class ProbabilityFunction:\n", " def __call__(self, outcome):\n", " \"\"\"The probability of `outcome`.\"\"\"\n", " if not hasattr(self, 'total'):\n", " self.total = sum(self.values())\n", " return self[outcome] / self.total\n", " \n", "class Bag(Counter, ProbabilityFunction): \"\"\"A bag of words.\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the following assignment allows us to get the probability of a word with `Pword(word)`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "Pword = Bag(WORDS)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.07241060756724281" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pword('the')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "80029" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pword['the']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above says that the probability of picking `the` randomly out of the bag of words is about 7%, and the actual count of `the` is 80,029. Below we see probabilities for more words:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(9.04804602921976e-07, 'lamentations'),\n", " (9.04804602921976e-07, 'victuals'),\n", " (1.357206904382964e-05, 'fashioned'),\n", " (7.5098782042524e-05, 'rare'),\n", " (0.00019091377121653694, 'english'),\n", " (0.0002596789210386071, 'common'),\n", " (0.00026963177167074883, 'word'),\n", " (0.00041621011734410896, 'words'),\n", " (0.0008215625794531542, 'most'),\n", " (0.000977188971155734, 'like'),\n", " (0.0010676694314479317, 'old'),\n", " (0.0032844407086067727, 'are'),\n", " (0.008843560188959394, 'is'),\n", " (0.019948227080620804, 'in'),\n", " (0.03466487394714674, 'and'),\n", " (0.07241060756724281, 'the')]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted({(Pword(w), w) for w in tokens('''\"The\" is the most common word in English; \n", " old-fashioned words like \"victuals\" and lamentations\" are rare.''')})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, what is the probability of a *sequence* of words? In the general case, this is defined by the definition of joint probability:\n", "\n", "$P(w_1 \\ldots w_n) = P(w_1) \\times P(w_2 \\mid w_1) \\times \\ldots P(w_n \\mid w_1 \\ldots w_{n-1}) = \\Pi_{i=1}^n P(w_i \\mid w_1 \\ldots w_{i-1})$\n", "\n", "In the bag of words model, each word is drawn from the bag *independently* of the others. So $P(w_i \\mid w_1 \\ldots w_{i-1}) = P(w_i)$, and we have:\n", " \n", "$P(w_1 \\ldots w_n) = P(w_1) \\times P(w_2) \\times \\ldots P(w_n) = \\Pi_{i=1}^n P(w_i)$\n", "\n", "Now clearly this model is wrong; the probability of a sequence of words in English *does* depend on the order of the words. But, as the statistician George Box said, *All models are wrong, but some are useful.* The bag of words model, wrong as it is, has many useful applications.\n", "\n", "
\n", " \n", "
All models are wrong, but some are useful
—George Box (1919-2013)

\n", "\n", "\n", "Below we define `Pwords` to compute the product of the individual word probabilities:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def Pwords(words: List[Word]) -> float:\n", " \"Probability of a sequence of words, assuming each word is independent of others.\"\n", " return Π(Pword(w) for w in words)\n", "\n", "def Π(nums) -> float:\n", " \"Multiply the numbers together. (Like `sum`, but with multiplication.)\"\n", " result = 1\n", " for num in nums:\n", " result *= num\n", " return result" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.1925826058314966e-20" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pwords(tokens('this is a test of the good words'))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.3963205369077944e-30" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pwords(tokens('here lies another sentence with occasionally unusual terms'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although both sentences have eight words, the second is judged to be 10 billion times less probable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Task: Word Segmentation\n", "\n", "**Task**: *given a sequence of characters with no spaces separating words, recover the sequence of words.*\n", " \n", "Why? Some languages have no word delimiters: [不带空格的词](http://translate.google.com/#auto/en/%E4%B8%8D%E5%B8%A6%E7%A9%BA%E6%A0%BC%E7%9A%84%E8%AF%8D)\n", "\n", "In hastily-written text there might be spelling errors that include [wordsruntogether](https://www.google.com/search?q=wordsruntogether). \n", "\n", "Some specialized sub-genres have no word delimiters; an example is domain names in URLs. Sometimes this can go [poorly](https://www.boredpanda.com/worst-domain-names/?utm_source=google&utm_medium=organic&utm_campaign=organic), as in the website `choosespain.com` which once was a tourist information site encouraging you to \"choose Spain\" for your travel, but then they noticed that the domain name could also be parsed as \"chooses pain\".\n", "\n", "To find the best segmentation, we can again take advantage of the bag-of-words assumption that words are independent of each other. If we segment the text into a first word and remaining characters, then the best segmentation with that first word is arrived at by finding the best segmentation of the remaining characters (which does not depend on the first word), where \"best\" means having maximum probability according to `Pwords`. Thus if we try all possible splits, the split with the maximum probability will be the overall best.\n", " \n", " segment('choosespain') ==\n", " max((['c'] + segment('hoosespain'),\n", " ['ch'] + segment('oosespain'),\n", " ...\n", " ['choose'] + segment('spain'),\n", " ...\n", " ['choosespain'] + segment('')),\n", " key=Pwords)\n", "\n", "To make this somewhat efficient, we need to avoid re-computing the segmentations of the remaining characters. This can be done explicitly by *dynamic programming* or implicitly with *memoization* or *caching* of function results, as is done by `functools.lru_cache`. Also, we needn't consider all possible lengths for the first word; we can impose a maximum length. What should it be? A little more than the longest word seen so far." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "18" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(len(w) for w in BAG)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def splits(text, start=0, end=20) -> Tuple[str, str]:\n", " \"\"\"Return a list of all (first, rest) pairs; start <= len(first) <= L.\"\"\"\n", " return [(text[:i], text[i:]) \n", " for i in range(start, min(len(text), end)+1)]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('', 'word'), ('w', 'ord'), ('wo', 'rd'), ('wor', 'd'), ('word', '')]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits('word')" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('c', 'hoosespain'),\n", " ('ch', 'oosespain'),\n", " ('cho', 'osespain'),\n", " ('choo', 'sespain'),\n", " ('choos', 'espain'),\n", " ('choose', 'spain'),\n", " ('chooses', 'pain')]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits('choosespain', 1, 7)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "@lru_cache(None)\n", "def segment(text) -> List[Word]:\n", " \"\"\"Return a list of words that is the most probable segmentation of text.\"\"\"\n", " if not text: \n", " return []\n", " else:\n", " candidates = ([first] + segment(rest)\n", " for (first, rest) in splits(text, 1))\n", " return max(candidates, key=Pwords)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['choose', 'spain']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('choosespain')" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['speed', 'of', 'art']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('speedofart')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "decl = ('wheninthecourseofhumaneventsitbecomesnecessaryforonepeoplet' +\n", " 'odissolvethepoliticalbandswhichhaveconnectedthemwithanother' +\n", " 'andtoassumeamongthepowersoftheearththeseparateandequalstati' +\n", " 'ontowhichthelawsofnatureandofnaturesgodentitlethem')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'when in the course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth the separate and equal station to which the laws of nature and of natures god entitle them'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence(segment(decl))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks good! It should be \"nature's\" not \"natures,\" but our approach does not allow apostrophes. Some more examples:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['small', 'and', 'insignificant']" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('smallandinsignificant')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks good. What about:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['large', 'and', 'insignificant', 'numbers']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('largeandinsignificantnumbers')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's not how I would segment it. Let's look at the probabilities:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4.1121373608995186e-10, 1.0663880482050684e-11)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(Pwords(['large', 'and', 'insignificant']),\n", " Pwords(['large', 'and', 'in', 'significant']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probabilities are close, but the segmentation with fewer words is preferred. The bag of words model does not know that \"small\" goes better with \"insignificant\" and \"large\" goes better with \"significant,\" because the model assumes all words are independent.\n", "\n", "Summary:\n", " \n", "- Overall, looks pretty good!\n", "- The bag-of-words assumption is a limitation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data: Billions of Words\n", "\n", "Let's move up from millions to [*billions and billions*](https://en.wikipedia.org/wiki/Billions_and_Billions) of words. I happen to have a word count data file available in the format `\"word \\t count\"`. Let's arrange to read it in:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "def load_counts(lines, sep='\\t', fn=str.lower) -> Bag:\n", " \"\"\"Return a Bag initialized from key/count pairs, one on each line.\"\"\"\n", " bag = Bag()\n", " for line in lines:\n", " word, count = line.split(sep)\n", " bag[fn(word)] += int(count)\n", " return bag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll make `P1w` be a bag of billions of words (and, as a `Bag`, also a probability function):" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(333333, 588.124220187)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P1w = load_counts(open('count_1w.txt'))\n", "\n", "len(P1w), sum(P1w.values())/1e9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A third of a million distinct words with a total count of 588 billion tokens." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 23135851162),\n", " ('of', 13151942776),\n", " ('and', 12997637966),\n", " ('to', 12136980858),\n", " ('a', 9081174698),\n", " ('in', 8469404971),\n", " ('for', 5933321709),\n", " ('is', 4705743816),\n", " ('on', 3750423199),\n", " ('that', 3400031103),\n", " ('by', 3350048871),\n", " ('this', 3228469771),\n", " ('with', 3183110675),\n", " ('i', 3086225277),\n", " ('you', 2996181025),\n", " ('it', 2813163874),\n", " ('not', 2633487141),\n", " ('or', 2590739907),\n", " ('be', 2398724162),\n", " ('are', 2393614870)]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P1w.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Zipf plot for 588 billion word tokens is similar to the one for a million word tokens:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEaCAYAAADkL6tQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXhU5dn48e+djQBhX4IYIGyKoGyJoAgYcEMFca2itmKtWN9qa+2vdaktsWpr69b6aou0KtoqiFZflVLcA4hRBDdQEGWTsK+BACHb/fvjnIQhzCRnJjOTmcn9ua65MufMWZ5n5mTueZbzPKKqGGOMMf4kNXYCjDHGxC4LEsYYYwKyIGGMMSYgCxLGGGMCsiBhjDEmIAsSxhhjArIgYTwTkUwRWSAi+0TkoQidY52InBmJY0eSiNwrIjtEZEtjpyVeiUi+iPwrSudqlOvMN48i0l1ESkQkOdrpCIYFCWoumIPuB1b96NrY6YpBU4AdQGtV/UVDDyYiM0Tk3oYnK7JEREWkTx2vdwN+AfRX1S7RS1nDRfOLOV6IyBgReU9EikVkXYBtRojIBw05j6p+p6oZqlrZkONEmgWJwya4H1j1Y1PtDUQkpTESFkN6AF+p3YFZWw9gp6pu8/eiXTdHEkcsf/fsB54CflnHNucBc6OTnEamqk3+AawDzvSzPhtQ4DrgO2CBu/4U4ANgD/A5kOezT09gPrAPeAt4DPiX+1oeUBTo3DhB+3ZgNbATmA20r5WWa9y07AB+7XOcZOBOd999wFKgG/A48FCtc74O3BLgvRgBfAwUu39HuOtnAOVAGVAS4P2a4Z7vP24aPgJ6BzjPlFrHe93n/fh/wBduGl4A0gMcYzKwCHjE/SzWuOmfDGwAtgHX+GzfBngW2A6sB+4CktzX+rifW7H73r7grl/gvu/73XReXisNZwIHgSr39Rnxct0A49z3v9xN++d+3uNrqz8bd/lbYLbP8gZgcF3XjvtaAXCf+3kddN/vgHn2k452wBz3s9vtPs+qdfx73OPvA94EOvq8/n33M98J/JoA//N+Ptt1AV77BBjqPlfgx8A3btoeByTAfvk+n2v1Z5PiMQ8Br5+Ifj9G4ySx/gh0wfh8iM8CLYHmwLHuhXae+895lrvcyd2nEHgYaAaMdj9sr//stwAfAlnu/k8AM2ul5e9uOgYBh4AT3Nd/CSwDjgfEfb0DMAzYxOEvw47AASDTT37buxf594EUYJK73MF9fQZwbx3v4wxgl3vOFOA5YFY929/r5/1YDHR107MC+HGA/ScDFThfZMnAvThfhI+779/Z7vuf4W7/LPAq0Mp9P1cB17mvzcT58kgC0oGRPudRoE8d+Tjic42z6yafAF/M7uu9cL6UkoBjcL5oN/q8ttt9rb5rp8D9bAa4r6fWlWc/6egAXAK0cD+/F4H/83m9ACdIHufmswC4332tP04QHO2e62Gc6yakIOG+DxtxA4H7/s4B2gLdcQLZuADHrHm/8R8kAuWhzusnot+PkT5BPDzcf7gS959hT/XF5/Mh9vLZ9jbgn7X2fwPnl1p39+Jr6fPa83j/Z18BnFHrYix3/6mq0+L762kxcIX7/GtgYoD8rQDOcp/fBMwNsN33gcW11hUCk93nM6g/SPzDZ/k8YGU92/sLElf7LP8JmBZg/8nANz7LJ7nvUabPup3AYJwgcgin3aD6tRuAAvf5s8B03/fXZ7tQg0Q8XDf51BEk3G02AEOBK9z3aDHQDyc4v+bx2ikAfufzWp159vA/OxjY7bNcANzls/w/wDz3+W/x+bGCE7jLCD1IXAc8Wev68P1RMRu4PcAxa95v/AeJQHkIeP14eb8a8ojlesFou1BV27qPC2u9tsHneQ/gMhHZU/0ARuL8Y3bFuXD3+2y/Pog09ABe8TnuCqASyPTZxrf3zAEgw33eDedXiD/PAFe7z68G/hlgu65+0rse51eMV37TJyJ3+nQKmBbKMQLY6vP8IICq1l6XgVOCSuPI/Pnm7Vc4JbDFIvKliPywnjR6EQ/XjRfzcQLVaPd5AXC6+5jvbuPl2vF9P4LKs4i0EJEnRGS9iOzFqQZsW6tnUKA8dvU9t3vOnYHO5YG/9oiGvL9ejlPX9RNRFiS8UZ/nG3AielufR0tVvR/YDLQTkZY+23f3eb4fp7gMgHuBd6p17HNrHTtdVTd6SOMGoHeA1/4FTBSRQcAJwP8F2G4TzsXoqztO0bpBVPX3erhTwI+rVzf0uEHYgfPr2jd/NXlT1S2qer2qdsUpYfy1rh5NHsXDdePlM6gOEqPc5/M5Okh4uXZ8z1Vfnmv7BU5V6nBVbY0TsMAJ7PXZjPMjytlBpAVO9VXQRCQVJ99vhbJ/A9R1/USUBYng/QuYICLniEiyiKSLSJ6IZKnqemAJcLeIpInISGCCz76rgHQROd+92O7CqSOtNg24T0R6AIhIJxGZ6DFd/wDuEZG+bu+RgSLSAUBVi3AaEv8J/FtVDwY4xlzgOBG5UkRSRORynPrcOR7TEKytOPXaEadON8PZOO9vK/c9vhXn80RELhORLHfz3ThfaNVdE8ORzli9brYC2fX0NpoPjAGau9fSQpxG7w7Ap+42QV07HvJcWyucUuEeEWkPTPWYP4CXgPEiMlJE0oDfUcd3n4gkiUg6TruJuJ9VmvvyKOALVd0bxPnDIeD1E+kTW5AIkqpuACbi9CTajhPhf8nh9/JKYDhOA+5UnLru6n2LceoZ/4HzC2s/UORz+L8ArwFvisg+nMbI4R6T9jDOl+CbwF7gSZzGr2rP4NTZB6pqQlV3AuNxfrXtxKmCGa+qOzymIVhPAv3d4nOg0k043Yzznq8B3sepA3/Kfe1k4CMRKcH5DH6mqmvd1/KBZ9x0fi+UE8fwdfOi+3eniHwSIO2rcNrsFrrLe3Hew0Vu8A312gmYZz/+jHM978DJ3zwvmXPT9iXwE5zPezPOj4CiOnYZjROQ5uKUbg7i/F9BI3V9re/6EZFpHqpxQ1LdOm8iRETycRo9r65v2winYzTOr5FsVa1qzLSY+sXKdWOOJCJfAZeq6leNnZZosZJEE+BWUfwMp+eRBQhjQuBWOT3blAIEWJBIeCJyAk633mNwiuzGmBCoalk0GopjjVU3GWOMCchKEsYYYwKyIGGMMSaghBqdsmPHjpqdnR3y/vv376dly5b1bxiHLG/xyfIWn+Itb0uXLt2hqp38vZZQQSI7O5slS5aEvH9BQQF5eXnhS1AMsbzFJ8tbfIq3vIlIwCFREqK6SUQmiMj04uLixk6KMcYklIQIEqr6uqpOadOmTWMnxRhjEkpCBAljjDGRkVBtEsYkovLycoqKiigtLW3spIRVmzZtWLFiRWMnIyJiNW/p6elkZWWRmprqeZ+ECBIiMgGY0KdPQ0d2Nib2FBUV0apVK7KzsxHxMjJ2fNi3bx+tWrVq7GRERCzmTVXZuXMnRUVF9OzZ0/N+CVHdFJY2iQ2L6b7+JdiwOHwJMyYMSktL6dChQ0IFCBN9IkKHDh2CLpEmRJBosA2L0WcuIHvtc/DMBRYoTMyxAGHCIZTryIIEwLqFaMUhkqhCK8tg3cLGTpExMeWHP/whnTt35sQTTzxifWFhIddff30jpcpEgwUJgOxRkJxGJUmQnAbZoyg+WN7YqTImZkyePJl5846e52fevHmMGzeuEVJkoiUhgkSDb6brNoykya+zvudVyDWvUdolh/H/u5A/zlsZ3oQaE6dGjx5N+/btj1r/zjvvcOaZZzJjxgwuvvhixo0bR9++ffnVr37VCKk0kZAQQSIsDdfdhvFdj0uh2zAAvpfTjVF9OwJwqKKSkkMV4UiqMQ12+ROFvLhkAwDllVVc/kQhr3zqzMZ5sKySy58o5PXPNwGwt7Scy58oZN7yzQDs2l/G5U8U8vZXWwHYti/0brU7duwgNTWV6v+7zz77jBdeeIFly5bxwgsvsGHDhpCPbWJHQgSJcEtPTebmM/oyorcTJJ56fx15DxSwfd+hRk6ZMbHjzTff5Oyzz65ZPuOMM2jTpg3p6en079+f9esDDgdk4khC3CcRaSN6d6DkUDmdWjUDYPu+QzXPjYm2F244teZ5anLSEcvN05KPWG6dnnrEcvuWaUcsd26VHnI6/vvf/3LrrbfWLDdrdvh/Ijk5mYoKK30nAitJeDCoW1t+eU4/AHaUHGLsgwVMX7D6yI02LIaFD9XffdbrdsbEMFXliy++YPDgwY2dFBNhVpIIUsu0FK4b1ZMzTsgEoPhgOWmbltB85kVQWeb0jrrmtZq2jSNsWOzchxFouw2Lne632aP8729MI5k0aRIFBQXs2LGDrKwsbr75ZoYMGWL3bzQBMRMkRKQX8Gugjape6q5rCfwVKAMKVPW5Rkwi4BTnbznzuJrlP81bSbcv/8UNlWWIVjoBYN1C/1/y6xY6r/vbrr4AYkwjmjlz5hHL99577xFdXydPnszkyZNrlufMmROtpJkIi2h1k4g8JSLbRGR5rfXjRORrEflWRG4HUNU1qnpdrUNcDLykqtcDF0QyraG6JCeLbkPOQpLTQJKpSk51SgL+uPdjIMk192PU8BdAjIlRd911F1dccUVjJ8NEQaRLEjOAx4Bnq1eISDLwOHAWUAR8LCKvqepXfvbPApa5zysjm9TQDO3eDrpfCAO7sn3ZO9y4qDmXbz2Gy7r52bjbMKeE4K9KqTqAVJckAgUaY4yJoogGCVVdICLZtVYPA75V1TUAIjILmAj4CxJFOIHiM2K9kb3bMFpmDmV0s7Wc6bZXbN1bSqv0FFqkpRyxnd9qpEABxNopjDGNqDHaJI4FfO+yKQKGi0gH4D5giIjcoap/AF4GHhOR84HX/R1MRKYAUwAyMzMpKCgIOWElJSUN2h9gYDJ8/vFGAB5cUsqe0ip+d1pzkjw38OXA6gOwuoDWxSsZ9PlvSKqqoCophc8H3cPeNv1oXbyStnuWs6ftiext08/TUcORt1iV6Hlr06YN+/bta+ykhF1lZWVC5gtiO2+lpaVB/b80RpDw922pqroT+HGtlfuBa+s6mKpOF5HNwIRWrVrlNGTy8XBPXp6RvYste0sZO7ArAN9s3UffzCDGmF+41GmjoIpkrWRo+/2Q3QKeyQ+6gTveJmYPRqLnLT09PebmJgiHWJxzIVxiOW/p6ekMGTLE8/aNUYVTBPjW2GcBmxpywFid4zo3uz3j3QBR8PU2znpkAe+u3Or9AP4auq2B2xgTRY0RJD4G+opITxFJA64AXmvIARs8wF8UnJzdnjvO7cfIPp0AWLO9hINl9bTFV7dTjP314RJDXT2kjIkBBQUFfPDBBw06RkZGRphSU7+CggLGjx/v97VJkyYxcOBAHnnkkailJ9ZEtLpJRGYCeUBHESkCpqrqkyJyE/AGkAw8papfNuQ8qvo68Hpubm7MDmzfslkKN5zeG4CqKmXKP5fSKaMZM6ecUveOtRu66+ohZUwMKCgoICMjgxEjRjR2UvyqrKwkOTm53u22bNnCBx984HcMqoqKClJSYuY2s4iKaElCVSep6jGqmqqqWar6pLt+rqoep6q9VfW+hp4nHkoSvpKShN9fdBI3n+HMyV1RWcXyjUGkvdswGPUL568N82H8CfN1ceGFF5KTk8OAAQOYPn16zfp58+YxdOhQBg0axBlnnMG6deuYNm0ajzzyCIMHD2bhwoVMnjyZl156qWaf6lJCSUkJZ5xxBkOHDuWkk07i1VdfrTMNf/rTn3j00UcB+PnPf87YsWMBZ7jyq6++GnBu+jvppJM48cQTue222444529/+1uGDx9OYWEh8+bNo1+/fowcOZKXX37Z7/nOPvtstm3bVpOPvLw87rzzTk4//XT+8pe/sH37di655BJOPvlkTj75ZBYtWgTAzp07mThxIkOGDOGGG26gR48e7Nixg3Xr1h0xadODDz5Ifn4+AKtXr2bcuHHk5OQwatQoVq50pimYPHkyP/3pTxkxYgS9evU64n3805/+xEknncSgQYO4/fbbWb16NUOHDq15/ZtvviEnJ6fO99QTVU2YR05OjjbEe++916D9Q/Xch+u1x21z9PMNu4Pb8buPVO/JVM1v5/z97qOAmzZW3qIh0fP21VdfBbdTENeFVzt37lRV1QMHDuiAAQN0x44dum3bNs3KytI1a9Ycsc3UqVP1gQceqNn3mmuu0RdffLFmuWXLlqqqumvXLi0uLlZV1e3bt2vv3r21qqrqiG18FRYW6qWXXqqqqiNHjtSTTz5Zy8rKND8/X6dNm6YbN27Ubt266bZt27S8vFzHjBmjr7zyiqqqAvrCCy+oqurBgwc1KytLV61apVVVVXrZZZfp+eeff9T51q5dqwMGDKhZPv300/XGG2+sWZ40aZIuXLhQVVXXr1+v/fr1U1XVm2++We+8805VVZ0zZ44Cun379qOO98ADD+jUqVNVVXXs2LG6atUqVVX98MMPdcyYMTXv3aWXXqqVlZX65Zdfau/evVVVde7cuXrqqafq/v37j3jv8/Ly9NNPP1VV1TvuuEMfffTRo/Ll73oClmiA79WEKC+JyARgQp8+fRo7KSG5YHBXFOWkY52G92VFxfTNzCA9tZ4icV3DfJimKwLXxaOPPsorr7wCwIYNG/jmm2/Yvn07o0ePpmfPngB+JyWqi6py5513smDBApKSkti4cSNbt26lS5cufrfPyclh6dKl7Nu3j2bNmjF06FCWLFnCwoULefTRR/n444/Jy8ujUyen3e+qq65iwYIFXHjhhSQnJ3PJJZcAsHLlSnr27Enfvn0BuPrqq48oHdXl8ssvr3n+9ttv89VXh2/v2rt3L/v27WPBggU8+6xz//D5559Pu3bt6jxmSUkJH3zwAZdddlnNukOHDk9LcOGFF5KUlET//v3ZunVrzbmvvfZaWrRoARx+73/0ox/x9NNP8/DDD/PCCy+weHHDS5IJESQ0Dtok6pLRLIWrhvcA4EBZBdc8vZiRfTry6KR6uqnVvku7eQenisHaKpq2MN+9X1BQwNtvv01hYSEtWrQgLy+P0tJSVNXTAH8pKSlUVVUBTmAoKysDYPbs2Wzfvp2lS5eSmppKdnY2paWBJ0Gq3ubpp59mxIgRDBw4kPfee4/Vq1dzwgknsGrVqoD7pqenH9EOEerAhC1btqx5XlVVRWFhIc2bNz9qO3/H930fgJq8VlVV0bZtWz777DO/5/Qdgt350U/A9/6SSy7h7rvvZuzYseTk5NChQwePOQsstu9i9ije2iTq0iIthcevHMr/jHEaufeWlrOsKEC+fHs/jbsf5t0O797nDBRobRRNl79ecQ1QXFxMu3btaNGiBStXruTDDz8E4NRTT2X+/PmsXbsWgF27dgHQqlWrI24ky87OZunSpQC8+uqrlJeX1xy3c+fOpKam8t5773mapGj06NE8+OCDjB49mlGjRjFt2jQGDx6MiDB8+HDmz5/Pjh07qKysZObMmZx++ulHHaNfv36sXbuW1aud4f5rD17o1dlnn81jjz1Ws1z9JT969Ghmz54NOHNu7N69G3Bu9t22bRs7d+7k0KFDNYMgtm7dmp49e/Liiy8CTgD4/PPP6z33U089xYEDB4DD7316ejrnnHMON954I9deW+ctZp4lRJDQGL1PIlSn9u5Avy6tAfj7gjVc+NdFbNpz0P/G1Y3YB3fa/RPmMN/ODQ00btw4KioqGDhwIL/5zW845RSnR16nTp2YPn06F198MYMGDaqpipkwYQKvvPJKTYPv9ddfz/z58xk2bBgfffRRza/xyy+/nCVLlpCbm8tzzz1Hv371jx4watQoNm/ezKmnnkpmZibp6emMGuWUlI455hj+8Ic/MGbMGAYNGsTQoUOZOHHiUcdIT09n+vTpnH/++YwcOZIePXqE9L48+uijLFmyhIEDB9K/f3+mTZsGwNSpU1m0aBFDhw7lzTffpHv37oBTEqpuPB8/fvwR+X3uued48sknGTRoEAMGDKi3EX/cuHFccMEF5ObmMnjwYB588MGa16666ipE5IhZAxtCqosviSA3N1eXLFkS8v6xeOfu3tJyFq7awfkDjwHgg9U7GNq93dHtFbWHGh93vxM43KqnWMxbuCR63jIzMznhhBMaOylhF8t3JTeUb96ys7NZsmQJHTt2jMq5H3zwQYqLi7nnnnv8vr5ixYqjricRWaqquf62T4g2iUTWOj21JkBs21vK5Kc+5poRPfj1+f2P3ND3/onmHZyqJ9+hO4wxCe+iiy5i9erVvPvuu2E7ZkIEiXjv3eRV59bpPH3tyfTp7PQzL9p9gD0HyjnR7RVVc+PdwoeOrHr6/Hm676yAJeuc0kXzDkeUMowxkbNu3bqonau6B1o4JUSQiPfeTcE4rc/hIuuf3/6Gecu3UHjHWFqlpx7eyLd3S1IyfPo8PSvLYO0/ccZXVJAkSEqBIVdDl0EWPIwxfiVEkGiqfjO+PxMHd60JEG98uYW84zvRzLfqqbgIlj6DUN325P7VKieILHnKXe8GD8QJLOc9BLmTo5shE5DX7qbG1CWUNuiE6N2USF1gg9GmeSqj+jo3Dq3cspcb/rmUfxa63Qire7cMmgTJaVTVjNBe+281nyBSVQH/uRVmXQVzfm7daRtZeno6O3fuDOkf3JhqqsrOnTtJT08Par+EKEk0peqmQPp1ac2/rhtOTg/n7s7lG4tJThJOcEsV6959ll4Dcg9XK235DD59HirLgSoOlyRcWgkr3cnsP30OJs+xKqhGkpWVRVFREdu3b2/spIRVaWlp0F9Y8SJW85aenk5WVlZQ+yREkDCOkX0Pt1f8cd5K1mzfz/xf5pHSbRjf9ThAr9y8I3cYdOXh3lAHd0LpXih8DKoqOSJgVB5yeksN+YETXEq2Q0Znp5RigSPiUlNTa4a+SCQFBQVBTX4TTxIpbxYkEtRjk4aybud+UpKTqKpSFhaVM6KiirQUnxpGf/Nt9zsfPn8ePvkXVJUfXr9xqfPwZSUMYxJeQrRJmKO1aZHKoG5tAXj/2x08ubyMN7/aUv+O3YbB+D/DtXPh2HqGGa4uYVibhTEJy4JEEzD6uE7cMSyd8050bsp7/5sdrNyyt+6dug1z7tpOTqt7u41LYcZ4CxTGJKiEqG5qKjfTNcTx7ZNJShJUlXv/8xXN05J5+cYRdXer7DYMJv/HqX5CnPsptnwG6wth+8rD29kw5cYkrIQIEta7yTsRYdaUU9i1vwwR4WBZJS8t3cDlJ3c/sr2imr92iw2LYcb5TnAA58a85g0fktgYE3usuqkJatsijV6dnKE95i7bzG9e/ZJlwU6fOvk/TiO3JDvdZefcAtNG2n0VxiQYCxJN3MVDj2XOzSNr7q947fNNfLN1Xz174QSKY3OcO7cBUNiyzLmD+8lznBvxLFgYE/csSDRxIlIzQGBZRRW//88K/vzON952zh4Ffts0qpwb8Z4+zwKFMXHOgoSpkZaSxNyfjWLqeGcY8q17S3l60VrKK6v879BtGIz4aeADVpXDrEmwZEb4E2uMiYqYDhIi0l9EZovI30Tk0sZOT1PQvmUanVs7wwn836cb+cPclWwpDjzvMGfdDeP/4lQ9dTmJo8aE2r8D5vwMnj7XShXGxKGoBwkReUpEtonI8lrrx4nI1yLyrYjc7q4+F/hfVb0R+EG009rUTRndi//eMopu7VsAMGPRWr7d5qe9IncyXP8u/Ph950a8owYPBNZ/AE+eBY8Pt5KFMXGkMUoSM4BxvitEJBl4HCco9AcmiUh/4J/AFSLyAGB9LKNMROjt9oLac6CMR97+hheXFNW9U+5kN1AEuLS2r3RKFr/Pgj8PtAZuY2Jc1IOEqi4AdtVaPQz4VlXXqGoZMAuYqKrbVPUnwO3Ajign1fho2yKNd39xOjeNdW5YXL6xmGc+WEeFv/aK3Mlw3RvQY0TgA5btgz3rnQbuJ8+20oUxMUoaY4x6EckG5qjqie7ypcA4Vf2Ru/x9YDjwIHAn0BL4m6q+7+dYU4ApAJmZmTmzZs0KOV0lJSVkZGSEvH8sC3feZq0sY9HGcu4f3YKWqYHv2j5m0xv0XfU3n0mP/M9koUBlcks2dT2Htb2vCSot9rnFJ8tb7BgzZsxSVc3191qs3HHt71tGVXUdbgAIRFWni8hmYEKrVq1y8vLyQk5EQUEBDdk/loU7b6efrmwqLuXYts1RVR5+axUXDTm25ia9w/Jgw0XO0B5r5sOuNUcdS9xHUuV+emx4mR6pe+AH3ufqtc8tPlne4kOs9G4qArr5LGcBm7zurKqvq+qUNm3ahD1hxj8R4di2zQH4btcBnl60jkWrd/rfuHpk2Z9+6vSE6ng8NG8f+OBr3oW/DLG2CmNiQKwEiY+BviLSU0TSgCuA17zu3FSnL40VPTq0pOCXeUw62Ynz7329jX99uJ7KKj9VmbmT4abFcNtaJ2D4LUQCu9c4vaH+NtKChTGNqDG6wM4ECoHjRaRIRK5T1QrgJuANYAUwW1W/jHbaTOg6ZjQjJdm5nP7zxWae+WBd/XMy506G6950xoBKa+V/m63LnGDxuw7w+652v4UxURb1NglVnRRg/VxgbojHtFFgY8gDlw5k5/4yUpKTKKuo4ndzvuRHI3uR3bHl0Rt3GwZXPO88f/Yip6rJn6oKKKs4fL9FUgp0OgHGPxy5jBhjYqa6ySQQEaFjRjMAvtxUzCufbGTdzv317/iDV+Ck73k7SVVFTSmj5+pnGpBaY0xdEiJIWJtE7BrSvR3v3zaWvOM7AzBr8Xc8/9F3gauiLvk7XPcWtO/l+RzdN7zs3Jxn91oYE3YJESSsd1Nsa9fy8BSob321lTe+3FL/jHjVPaHadIeU5tR7qZbtc+7k/lNva7MwJoxi5T6JBrHpS+PHP67JZd+hCgB27S/jnjlf8YuzjyOrXYujN86d7DyqvTUVljwNh/ZCoJvzDuxw2iwQ6DUmqPstjDFHs5KEiSoRoXV6KgDLNhbz9oqtHCir9LbzWXfDHd9B/h7oNbZmtf+KK3UawfPbwD2dnABjjAlaQgQJE59OP64TH95xBsdlOt1f//z2Kl5cssHbzj94xamOCtR11ldlGSz6swUMY0KQEEHCGq7jV8tmTo1nRWUVhbr/7yUAACAASURBVKt38nnRHu87506GO4tYddz/QFKqt31qAkZb+Lf1mDamPgkRJKy6Kf6lJCcxa8op3HW+Myvet9tKuOn5T9i6t44Jj1ybu54Dv93hdJ+VZI9nVFg22yldPDasASk3JrElRMO1SQwiQnqq8yW/YvNePl63i5SkOnpB1XbJ350H1H1jXm07vnaCBUBSGpz3wJEN5sY0YQlRkjCJZ8Kgriz41Rg6uDfl3fbSF7z2uecxH502i/xiOO0W54vfq6oypyvt3e3svgtj8BAkROTfInK+iMRsQLE2icTULMUpVZQcqmDVtn1s3nMw+IOcdTf8drsTMLzezQ2gVU6wyG8D92cHf15jEoSXL/6/AVcC34jI/SLSL8JpCpq1SSS2jGYp/PvHI7huZE8A3v9mBzc9/wm795cFd6BL/n64dIHXtgugdLcTLH6fFdz5jEkA9QYJVX1bVa8ChgLrgLdE5AMRuVZEPHYpMaZhkpKkZpTZDbsP8M3WElo0C+KL3tdZd0P+Lmf4j2ZB/LAo2+cEi/w28LuOoZ3bmDjjqeFaRDoAVwPfBz4FngNGAtcAeZFKnDH+TBrWnctyskhJTqKisoqHlpRyqNMWzhnQJbgDdRvm3JxX7bFhTiO2F1Xlhxu709vB7euCO7cxccJLm8TLwEKgBTBBVS9Q1RdU9WYgfiZxNQmlulSxa38ZJeXqf4KjYN202KmOatM9uP2qq6OsK61JQF5KEo+pqt++hIEmzjYmWjq3Tuc3p6Qz5kSnFDFr8XcUrtnJHy4+iRZpIfbw/vky5++GxfDkOMDjsCE1XWnFGTrEmATgpeH6BBFpW70gIu1E5H8imKagWe+mpi1JpGZU2X2lFezaX0Zz936LemfHq0u3YYfbLiSYgKPuECCdQz+3MTHCS5C4XlVrfhap6m4gpsYzsN5Nptr1o3vx7A+HISLsKy1n4uOLWPjN9oYdtNswmLrTqYrKLw5iCJBDbkN3u4ad35hG5CVIJInP4P8ikgwEcXeSMdFVfbnuLHG6yFaPOtugUoWv3+5wgkXXHI87VDnB4t5jwnN+Y6LIS5B4A5gtImeIyFhgJjAvsskypuGyO7bk1Z+cxqBuTm3pQ2+u4tbZn4WnkRtgyrtOsGiZ6W37igNOsLCBBU0c8VLRehtwA3AjzvwubwL/iGSijAkX3xnwUpKFZilJJLvjQVVVKUnBjA0VyC9XOX83LHYnPKrHstnOo9dYmxTJxDwvN9NVqerfVPVSVb1EVZ9QVY/dPYyJHbeceRy/v+gkADbsOsCZD89n6fpd4TtBt2HBVUNVT4pk062aGOblPonTROQtEVklImtEZK2IrIlG4kSku4i8JiJPicjt0TinSWzVJYuSQxW0b5nGMW2aA4SvCgoOV0P5zJ5XpyfPOnxjnjExxkubxJPAwzh3WJ8M5Lp/Q+J+4W8TkeW11o8Tka9F5FufgHAc8B9V/SHQP9RzGlPbCce05qUbR9C1rRMkfv7CZ/z6lWXhPUn1SLQpfubv9qd6yA9jYoiXIFGsqv9V1W2qurP60YBzzgDG+a5we0w9DpyLEwwmiUh/nCFArhCRd4H3GnBOYwKqqlK6tm1eEzDAmSkvbO7a7AQLryxYmBgi9XULFJH7cYbMfBk4VL1eVT8J+aQi2cAcVT3RXT4VyFfVc9zlO9xNy4HFqrpARF5S1Uv9HGsKMAUgMzMzZ9asWaEmi5KSEjIyEnOkEcubd9/urmTaF4f42dB0urUK/wj5IwsmkoTTC6Ra9XPf/0YFqoD3814NexpigV2TsWPMmDFLA42g4aV303D3r+8BFPBY4erJscAGn+Ui97zTgHwRuRJnBNqjqOp0YDpAbm6u5uXlhZyIgoICGrJ/LLO8edd2wx4G7F7FxWcPpWWzFMoqqkhLCWOwyHNLFX5KC7UDhwB5BROBJMjfHb40xAC7JuNDvUFCVcdEIR3++iGqqi4Hjio9HLWzyARgQp8+fcKeMNP0DO7Wlmd+6AzWV1WlXDG9kOG9OnDbuDBPpZIfOFgczb0hL5hqK2PCwEvvpkwReVJE/usu9xeR68KcjiKgm89yFhDEXJXGREZ5VRW52e05LtOpOlBVDlWEuQd4frEzPpQfR/16svYKE2VeytAzcO667uourwJuCXM6Pgb6ikhPEUkDrgBe87qzjd1kIqVZSjJ3nncCFw1xZqV77fNNnPnwfDbsOhDeE1XfY5HW6ojVAVsMLViYKPESJDqq6mycNjRUtQLPYycfTURmAoXA8SJSJCLXuce8CScYrQBmq+qXQRzTRoE1UdGldTo53dvV9IQ6UFYR3hPcWWQ9oUxM8dJwvd+dmU4BROQUIORvY1WdFGD9XGBuiMd8HXg9NzfXBsUxETW8VweG9+oAQGl5Jef8eQFXDuvBjXm9w3siN1BU5rfxNn1kdaCwNgsTZl5KErfiVP30FpFFwLPAzRFNVZCsJGEaQ0WVcnb/Lgzp7gwgeKiiMuztFe/nvWolC9OovIzd9AlwOjACZ6C/Aar6RaQTFgxrkzCNIaNZCr8Z359T3JLF3xes4ZxHFlB8sDz8Jwu2hGCjzZowqbckKyI/qLVqqIigqs9GKE3GxKXB3dqxt7SCNs2d+SuKD5bXPA+LoLrMcni0WauCMg3gpbrpZJ/HKCAfuCCCaQqaVTeZWDCyb0fuPO8EALbtLWXkH99l5uLvwn+i6hnyPG/fBu5uH/50mCbBS3XTzT6P64EhxNjMdFbdZGJNs9RkLs3J4lS3KmpfaTllFWEcDwqCCxZaaW0VJiShjDVwAOgb7oQYk0jaNE9l6oQBZHdsCcC9c1Yw4X/fpzycAwdWCyZYWMO2CZKXNonXOXxPTxLOKK2zI5moYNmwHCbWjTupC30zM0hNdn6Xbd93iE6tmoX3JMG0WdgQH8YjLyWJB4GH3McfgNGqGlMTAFl1k4l1Y47vzI9G9QLgq017Oe3+d/nvss2ROZmVKkwYeRngb340EmJMU3FMm3SuGdGDEb07Ak6pom2L1JpSRlgEW6rw3ccYH14G+NsnInv9PPaJyN5oJNKYRNKuZRq/Pr8/bVqkoqrc8sKnXDH9Q+qb2yUk+cUgyR63tVKFOZqXny6PALfjzPmQBdwG3KuqrVS1dSQT55V1gTXx7Ien9eQHp/bAvf8o/IMHTt0VXBXUkhnhPb+Ja16CxDmq+ldV3aeqe1X1b8AlkU5YMKxNwsQrEeGMEzKZOPhYAN5ZsY28Bwv4cE1DZggOIL8YOh5f/3ZzfmalClPDS5CoFJGrRCRZRJJE5CoaMAqsMSawId3b8pMxfcjp0Q6A7Qeqwttt9qbFwZUqTJPnJUhcCXwP2Oo+LnPXGWPCrENGM2496zhSk5Mor6zioSWl/M9zIU8nH5jXeyusB1ST5+WO63WqOlFVO6pqJ1W9UFXXRSFtxjRpKUnC5f3SuHZENgDllVWs2V4S3pNYqcLUw0vvpuNE5B0RWe4uDxSRuyKfNO+s4dokIhFhSOcURvRxuso+9+F6zn5kAd9s3RfeEwVTqjBNjpfqpr8DdwDlAO4w4VdEMlHBsoZr0xSMH9SVO847gT6dnfm2v96yj4pwtldY9ZPxw0uQaKGqi2utC/OcjcaY+nTMaMZ1I3siIuwtLefy6YX85lXPs/x6k18MLTM9bGeBoqnwEiR2iEhvDk9feikQofEEjDFetGqWwv0XD2Sy215RfLCctTv2h+fgv1xl1U+mhpcg8RPgCaCfiGwEbgF+HNFUGWPqJCKMO7ELx3dpBcDj733LuD8vYPu+Q+E7iddAcX92+M5pYk6dQUJEkoBcVT0T6AT0U9WRqro+Kqkzxnjyo1E9+cPFJ9WMLLt0/W4qq8IwzIeXQFG620oVCazOIKGqVcBN7vP9qhrmbhV1E5FRIjJNRP4hIh9E89zGxJPOrdK5eGgWABt2HeDyJwr5yzvfhOfg1vupSfNS3fSWiPw/EekmIu2rH6GeUESeEpFt1V1qfdaPE5GvReRbEbkdQFUXquqPgTnAM6Ge05imJKtdcx6dNITvn9IDgKLdB1gXjvYKCxRNkpcg8UOcdokFwFL3saQB55wBjPNdISLJwOPAuTiTGk0Skf4+m1wJzGzAOY1pMkSE8046pqbq6f7/ruSivy6itDwMo+lYN9kmx8t8Ej3DeUJVXSAi2bVWDwO+VdU1ACIyC5gIfCUi3YFiVbVhyY0JwW/H9+fLTXtJT3WGDF+wajun9elIcpKEdkCvc1XY7HcJQQKNYS8iv1fVO93nZ6nqW2E7qRMk5qjqie7ypcA4Vf2Ru/x9YLiq3iQidwNvqKrfNgkRmQJMAcjMzMyZNWtWyOkqKSkhIyMj5P1jmeUtPoU7b2v2VPK7D0v5Qf80xnZPbfDxRhZMpHq2Ct+QU/2tUgm8n/eq333tc4sdY8aMWaqquf5eq6skMQ64033+RyBsQcIPfz9pFEBVp9a1o6pOF5HNwIRWrVrl5OXlhZyIgoICGrJ/LLO8xadw5+10Vbr22ULe8Z1JT03mq017aZWeQrf2LUI7YF6x3xJF9T90CpBXMBHadIefLztiG/vc4kMY50tskCKgm89yFrDJ6842LIcx3jj3VxxTU/X021eXc81Ti6lqSHdZL1VKxd9ZO0Wcqqsk0VlEbsX5UVD9vIaqPhzGdHwM9BWRnsBGnLGhPA9HLiITgAl9+vQJY5KMSXyPXTmUjXsOkpQkVFYp85ZvYdyJXYJvr7B2ioRVV0ni70ArIMPnue8jJCIyEygEjheRIhG5TlUrcO7HeANYAcxWVc+D0lhJwpjQdGmTXjPB0VtfbeEnz3/Cuyu3hX5A6yabcAKWJFT17kicUFUnBVg/F5gbyjGtJGFMw50zoAtPTc5lzPGdAShcvZOsds2Db6/I999OceQ2bSBAg7aJLbHSJtEgVpIwpuFEhLH9MhFxqp5uf/kLfvnS56EdzMNd2iMLJoZ2bBNVCREkbNIhY8IrOUmYef0p3HvhiQDsP1TB/326MfgG7joCRTLYjXdxICGChJUkjAm/rm2b06ez0/z470+KuOWFz/hqcwj3tFo7RVwL2CZRuzdTbWHu3dQg1iZhTGRdPbwH/bq05sRjnS/z/y7bzMBubTm2bXNvB/DTTnFU/ynr+RST6ipJVPdiygVuBI51Hz/GGV8pZlhJwpjISkoShvV0xvU8UFbBHa8s46E3vg7uILUCgN+KKytRxJyAQUJV73Z7OHUEhqrqL1T1F0AOzs1uxpgmqEVaCv/56ShuO7cfAJv2HOSVT4u8tVfkF0PXnHq2sUARS7y0SXQHynyWy4DsiKQmRNZwbUx0Hdu2OZmt0wGYufg7bvv3MrbuK/W285R3Ib+YOsektUARM7wEiX8Ci0UkX0SmAh8RY3M7WHWTMY3n52cex8s3juCYNk77xPMffcfm4oP17hdo4L8aFihiQr1BQlXvA64FdgN7gGtV9Q+RTpgxJj4kJUlNg/a2vaX8bs6XPPfhd952rq+h2gJFo6tzPgl3jusv3CG9P4lOkowx8apz63Te+vnptGuZBsDyjcWs3l7CBYO6IhJgPKj67tC2Xk+Nyssc15+7E//ELGuTMCZ2dGvfgoxmzu/P5z5azz1zVrC/rJ5Z8axEEbO8tEkcA3wpIu+IyGvVj0gnLBjWJmFMbLr3wpOYfcMpZDRLQVV57N1v2Lo3QAO3BYqYVO/0pUBEBvozxiS+5CShVydnhrZVW0t49J1v6ZDRjEnDAlROWNVTzPHScD0fWMnhm+tWuOuMMcaz47u04p1fnM73cp35xb7YXsGcLzZx1BTKVqKIKfUGCRH5HrAYuAz4HvCROye1McYEpVv7FjUTGr23oYLH3v2WSn834VmgiBle2iR+DZysqteo6g+AYcBvIpssY0yiu3lIM2ZcO4yU5CQOVVTy+7kr2ObbXmGBIiZ4CRJJquo7VdVOj/tFjfVuMib+JInQpY1z1/Yn6/fw9KK1fL1135EbWaBodF6+7OeJyBsiMllEJgP/IcQZ5CLFejcZE99O7d2B928by6i+nQCY/fEG5i7b7LRXWKBoVF4arn8JTAcGAoOA6ap6W6QTZoxpWqrHglJVXly6gX8vLTp8A54FikZT13wStwCLgE9V9d/Av6OWKmNMkyXizIq3r7QCgJ0lh/jz29/w019so9NDnQPvaN1jI6KukkQW8Bdgm4gUiMjvReR8EWkfpbQZY5qolOSkmqE9Fq/dxUtLiyg+WG4likZQ13wS/09VRwBdgDuBXcAPgeUi8lWU0meMaeLOPekYCu8YS5/Ozk15j41eQkVdO1igCCsvDdfNgdZAG/exCWe48IgTkSQRuU9E/ldEronGOY0xsadtC6dUUVZRxX+WbeHeoR/UvUN+GwsWYRIwSIjIdBFZBLwAnAp8AFymqrmqem2oJxSRp0Rkm4gsr7V+nIh8LSLfisjt7uqJOFOmlgNFoZ7TGJMY0lKSeP2m0/jVuOMhv5gKnGlQA86JZ4GiweoqSXQHmgFbgI04X9J7wnDOGcA43xUikgw8DpyLM3/2JBHpDxwPFKrqrTjzbBtjmriU5CRapDl9bl46b1ndM9yBBYoGkqPGTfF90el/NgAY4T5OxGmbKFTVqSGfVCQbmOPOU4GInArkq+o57vId7qYbgDJVnS0iL6jq5X6ONQWYApCZmZkza9asUJNFSUkJGRkZIe8fyyxv8cnyVr/SCuXM9y8mmaqaddUzV1R/u1XiYSa8MIq3z23MmDFLVTXX32t1BomajUSygNNwAsV4oIOqtg01QX6CxKXAOFX9kbv8fWA48Cvgf4EDwEpVfbyu4+bm5uqSJUtCTRYFBQXk5eWFvH8ss7zFJ8ubdxX3ZZFUvg/hcJA4SpS6yMbb5yYiAYNEXfdJ/BQnKJyG0yawCCgEngKWhTuNftapqh4Arqt3Z5EJwIQ+ffqEOVnGmHiR8usiyiurkHvakYJTijjqi8XupQhaXW0S2cBLwDBV7aWq31fVv6rq5+6MdeFUBHTzWc7C6UVljDGepSYnkeITBPzWk1gbRVDquk/iVlV9SVU3RyEdHwN9RaSniKQBVwCeZ7+zsZuMMUfIL/ZfPVHzun1XeBX10VxFZCZOtdXxIlIkItepagVwE/AGsAKYrapfBnFMGwXWGHMkn0BRHRyOCBwWKDyJepBQ1Umqeoyqpqpqlqo+6a6fq6rHqWpvVb0vyGNaScIYczS36kk4suqp+t6KCgsU9YqpeSFCZSUJY0xAPoGiWvXzZLASRT0SIkhYScIYUyc/PZqqu8pWlyiWrt8d7VTFhYQIEsYYU686ur4mA4Oezo5aUuJJQgQJq24yxngSoEQB7k1j+W144I2V3DPnK7zcaNwUJESQsOomY4xndQQKgJ8XDmf/oYqaWfGaerBIiCBhjDFBqaPqKQX4w7JRAKzfuZ8Jj73Pl5uabi1FQgQJq24yxgStjkAhAPlt2FFSRmUVdMxoBjTNUkVCBAmrbjLGhKSecZxyns5m7k9Hktk6HYCfzfqMB9/4OhopixkJESSMMSZk9QQKudsZ8LqisooWacmkpx7+2qyqSvyShQUJY4ypb2TY/DakJCdx/yUD+ckYZ7TpxWt3cd6jC1m7Y38UEth4EiJIWJuEMabBPAQKoKbXU3llFa2bp9LFrYqqTNBSRUIECWuTMMaEhcdAAXBan47MvuFUmqclU1WlfO+JQqbNXx3hBEZfQgQJY4wJmyACRbXSikp6d2pZU6qoUk2YkoUFCWOMqS3IQNEiLYU/XTqIC4ccC0DhpgrG/+/7bNtXGqkURo0FCWOM8SeEEkW1lqlCz44t6NjSub+irCLck3lGT0IECWu4NsZERIiBYnDnFP56VQ5JScKBsgrOemQ+//pwfQQSGHkJESSs4doYEzENKFGAU4rI7dGe47u0qlmOp/aKhAgSxhgTUQ0IFG1bpPHQ9wZxcnZ7AP5WsJoLHnufA2UV4UxhxFiQMMYYLxpYoqh2XGYGJ2e3p0VaCgAHyyobmrKIsiBhjDFehSFQnHvSMeRfMACArXtLGXH/O7z2+aZwpC4iLEgYY0wwwlSiABCBsf0yGZzljA+1/1BFzI0HFdNBQkTyRGShiEwTkbzGTo8xxgD1BoqRBRM9HaZzq3Qe+t4gundoAcDU177ksicKY6phO+pBQkSeEpFtIrK81vpxIvK1iHwrIre7qxUoAdKBomin1RhjAqpnzuxgShTVRvXtyNn9M0lOcsaHKj5YHmLiwqcxShIzgHG+K0QkGXgcOBfoD0wSkf7AQlU9F7gNuDvK6TTGmLqFseoJYOLgY7nh9N4ALN9YzCm/f4f5q7aHmrqwiHqQUNUFwK5aq4cB36rqGlUtA2YBE1W1+jbF3UCzKCbTGGO8qWfO7FBKFAAdMtK4cMixDOnutFfs3l/WKO0V0hjT8YlINjBHVU90ly8Fxqnqj9zl7wPDgXeBc4C2wN9UtcDPsaYAUwAyMzNzZs2aFXK6SkpKyMjICHn/WGZ5i0+Wt/gxsmCiU83E4SBR/e1aCbyf92rIx1ZV7l9cSrMU4dac9Aak0r8xY8YsVdVcf6+lhP1soRE/61RVXwZermtHVZ0uIpuBCa1atcrJy8sLOREFBQU0ZP9YZnmLT5a3OJJXXFNqUJwvteovthQgr2Bi/dVTAagqu1pvJDlJyBtyLKrK9pJDdG4V/oBRW6z0bioCuvksZwGeOw7bsBzGmJgQ5jaKaiLCJTlZNaPMvvHlVkb/6T0+37AnpOMFI1aCxMdAXxHpKSJpwBXAa153tgH+jDExI7+YOu+hDjFQ+Drx2NZcPbwHA7q2BmBLcSmRajpojC6wM4FC4HgRKRKR61S1ArgJeANYAcxW1S+9HtNKEsaYWFJv+0MDA0VWuxbcNb4/KclJHKqo5IrphXwaoVJFY/RumqSqx6hqqqpmqeqT7vq5qnqcqvZW1fuCOaaVJIwxMSdCVU+1pSYlcc+FJzK0e7uwHK+2WKluahArSRhjYlIUAkVSkjCqb6cGHyfg8SN25CiykoQxJmZFqUQRKQkRJKwkYYyJaXEcKBIiSFhJwhgT8+I0UCREkLCShDEmLsRhoEiIIGGMMXEjzgJFQgQJq24yxsSVOAoUCREkrLrJGBN34iRQJESQMMaYuBQHgcKChDHGNKYYDxQJESSsTcIYE9e8BIpGChYJESSsTcIYE/e8zDXRCIEiIYKEMcYkhBgMFBYkjDEmlsRYoLAgYYwxsSaGAoUFCWOMiUX5xTHR8ykhgoT1bjLGJCwvpYoISoggYb2bjDEJra5AEeHSREIECWOMSXiNFCgsSBhjTCKIUKCwIGGMMfGiERqyLUgYY0w8iXKgiPkgISItRWSpiIxv7LQYY0xMiGKPp6gHCRF5SkS2icjyWuvHicjXIvKtiNzu89JtwOzoptIYY2JclAJFY5QkZgDjfFeISDLwOHAu0B+YJCL9ReRM4Ctga7QTaYwxMc9foAhz8BBVDesBPZ1UJBuYo6onusunAvmqeo67fIe7aQbQEidwHAQuUtWqWseaAkwByMzMzJk1a1bI6SopKSEjIyPk/WOZ5S0+Wd7iU7zlbcyYMUtVNdffaynRTkwAxwIbfJaLgOGqehOAiEwGdtQOEACqOh2YDpCbm6t5eXkhJ6KgoICG7B/LLG/xyfIWnxIpb7ESJMTPupoijqrOqHNnkQnAhD59+oQ5WcYY07TFSu+mIqCbz3IWsKmR0mKMMcYVK0HiY6CviPQUkTTgCuA1rzvb2E3GGBMZjdEFdiZQCBwvIkUicp2qVgA3AW8AK4DZqvplEMe0UWCNMSYCot4moaqTAqyfC8wN8ZivA6/n5uZe35C0GWOMOVKjdIGNFBHZDqx3F9sAvkUL3+VAzzsCOxqQhNrnDGU7f6/Vt66p5M13OZx5C5SOYLYJZ958n8dq3vytD/Z/zvJWt4b+zwWTt7aq2snv0VU1IR/A9EDLdTxfEs5zhrKdv9fqW9dU8ua7HM68ec1ftPJWK58xmbf60l9XXi1v4ctbMHnwmrfaj1hpuI6E1+tYDvQ83OcMZTt/r9W3rqnkzXc5nHnzerxo5c1reryKRN78rY/V/7lEzltd24WatyMkVHVTQ4nIEg1w12G8s7zFJ8tbfEqkvCVySSIU0xs7ARFkeYtPlrf4lDB5s5KEMcaYgKwkYYwxJiALEsYYYwKyIGGMMSYgCxIBuNOmPiMifxeRqxo7PeEkIr1E5EkReamx0xIJInKh+7m9KiJnN3Z6wklEThCRaSLykojc2NjpCbdEna5YRPJEZKH72eU1dnqC0aSCRJBTp14MvKSq1wMXRD2xQQomb6q6RlWva5yUhibI/P2f+7lNBi5vhOQGJci8rVDVHwPfA2K+i2UiT1ccZN4UKAHScUa9jh8NvSswnh7AaGAosNxnXTKwGugFpAGf48yEdwcw2N3m+cZOezjz5vP6S42d7gjn7yFgaGOnPdx5w/nR8gFwZWOnPZx5A87EGQF6MjC+sdMe5rwlua9nAs81dtqDeTSpkoSqLgB21Vo9DPhWnV/XZcAsYCJOtM9yt4n59ynIvMWdYPInjj8C/1XVT6Kd1mAF+9mp6muqOgKI+WrQIPM2BjgFuBK4XkRi+v8umLzp4Vk1dwPNopjMBouVmekak9+pU4FHgcdE5HzCPwREtPjNm4h0AO4DhojIHar6h0ZJXcMF+uxuxvlV2kZE+qjqtMZIXAMF+uzycKpCmxHiqMkxIOTpiuNAoM/tYuAcoC3wWGMkLFQWJAJMnaqq+4Fro52YMAuUt53Aj6OdmAgIlL9HcYJ8PAuUtwKgILpJCbsGTVcc4wJ9bi8DL0c7MeEQ08W5KEnkqVMTOW+Q2PmzvMWnhMubBYkGTp0a4xI5b5DY+bO8xaeEy1uTChKRmDo1ViRy3iCx82d5s7zFMhvgzxhjTEBNqiRhjDEmOBYkjDHGBGRBwhhjTEAWJIwxxgRkQcIYY0xAFiSMMcYEXmguFgAABBRJREFUZEHCxDwReUREbvFZfkNE/uGz/JCI3NqA4+eLyP/zur6eYxWISIOH8HaPs8RnOVdEChp6XPdYk0UkrsYPMo3HgoSJBx8AIwDckUE7AgN8Xh8BLPJyIBFJDnvqIqeziJzb2ImoLc7eQ9NAFiRMPFiEGyRwgsNyYJ+ItBORZsAJwKfuEOEPiMhyEVkmIpdDzaxg74nI88Ayd92v3Ylh3gaOry8B7i/7P4rIYhFZJSKj3PXNRWSWiHwhIi8AzX32OVtECkXkExF5UUQyRKSNe97j3W1misj1AU77AHCXn7QcURIQkTnu6LCISImbzqUi8raIDHPTvkZEfCfP6iYi89y0TPU51tVuHj8TkSeqA4J73N+JyEfAqfW9XyZxWJAwMU9VNwEVItIdJ1gUAtVfVrnAF+7Y/RcDg4FBOEOFPyAix7iHGQb8WlX7i0gOzpg6Q9x9TvaYlBRVHQbcAlR/sd4IHFDVgTjDr+cAiEhHnC/4M1V1KLAEuFVVi3GGbZghIlcA7VT17wHOVwgcEpExHtMH0BIoUNUcYB9wL3AWcBHwO5/thuHMRzEYuMytzjoBZya/01R1MFDJ4TkrWuJMrjNcVd8PIj0mztlQ4SZeVJcmRgAP44zbPwIoxqmOAhgJzFTVSmCriMzHCQB7gcWqutbdbhTwiqoeABARrwOwVQ/1vBTIdp+Pxh2WXFW/EJEv3PWn4MxItkhEwJmlrNDd7i0RuQx4HCeg1eVenGBzm8c0lgHz3OfLgEOqWi4iy3zSDPCWO2Q8IvIyzntXgRPkPnbT3BzY5m5fCfzbYxpMArEgYeJFdbvESTjVTRuAX+AEgKfcbfyN5V9tf63lUAYtO+T+reTI/x1/xxKcL+JJR73gtKucABwE2lPHnMeq+q6I3IMTdKpVcGQtQLrP83I9PCBbVXWaVbVKROpKs7ppfkZV7/CTlFI3+JomxqqbTLxYBIwHdqlqparuwpnl61TcX+jAAuByEUkWkU44v/IX+znWAuAitz2hFTChAelagFslIyInAgPd9R8Cp4lIH/e1FiJynPvaz3FGCJ0EPCUiqfWc4z7gVz7L64DBIpIkIt1wqo6CdZaItBeR5sCFOO/vO8ClItLZTXN7EekRwrFNArGShIkXy3B6NT1fa12Gqu5wl1/BCRqf4/wy/pWqbhGRfr4HUtVP3Ebmz4D1wMIGpOtvwNNuNdNnuEFJVbeLMw3nTLdxHeAutxrnR8AwVd0nIgtwqpOmHnXkw+mdKyLbfVYtAtbi5H85EMo83u8D/wT6AM+r6hIAEbkLeNMt7ZQDP8F5j0wTZUOFG2OMCciqm4wxxgRkQcIYY0xAFiSMMcYEZEHCGGNMQBYkjDHGBGRBwhhjTEAWJIwxxgRkQcIYY0xA/x+78F4ngMSLEQAAAABJRU5ErkJggg==\n", "text/plain": [ "

" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "zipf_plot(P1w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model: the Bigram Model\n", "\n", "Even with billions of words of data, we won't have enough information to create a full joint probability model, where\n", "\n", "$P(w_1 \\ldots w_n) = \\Pi_{i=1}^n P(w_i \\mid w_1 \\ldots w_{i-1})$\n", "\n", "Why not? Consider a 30-word sentence. We've probably never seen those words in that exact order before, so we have no information about $P(w_{30} \\mid w_1 \\ldots w_{29})$. But we are much more likely to have seen the two-word sequences that make up the sentence. We call these two-word sequences **bigrams**. The **bigram model** says that the probability of each word depends on the immediately previous word but is independent of the words that came before that. It is less-wrong than the bag of words model. We can write an equation for the probability of a sequence of words (making the assumption that $w_0$ is a special \"start\" or \"separator\" symbol that we denote as ``):\n", " \n", "$P(w_1 \\ldots w_n) = P(w_1 \\mid \\mbox{}) \\times P(w_2 \\mid w_1) \\times \\ldots P(w_n \\mid w_{n-1}) = \\Pi_{i=1}^{n} P(w_i \\mid w_{i-1})$\n", "\n", "The conditional probability of each word given the previous word is given by:\n", "\n", "$P(w_i \\mid w_{i-1}) = \\frac{P(w_{i-1}w_i)}{P(w_{i-1})} $\n", "\n", "That is, the bigram probability of the word and the previous word, divided by the unigram probability of the previous word. \n", "\n", "The bigram model can be thought of as a \"multiple bags of words\" model, where there are multiple bags, one for each distinct word, and a special bag marked ``. (We assume that each sentence boundary in the text is also annotated with a ``.) To build a model, take a text,\n", "cut it up into slips of paper with two words on them, put each slip into the bag labelled with the first word on the slip. To generate language from the model, first pick a slip from the bag labeled `` and output the second word on the slip. Now pick a slip from the bag labeled with that second word, and output the second word on *that* slip. Continue in this manner.\n", "\n", "\n", "The file `count_2w.txt` has bigram data in the form `\"word1 word2 \\t count\"`. We'll load this into the bag `P2w`." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "P2w = load_counts(open('count_2w.txt'))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(258437, 225.955251755)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(P2w), sum(P2w.values())/1e9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Over a quarter million distinct two-word pairs, with a total count of 225 billion. Let's see the most common two-word sequences:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('of the', 2772205934),\n", " ('in the', 1735111785),\n", " ('to the', 1147345124),\n", " ('on the', 841095584),\n", " ('for the', 726714015),\n", " ('and the', 644282998),\n", " ('to be', 513603140),\n", " ('with the', 480181610),\n", " ('is a', 479118206),\n", " ('from the', 454051070),\n", " ('at the', 449390032),\n", " ('by the', 428337874),\n", " ('do not', 400755693),\n", " ('it is', 391172862),\n", " ('of a', 387456600),\n", " ('in a', 387077847),\n", " ('will be', 357181303),\n", " ('if you', 350999373),\n", " ('that the', 337117243),\n", " ('is the', 313123377)]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P2w.most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use `Pwords2` for the probability of a word sequence in the bigram model, and `cPword` for the conditional probability of a word given the previous word. \n", "\n", "But we run into a problem: in `cPword` it is natural to divide by the probability of the previous word, but we don't want to divide by zero in the case where we haven't seen that word before. So we will use a **backoff model** that says if we don't have counts for both words, we'll fall back to the unigram probability of the word (ignoring the previous word)." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "def Pwords2(words, prev='') -> float:\n", " \"\"\"The probability of a sequence of words, using bigram data, given previous word.\"\"\"\n", " return Π(cPword(w, (prev if (i == 0) else words[i-1]) )\n", " for (i, w) in enumerate(words))\n", "\n", "def cPword(word, prev) -> float:\n", " \"\"\"Conditional probability of word, given previous word.\"\"\"\n", " bigram = prev + ' ' + word\n", " if P2w(bigram) > 0 and P1w(prev) > 0:\n", " return P2w(bigram) / P1w(prev)\n", " else: # Average the back-off value and zero.\n", " return P1w(word) " ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.012268827179134844" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P2w('of the')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.575887217554441e-06" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P2w('the of')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll look at all 24 permutations of the words in \"this is a test\" and compute the probability for each under the bigram model. We see that \"this is a test\" is the most probable permutation, and \"this a is test\" is the least probable, almost a million times less probable: " ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(3.472808344777707e-07, 'this is a test'),\n", " (1.127597078828602e-07, 'test this is a'),\n", " (6.965779368966287e-08, 'this test is a'),\n", " (3.156249379346401e-08, 'a test this is'),\n", " (2.2987584450049622e-08, 'is a test this'),\n", " (1.5938698734133065e-08, 'is a this test'),\n", " (1.340683369945385e-08, 'test is a this'),\n", " (1.306996404175386e-08, 'a test is this'),\n", " (4.2119901013695514e-09, 'a this is test'),\n", " (4.0586505617016795e-09, 'a this test is'),\n", " (2.225067906668548e-09, 'test a this is'),\n", " (2.2250679066685475e-09, 'this is test a'),\n", " (1.7086438151095857e-09, 'is this test a'),\n", " (9.36681886482603e-10, 'this a test is'),\n", " (7.464591940646502e-10, 'is this a test'),\n", " (6.790746260498466e-10, 'test is this a'),\n", " (9.442288029367219e-11, 'is test a this'),\n", " (7.210056993667689e-11, 'a is this test'),\n", " (6.959024970571533e-11, 'is test this a'),\n", " (1.0936134003445812e-11, 'this test a is'),\n", " (7.330795983439265e-12, 'test a is this'),\n", " (6.215026001209826e-12, 'a is test this'),\n", " (1.551284601334948e-12, 'test this a is'),\n", " (9.945186564405085e-13, 'this a is test')]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted((Pwords2(t), sentence(t)) for t in permutations(tokens('this is a test')))[::-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under the bag of words model, all permutations have the same probability (except for roundoff error in the least-significant digit):" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2.9841954327098435e-11, ('this', 'test', 'is', 'a')),\n", " (2.9841954327098435e-11, ('this', 'test', 'a', 'is')),\n", " (2.9841954327098435e-11, ('this', 'is', 'test', 'a')),\n", " (2.9841954327098435e-11, ('test', 'this', 'is', 'a')),\n", " (2.9841954327098435e-11, ('test', 'this', 'a', 'is')),\n", " (2.9841954327098435e-11, ('test', 'is', 'this', 'a')),\n", " (2.9841954327098435e-11, ('test', 'a', 'this', 'is')),\n", " (2.9841954327098435e-11, ('is', 'this', 'test', 'a')),\n", " (2.9841954327098435e-11, ('is', 'test', 'this', 'a')),\n", " (2.9841954327098435e-11, ('a', 'test', 'this', 'is')),\n", " (2.984195432709843e-11, ('this', 'is', 'a', 'test')),\n", " (2.984195432709843e-11, ('this', 'a', 'test', 'is')),\n", " (2.984195432709843e-11, ('this', 'a', 'is', 'test')),\n", " (2.984195432709843e-11, ('test', 'is', 'a', 'this')),\n", " (2.984195432709843e-11, ('test', 'a', 'is', 'this')),\n", " (2.984195432709843e-11, ('is', 'this', 'a', 'test')),\n", " (2.984195432709843e-11, ('is', 'test', 'a', 'this')),\n", " (2.984195432709843e-11, ('is', 'a', 'this', 'test')),\n", " (2.984195432709843e-11, ('is', 'a', 'test', 'this')),\n", " (2.984195432709843e-11, ('a', 'this', 'test', 'is')),\n", " (2.984195432709843e-11, ('a', 'this', 'is', 'test')),\n", " (2.984195432709843e-11, ('a', 'test', 'is', 'this')),\n", " (2.984195432709843e-11, ('a', 'is', 'this', 'test')),\n", " (2.984195432709843e-11, ('a', 'is', 'test', 'this'))]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted((Pwords(t), t) for t in permutations(tokens('this is a test')))[::-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Task: Segmentation with a Bigram Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's re-do segmentation using a bigram model. To make `segment2`, we copy `segment`, and make sure to pass around the previous token, and to evaluate probabilities with `Pwords2` instead of `Pwords`." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "@lru_cache(None)\n", "def segment2(text, prev=''): \n", " \"Return best segmentation of text; use bigram data.\" \n", " if not text: \n", " return []\n", " else:\n", " candidates = ([first] + segment2(rest, first) \n", " for (first, rest) in splits(text, 1))\n", " return max(candidates, key=lambda words: Pwords2(words, prev))\n", " \n", "def splits(text, start=0, end=20) -> Tuple[str, str]:\n", " \"\"\"Return a list of all (first, rest) pairs; start <= len(first) <= L.\"\"\"\n", " return [(text[:i], text[i:]) \n", " for i in range(start, min(len(text), end)+1)]" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['choose', 'spain']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment2('choosespain')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['speed', 'of', 'art']" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment2('speedofart')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the following example, both `segment` and `segment2` perform perfectly:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('a dry bare sandy hole with nothing in it to sit down on or to eat',\n", " 'a dry bare sandy hole with nothing in it to sit down on or to eat')" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tolkein = 'adrybaresandyholewithnothinginittositdownonortoeat'\n", "sentence(segment(tolkein)), sentence(segment2(tolkein))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But for the next example, `segment` gets three words wrong, while `segment2` only misses one word, `unregarded`, which is not in `P1w`." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'far out in the un chart ed back waters of the un fashionable end of the western spiral arm of the galaxy lies a small un regarded yellow sun'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adams = ('faroutintheunchartedbackwatersoftheunfashionableendofthewesternspiral' +\n", " 'armofthegalaxyliesasmallunregardedyellowsun')\n", "sentence(segment(adams))" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the galaxy lies a small un regarded yellow sun'" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence(segment2(adams))" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'unregarded' in P1w" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion? The bigram model is a little better, but these examples haven't demonstrated a large difference. Even with hundreds of billions of words as data, we still don't have a complete model of English words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Theory: Evaluation\n", "\n", "So far, we've got an intuitive feel for how this all works. But we don't have any solid metrics that quantify the results. Without metrics, we can't say if we are doing well, nor if a change is an improvement. In general,\n", "when developing a program that relies on data to help make\n", "predictions, it is good practice to divide your data into three sets:\n", "
    \n", "
  1. Training set: the data used to create our spelling\n", " model; this was the big.txt file.\n", "
  2. Development set: a set of input/output pairs that we can\n", " use to rank the performance of our program as we are developing it.\n", "
  3. Test set: another set of input/output pairs that we use\n", " to rank our program after we are done developing it. The\n", " development set can't be used for this purpose—once the\n", " programmer has looked at the development test it is tainted, because\n", " the programmer might modify the program just to pass the development\n", " test. That's why we need a separate test set that is only looked at\n", " after development is done.\n", "
\n", "\n", "For this program, the training data is the word frequency BAG, the development set is the examples like `\"choosespain\"` that we have been playing with, and now we need a test set." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "def test_segmenter(segmenter, tests):\n", " \"Try segmenter on tests; report failures; return fraction correct.\"\n", " return sum([test_one_segment(segmenter, test) \n", " for test in tests]), len(tests)\n", "\n", "def test_one_segment(segmenter, test):\n", " words = tokens(test)\n", " result = segmenter(cat(words))\n", " correct = (result == words)\n", " if not correct:\n", " print('expected:', sentence(words))\n", " print(' got:', sentence(result))\n", " return correct\n", "\n", "cat = ''.join\n", "\n", "proverbs = (\"\"\"A little knowledge is a dangerous thing\n", " A man who is his own lawyer has a fool for his client\n", " All work and no play makes Jack a dull boy\n", " Better to remain silent and be thought a fool that to speak and remove all doubt;\n", " Do unto others as you would have them do to you\n", " Early to bed and early to rise, makes a man healthy, wealthy and wise\n", " Fools rush in where angels fear to tread\n", " Genius is one percent inspiration, ninety-nine percent perspiration\n", " If you lie down with dogs, you will get up with fleas\n", " Lightning never strikes twice in the same place\n", " Power corrupts; absolute power corrupts absolutely\n", " Here today, gone tomorrow\n", " See no evil, hear no evil, speak no evil\n", " Sticks and stones may break my bones, but words will never hurt me\n", " Take care of the pence and the pounds will take care of themselves\n", " Take care of the sense and the sounds will take care of themselves\n", " The bigger they are, the harder they fall\n", " The grass is always greener on the other side of the fence\n", " The more things change, the more they stay the same\n", " Those who do not learn from history are doomed to repeat it\"\"\"\n", " .splitlines())" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "expected: power corrupts absolute power corrupts absolutely\n", " got: power corrupt s absolute power corrupt s absolutely\n", "expected: the grass is always greener on the other side of the fence\n", " got: the grass is always green er on the other side of the fence\n" ] }, { "data": { "text/plain": [ "(18, 20)" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_segmenter(segment, proverbs)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20, 20)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_segmenter(segment2, proverbs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This confirms that both segmenters are good, and that `segment2` is slightly better, getting 20 out of 20 examples right, compared to 18 out of 20 for `segment`. There is much more that can be done in terms of the variety of tests, and in measuring statistical significance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Theory and Practice: Smoothing\n", "\n", "Here are some more test cases:\n" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'this is the oligonucleotide test': 4.287506893716667e-14,\n", " 'this is the neverbeforeseen test': 0.0,\n", " 'this is the zqbhjhsyefvvjqc test': 0.0}" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tests = ['this is the oligonucleotide test',\n", " 'this is the neverbeforeseen test',\n", " 'this is the zqbhjhsyefvvjqc test']\n", "\n", "{test: Pwords2(tokens(test)) for test in tests}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The issue here is the finality of a probability of zero. Out of the three 15-letter words, it turns out that \"oligonucleotide\" is in the dictionary, but if it hadn't been, if somehow our corpus of words had missed it, then the probability of that whole phrase would have been zero. It seems that is too strict; there must be some \"real\" words that are not in our dictionary, so we shouldn't give them probability zero. There is also a question of likelyhood of being a \"real\" word. It does seem that \"neverbeforeseen\" is more English-like than \"zqbhjhsyefvvjqc\", and so perhaps should have a higher probability.\n", "\n", "We can address this by assigning a non-zero probability to words that are not in the dictionary. This is even more important when it comes to multi-word phrases (such as bigrams), because many legitimate bigrams will not appear in the training data.\n", "\n", "We can think of our probability model as being overly spiky; it has a spike of probability mass wherever a word or phrase occurs in the corpus. What we would like to do is *smooth* over those spikes so that we get a model that does not depend so much on the details of our corpus. \n", "\n", "For example, Laplace was asked for an estimate of the probability of the sun rising tomorrow. From past data, he knows that the sun has risen $n/n$ times for the last *n* days (with some uncertainty about the value of *n*), so the maximum liklihood estimator is probability 1. But Laplace wanted to balance the observed data with the possibility that tomorrow, either the sun will rise or it won't, so he came up with an estimate of $(n + 1) / (n + 2)$.\n", "\n", "
\n", " \n", "
What we know is little, and what we are ignorant of is immense.
— Pierre Simon Laplace (1749-1827)
\n", "

\n", " \n", "\n", "A generalization of Laplace's approach is called [additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing): we add a **pseudocount** (which does not have to be 1) to each of the outcomes we have seen so far, plus one more for the unseen outcomes. We will redefine `Bag` and `P1w` to use additive smoothing. This time, I can't have the `ProbabilityFunction` as a mixin, because I need to supply the psuedocount as an argument when the bag is constructed, and then use it within the probability function. Thus, I will redefine `Bag` as follows:\n" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "class Bag(Counter): \n", " \"\"\"A bag of words with a probability function using additive smoothing.\"\"\"\n", " def __init__(self, elements=(), pseudocount=1):\n", " self.update(elements)\n", " self.pseudocount = pseudocount\n", "\n", " def __call__(self, outcome):\n", " \"\"\"The probability of `outcome`, smoothed by adding pseudocount to each outcome.\"\"\"\n", " if not hasattr(self, 'denominator'):\n", " self.denominator = sum(self.values()) + self.pseudocount * (len(self) + 1)\n", " return (self[outcome] + self.pseudocount) / self.denominator" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "P1w = Bag(P1w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now an unknown word has a non-zero probability, as do sequences that contain unknown words, but known words still have higher probabilities:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.7003201005861308e-12" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P1w('neverbeforeseen')" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'this is the oligonucleotide test': 4.287512523438411e-14,\n", " 'this is the neverbeforeseen test': 8.033021050553851e-20,\n", " 'this is the zqbhjhsyefvvjqc test': 8.033021050553851e-20}" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{test: Pwords2(tokens(test)) for test in tests}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many more advanced ways to do smoothing, but we won't cover them here.\n", "\n", "There is one more issue to contend with that we will mention but not fix: the smallest positive 64-bit floating point number is about $10^{-323}$. That means that if we are trying to compute the probability of a long sequence of words, we will eventually reach a point where we get **floating point underflow** and the probability is incorrectly reported as zero. We see that this happens somewhere around a 100 word sequence, depending on the exact words:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{70: 1.1733110491887828e-219,\n", " 80: 2.279383032662303e-258,\n", " 90: 1.5643367062438602e-271,\n", " 100: 7.120951572374323e-301,\n", " 110: 1.67177e-319,\n", " 120: 0.0,\n", " 130: 0.0,\n", " 140: 0.0,\n", " 150: 0.0}" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{n: Pwords(sample(WORDS, n)) for n in range(70, 151, 10)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This problem can be addressed by adding logarithms rather than multiplying probabilities, or by re-scaling numbers when they get too small." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Task: Secret Codes\n", "\n", "Let's tackle one more task: decoding secret messages. We'll start with the simplest of codes, a rotation cipher, sometimes called a shift cipher or a Caesar cipher (because this was state-of-the-art crypotgraphy in 100 BC). First, a method to encode:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "def rot(msg, n=13): \n", " \"Encode a message with a rotation (Caesar) cipher.\" \n", " return encode(msg, alphabet[n:]+alphabet[:n])\n", "\n", "def encode(msg, key): \n", " \"Encode a message with a substitution cipher.\" \n", " table = str.maketrans(upperlower(alphabet), upperlower(key))\n", " return msg.translate(table) \n", "\n", "def upperlower(text): return text.upper() + text.lower() " ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Uijt jt b tfdsfu nfttbhf.'" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "msg = 'This is a secret message.'\n", "\n", "rot(msg, 1)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Guvf vf n frperg zrffntr.'" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rot(msg, 13)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a secret message.'" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rot(rot(msg, 13), 13)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decoding a Caesar cipher message is easy: try all 26 candidates, and find the one with the maximum `Pwords`:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "def decode_rot(secret):\n", " \"Decode a secret message that has been encoded with a rotation cipher.\"\n", " candidates = [tokens(rot(secret, i)) for i in range(26)]\n", " return sentence(max(candidates, key=Pwords))" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'this is a secret message'" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "secret = rot(msg, 17)\n", "\n", "decode_rot(secret)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make it a bit harder. When the secret message contains separate words, it is too easy to decode by guessing that the one-letter words are most likely \"I\" or \"a\". So what if the encode routine squished all the letters together:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "def squish_rot(msg, n=13):\n", " \"Encode a message with a rotation (Caesar) cipher, keeping letters only.\" \n", " return encode(cat(tokens(msg)), alphabet[n:]+alphabet[:n])" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'pahdghplmaxtglpxkmablmbfxtgrhgxtgrhgxunxeexk'" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "secret = squish_rot('Who knows the answer this time? Anyone? Anyone? Bueller?', 19)\n", "secret" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks harder to decode. A decoder will have to try all rotations, then segment each rotation, and find the candidate with the best `Pwords`:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "def decode_rot(secret):\n", " \"\"\"Decode a secret message that has been encoded with a rotation cipher,\n", " and which has had all the non-letters squeezed out.\"\"\"\n", " candidates = [segment(rot(secret, i)) for i in range(26)]\n", " return max(candidates, key=lambda msg: Pwords(msg))" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'who knows the answer this time anyone anyone bu e ll er'" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence(decode_rot(secret))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Almost right, but didn't quite get the proper name.\n", "\n", "What about a general substitution cipher? The problem is that there are $26! \\approx 10^{26}$ substitution ciphers, and we can't enumerate all of them. We would need to search through the space: initially make some guess at a substitution, then swap two letters; if that looks better keep going, if not try something else. This approach solves most substitution cipher problems, although it can take a few minutes on a message of length 100 words or so.\n", "\n", "# What's Next?\n", "\n", "What to do next? Here are some options:\n", " \n", "- **Spelling correction**: Use bigram or trigram context; make a model of spelling errors/edit distance; go beyond edit distance 2; make it more efficient\n", "- **Evaluation**: Make a serious test suite; search for best parameters (e.g. $c_1, c_2, c_3$)\n", "- **Smoothing**: Implement Kneser-Ney and/or Interpolation; do letter *n*-gram-based smoothing\n", "- **Secret Codes**: Implement a search over substitution ciphers\n", "- **Classification**: Given a corpus of texts, each with a classification label, write a classifier that will take a new text and return a label. Examples: spam/no-spam; favorable/unfavorable; what author am I most like; reading level.\n", "- **Clustering**: Group data by similarity. Find synonyms/related words.\n", "- **Parsing**: Representing nested structures rather than linear sequences of words. relations between parts of the structure. Implicit missing bits. Inducing a grammar.\n", "- **Meaning**: What semantic relations are meant by the syntactic relations?\n", "- **Translation**: Using examples to transform one language into another.\n", "- **Question Answering**: Using examples to transfer a question into an answer, either by retrieving a passage, or by synthesizing one.\n", "- **Speech**: Dealing with analog audio signals rather than discrete sequences of characters." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }