{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Boring preliminaries\n", "import re\n", "import math\n", "import string\n", "import random\n", "from collections import Counter\n", "from math import log10\n", "from __future__ import division\n", "from matplotlib.pyplot import yscale, xscale, title, plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Statistical Natural Language Processing in Python.\n", "
or\n", "
How To Do Things With Words. And Counters.\n", "
or\n", "
Everything I Needed to Know About NLP I learned From Sesame Street.\n", "
Except Kneser-Ney Smoothing.\n", "
The Count Didn't Cover That.\n", "
\n", "
\n", "
*One, two, three, ah, ah, ah!* — The Count\n", "

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(1) Data: Text and Words\n", "========\n", "\n", "Before we can do things with words, we need some words. First we need some *text*, possibly from a *file*. Then we can break the text into words. I happen to have a big text called [big.txt](file:///Users/pnorvig/Documents/ipynb/big.txt). We can read it, and see how big it is (in characters):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6488666" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TEXT = open('../data/text/big.txt').read()\n", "len(TEXT)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, six million characters.\n", "\n", "Now let's break the text up into words (or more formal-sounding, *tokens*). For now we'll ignore all the punctuation and numbers, and anything that is not a letter." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def tokens(text):\n", " \"List all the word tokens (consecutive letters) in a text. Normalize to lowercase.\"\n", " return re.findall('[a-z]+', text.lower()) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['this', 'is', 'a', 'test', 'this', 'is']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens('This is: A test, 1, 2, 3, this is.')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1105285" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "WORDS = tokens(TEXT)\n", "len(WORDS)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, a million words. Here are the first 10:\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'adventures', 'of', 'sherlock', 'holmes']\n" ] } ], "source": [ "print(WORDS[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(2) Models: Bag of Words\n", "====\n", "\n", "The list `WORDS` is a list of the words in the `TEXT`, but it can also serve as a *generative model* of text. We know that language is very complicated, but we can create a simplified model of language that captures part of the complexity. In the *bag of words* model, we ignore the order of words, but maintain their frequency. Think of it this way: take all the words from the text, and throw them into a bag. Shake the bag, and then generating a sentence consists of pulling words out of the bag one at a time. Chances are it won't be grammatical or sensible, but it will have words in roughly the right proportions. 
Here's a function to sample an *n* word sentence from a bag of words:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def sample(bag, n=10):\n", " \"Sample a random n-word sentence from the model described by the bag of words.\"\n", " return ' '.join(random.choice(bag) for _ in range(n))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'killing in of had is angrily prophecy throat his seller'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample(WORDS)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another representation for a bag of words is a `Counter`, which is a dictionary of `{'word': count}` pairs. For example," ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'is': 2, 'this': 1, 'a': 2, 'test': 2, 'it': 1})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(tokens('Is this a test? It is a test!'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `Counter` is like a `dict`, but with a few extra methods. Let's make a `Counter` for the big list of `WORDS` and get a feel for what's there:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('the', 80030), ('of', 40025), ('and', 38313), ('to', 28766), ('in', 22050), ('a', 21155), ('that', 12512), ('he', 12401), ('was', 11410), ('it', 10681)]\n" ] } ], "source": [ "COUNTS = Counter(WORDS)\n", "\n", "print(COUNTS.most_common(10))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "80030 the\n", "83 rare\n", "38313 and\n", "0 neverbeforeseen\n", "460 words\n" ] } ], "source": [ "for w in tokens('the rare and neverbeforeseen words'):\n", " print(COUNTS[w], w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In 1935, linguist George Zipf noted that in any big text, the *n*th most frequent word appears with a frequency of about 1/*n* of the most frequent word. He get's credit for *Zipf's Law*, even though Felix Auerbach made the same observation in 1913. If we plot the frequency of words, most common first, on a log-log plot, they should come out as a straight line if Zipf's Law holds. 
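\n\nAs a rough numeric spot-check (a minimal sketch that just reuses the `COUNTS` Counter from above), we can compare the count of the *n*-th most frequent word with 1/*n* of the top count:\n\n    M = COUNTS['the']                       # count of the most frequent word\n    for n in (1, 10, 100, 1000):\n        (w, c) = COUNTS.most_common(n)[-1]  # the n-th most frequent word\n        print(n, w, c, round(M / n))        # observed count vs. the Zipf prediction M/n\n\n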
Here we see that it is a fairly close fit:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEMCAYAAADK231MAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3Xd4FNX6wPHvm0aogdAhhd4EpBfp0hEsYAEVG4JYUO9V7L8rtmv3WlARBRFRFLGBUgSlSgfpSJEaegkBQvqe3x8z0U3fkM2W5P08Dw+7M2dm3z07effsmTNnxBiDUkqpoivA2wEopZQqXJrolVKqiNNEr5RSRZwmeqWUKuI00SulVBGniV4ppYo4TfTFgIhUFZGlInJeRN4spNfYLyK9CmPfhUlEXhSRUyJyzNux+CsRGSci0zz0Wl45zpzfo4hEicgFEQn0dByXyq8Tvf2hJ9iVnv6vhrfj8kGjgFNAOWPMIwXdmYhMEZEXCx5W4RIRIyL1clkfBTwCNDHGVPNcZAXnyeTqL0Skh4gsEpE4EdmfQ5mOIrKiIK9jjDlojCljjEkryH48ya8TvW2QXenp/45kLiAiQd4IzIdEA9uNXh2XWRRw2hhzIruVetxkJBZfzhnxwGRgbC5lrgLmeCYcH2KM8dt/wH6gVzbLawEGGAEcBJbayzsAK4CzwCagu9M2tYElwHlgATAemGav6w7E5PTaWF+YTwB/AaeBGUB4plhut2M5BTzttJ9A4Cl72/PAeiASeB94M9NrzgL+lUNdXAGsBeLs/6+wl08BUoBk4EIO9TXFfr2f7RhWA3VzeJ1RmfY326k+HgU22zF8DYTmsI87gN+B/9mfxV47/juAQ8AJ4Han8mHAVOAkcAB4Bgiw19WzP7c4u26/tpcvtes93o7zpkwx9AISAIe9foq/HDdAP7v+U+zYN2VTx3emfzb2893AN07PDwEtcjt27HWLgZfszyvBru8c33M2cVQAfrI/u1j7cUSm/b9g7/888AtQyWn9cPszPw08TQ5/89l8tvtzWLcBaGU/NsBou27OYv0NSA7bjXP6XNM/myAX30OOx4/HcqWnX9Ctweed6KcCpYGSQE37YBlg/4H1tp9XtrdZCbwFlAC62h+Yq3+wDwGrgAh7+4+A6Zli+diO43IgCWhsrx8LbAEaAmKvrwi0A47wT0KrBFwEqmbzfsPtP6LhQBAwzH5e0V4/BXgxl3qcYtdFO3v7L4Cv8ij/Yjb1sQaoYcezAxidw/Z3AKlYySgQeBErmb1v118fu/7L2OWnAj8CZe363AWMsNdNx0oAAUAo0NnpdQxQL5f3keFz9bPjZhw5JFd7fR2sxBJgfyYH0mOx18Xa6/I6dhbbn81l9vrg3N5zNnFUBIYApezP7xvgB6f1i7G+6BrY73Mx8Iq9rgnWF1lX+7XewjpuLinRA9WBw9jJ3K7fn4DyWL/uTgL9ctjn3/VN9ok+p/eQ6/HjsVzpyRdze/DWH80F+4A+m34AOX0QdZzKPg58nmn7+Vgtpij7ACrttO5LXP+D3QH0zHRApdh/GOmxOLdi1gBD7cc7gWtyeH87gN724weAOTmUGw6sybRsJXCH/XgKeSf6T5yeDwD+zKN8don+VqfnrwETctj+DmC30/Nmdh1VdVp2GmiB9UWQjNWPnr7uHmCx/XgqMNG5fp3KXWqi94fjZhy5JHq7zCGgFTDUrqM1QCOsL9hZLh47i4Hnndbl+p5d+JttAcQ6PV8MPOP0/D5gnv34Pzg1OLC+fJO59EQ/ApiU6fhwbhjMAJ7IYZ9/1zfZJ/qc3kOOx48r9eWuf77c3+aqa40x5e1/12Zad8jpcTRwg4icTf8HdMb646qBdfDFO5U/kI8YooHvnfa7A0gDqjqVcR7VcREoYz+OxGoNZOcz4Fb78a3A5zmUS2+xOTuA1ZpwVbbxichTTie6J1zKPnJw3OlxAoAxJvOyMli/ZILJ+P6c39tjWL+E1ojINhG5K48YXeEPx40rlmB92XS1Hy8Gutn/lthlXDl2nOsjX+9ZREqJyEcickBEzmF1qZXPNGIlp/dYw/m17dc8ndNruWAAWfvnC1K/ruwnt+PHY4pCos+NcXp8COubtbzTv9LGmFeAo0AFESntVD7K6XE81k9PAOyDtHKmfffPtO9QY8xhF2I8BNTNYd004BoRuRxoDPyQQ7kjWAeUsyisn6kFYoz5r/nnRPfo9MUF3W8+nMJq5Tq/v7/fmzHmmDFmpDGmBlZL/4PcRtq4yB+OG1c+g/RE38V+vISsid6VY8f5tfJ6z5k9gtUt2d4YUw7rSwesL+e8HMVqCFkbiJTC6grKNxEJxnrfCy5l+wLI7fjxmKKe6J1NAwaJSF8RCRSRUBHpLiIRxpgDwDrgOREJEZHOwCCnbXcBoSJylX3APIPVZ5huAvCSiEQDiEhlEbnGxbg+AV4Qkfr2qIbmIlIRwBgTg3Vy7HPgW2NMQg77mAM0EJGbRSRIRG7C6t/8ycUY8us4Vj9voTPWELYZWPVb1q7jf2N9nojIDSISYRePxUpKDjfG6avHzXGgVh6jYJYAPYCS9rG0DOtEbkXgD7tMvo4dF95zZmWxfp2dFZFw4FkX3x/ATGCgiHQWkRDgeXLJWSISICKhWL8Axf6sQuzVnYHNxphz+Xh9d8jx+PFkEMUm0RtjDgHXYI1wOYn1TTuWf+rgZqA9cAbrYJzqtG0cVr/bJ1gtnXggxmn372CNiPlFRM5jnWBr72Job2Elsl+Ac8AkrBM66T7D6sPOqdsGY8xpYCBW6+k0VnfGQGPMKRdjyK9JQBP7p2hOvzLcaQxWne8FlmP1CU+217UFVovIBazP4CFjzF573TjgMzvOGy/lhX34uPnG/v+0iGzIIfZdWOewltnPz2HV4e/2F+ilHjs5vudsvI11PJ/Cen/zXHlzdmzbgPuxPu+jWF/kMbls0hXrS2UO1q+MBKy/K/DSsMq8jh8RmeBCl2iBpZ99VpmIyDisE3m35lW2kOPoitUqiDb6Yfk8XzluVEYish243hiz3duxeEOxadH7I/vn/kNYI2I0ySt1Cezum6nFNcmDJnqfJSKNsYaMVsf6+auUugTGmGRPn/z0Ndp1o5RSRZy26JVSqojTRK+UUkWcT8zOV6lSJVOrVi1vh6GUUn5l/fr1p4wxlfMq5xOJvlatWqxbt87bYSillF8REZem3HB714191dcy+0KA7u7ev1JKqfxxKdGLyGQROSEiWzMt7yciO0Vkj4g8YS82WFfjhZL7VWxKKaU8wNUW/RSsOTL+Zk/Q9D7QH2tujGEi0gRYZozpjzU953PuC1UppdSlcCnRG2OWYs1r4awdsMcYs9cYkwx8hTWvevqEUrFknMBJKaWUFxTkZGxNM
s5THQO0F5HBQF+su7aMz2ljERmFdVs6oqJym+VUKaVUQbh91I0x5jvgOxfKTRSRo8CgkJCQ1u6OQymllKUgo24O43RTAKz7XubrRhfGmNnGmFFhYWEFCEMppVRuCpLo1wL1RaS2PTvcUKy5tV0mIoNEZGJcXFwBwlBKKZUbV4dXTse6YXBDEYkRkRHGmFSsG1bPx7rX5Qz7RgEu0xa9UkoVPpf66I0xw3JYPgcv3LVFKaWU67w6qZl23SilVOHzaqLXrhullCp82qJXSqkiTlv0SilVxOmNR5RSqojTRK+UUkWc9tErpVQRp330SilVxPnErQSTEuPZu33NpW1cIoyg8pGEhgQQGhxIyeBAggO1R0oppdL5RKIvcWYXdWb0vuTtNzrqMDvtCn5K68BxwgkMEEoGBxIabCX/9C+AzM//XhYSSGhQICVDrGXt64TTqFo5N75DpZTyHjHGeO/FRQYBg+pEVh8586OXLmkfJeJjqH5oDhXitmMQjoS1ZFt4bzaV604sZUlMTiMxNY2E5DQSUxwkpKSRaP+zHlvLklMdGfbbrlY4wztG0/eyaoQE6S8EpZTvEZH1xpg2eZbzZqJP16ZNG7Nu3bqC7eTUHtj6LWydCad2QUAQ1OkBTYdAo6sgNPcWeprDkJSaRlxCCrM3HWHaqoMcPHORymVLMKxdFDe3i6JaWGjBYlRKKTcqfok+nTFwfCtsmQlbv4O4gxBYAhr0gabXQ4O+EFwyz904HIYlu04ydeV+Fu86SYAIfS+ryvAOtehQJxwRcU+8Sil1iYpvondmDMSstVv630H8CQgpY7Xwmw6xWvxBIXnu5sDpeL5YfZAZ6w5x9mIKDaqWYXiHaK5rFUGZEj5xmkMpVQxpos/MkQb7l1lJf/ssSDwLJStA46uh2fUQ3QkCAnPdRWJKGrM2HeHzlQfYcjiOMiWCGNyqJnd3rkNUxVKFG79SSmXiF4k+/WRsvXr1Ru7evdtzL5yaDH/9ZvXn/zkHUuKhTDW47DqrpR/RBnLpmjHGsPHQWT5feYCfNh/FYLi1QzRjrqxPeOm8fyEopZQ7+EWiT+eRFn1Oki/CrnlWS3/3L5CWDJUaQruR0PymPE/iHj+XyNsLd/H12kOUDgni3h51uatTbUKDc/91oJRSBaWJ/lIkxlndOusmwZE/rP78y4dC25FQpVGum+4+fp5X5/3Jwh0nqB4Wyr97N2BwqwgCA/SkrVKqcGiiL6iY9bBmImz7zmrl1+oC7UZBwwEQmPMJ2FV7T/PynB1siomjUbWyPNG/Ed0aVNZROkopt9NE7y7xp2DDVFg3GeIOQbma0OZOaHU7lKmS7SbGGOZsOcZr8//kwOmLtIgsT8OqZakWFkr1sFD7/5JUCwulXGiQfgkopS6JJnp3c6RZfflrJsLexRAQbJ28bTcSItpme/I2OdXBF6sP8MPGIxw9m8DJC0lkru7hHaJ54dqmnnkPSqkiRRN9YTq1G9Z+Ahu/hKRzULM1XPEgNB6U6xDNlDQHJ84ncSwugSNnE1mw/TizNh1h5uiOtKkV7sE3oJQqCvwi0XtteKW7JF2ATdNh5fsQuw8q1IYrHoAWt7h09e3F5FR6vrmE8NIhzHqgs564VUrli6uJXuejL4gSZayumzHr4capUCocfn4E/tcUFr8KF8/kunmpkCCeGtCYbUfO8dXagx4KWilV3Oi0jO4QEAhNroG7f4U75lhdOYv/C281gTljIXZ/jpsObF6d9rXDeWP+Ts5eTPZczEqpYkMTvTuJQK1OcMsMuG8VNB0M6z6Fd1vB96Otvv0smwjjrr6MuIQU3lqwywtBK6WKOk30haVKY7j2A3h4M3S4F7b9AO+3g5kj4MSfGYo2rl6OWztEM23VAXYcPeelgJVSRZUm+sJWrgb0fQke3gJXjIGdc+GDDjDjdji29e9i/+7dgLCSwYybtQ1fGAmllCo6NNF7SpnK0Pt5K+F3eQT2/AoTOsFXt8CRjZQvFcKjfRuyet8Zftp81NvRKqWKEB1H7y0JsbD6I1j1gTXHTv2+pHUZy9U/JHLw9EXqVS1DiSD7HrdBgTStWY5rW9YkooJOh6yUsnh1HL2IlAaWAOOMMT/lVb5YJvp0iXHW1bYr34eEWOIju/GhGcImaURSioPE1DTik1L562Q8AB3rVGRwq5r0a1qNsqHBXg5eKeVNbk30IjIZGAicMMY0dVreD3gHCAQ+Mca8Yi9/HrgAbNdE76Kk87B2Eqx4Dy6egnq94MpnoEZLAA6ducj3fxzmuw0x7D99kZCgAHo0rMzA5jXo2bgKpUL0TldKFTfuTvRdsRL31PRELyKBwC6gNxADrAWGATWBikAocEoTfT4lx1vTKyz/n9W903gQ9Hjm72mSjTFsOHiW2ZuO8POWo5w8n0TJ4EB6Nq7CwOY16N6wss6Fr1Qx4fauGxGpBfzklOg7YnXN9LWfP2kXLQOUBpoACcB1xhhHbvvWRJ+NxDhY+YHVpZN8wboJSvcnILz230XSHIY1+87w0+YjzN16jDPxyZQMDqR+1TLUq1KGqPBSlAgKJCQogAHNqlE9LO9pGZRS/sMTif56oJ8x5m77+XCgvTHmAfv5HeTSoheRUcAogKioqNYHDhxwKY5iJ/40/P621Y/vSIWWw6HbY9awTSepaQ5W/HWaRTtPsPv4BXafOM/xc0l/r69TuTQ/jemsXTxKFSFeT/T5oS16F5w7CsvegPWfgQRYc+x0/heUrpTjJqlpDlIdhtX7znDHp2sY2jaSlwc392DQSqnC5IlJzQ4DkU7PI+xlLhORQSIyMS4urgBhFBPlqsNVb1oTqDW73hqW+c7l8NuLkHA2202CAq3hmd0aVGZ0t7pMX3OIeVt1jL5SxU1BWvRBWCdje2Il+LXAzcaYbfkNQlv0l+DkLlj0Emz/AULLWxdhtRsFwaHZFk9OdXD9hBVsO3KObg0qc13LmvRuUlVP3Crlx9w96mY60B2oBBwHnjXGTBKRAcDbWMMrJxtjXspnkP49H70vOLoJfn0e9iyEsEjo8TQ0vzHbG6CcPJ/EpOX7+HHjYY7GJVK2RBADL6/Ofd3rERmuF2Ip5W/84sYj6bRF7wZ7l8CC/8DRjVC1KfQaZ43Fz+YWh2kOw6q9p/luw2F+3nIEh4HbOkTTtUFlGlUrS4XSIQQH6uwYSvk6v0j02qJ3M4cDtn0Hv71gzYFfq4s1v07NVjluciwukdfm/8n3fxzOcD/bauVCia5YitIlgujZuAq3tI8u/PiVUvniF4k+nbbo3Sw1GdZ/CktehYun4bLB0PP/ILxOjpucS0xh86E49p66wOkLyRw6c5GDZy5yJj6ZvafiGd2tLo/3a4hk8wtBKeUdmugVJJ6DFe9aF12lpUCbu6wx+LkMycwszWH4z49b+WL1QYa1i+SFa5oSpN06SvkEv0j02nXjIeePweKXYcPnEFwKOj0EHe+DkNIubW6M4a0Fu3jvtz0EBQgVy4RQKiSIcqFBjLmyPr2aVC3kN6CUyo5fJPp02qL3kJO74Nfn4M+foExVa9K0FrdkO0InO7/uOM76
A7GcupBEYoqD7UfPsefEBRpVK0vj6uW4uX0UbWuFF/KbUEql00SvcnZwFfzyDMSstUbo9HkR6vbI926SUtOYtHwfa/edYVNMHMmpDmY90Ik6lcsUQtBKqcz8ItFr140XGQPbvoeFz8LZg1C/D/R+4e9ZMvPr8NkEBr67jPpVyzLjno5uDlYplR1PTIFQYMaY2caYUWFhYd4Mo3gSgaaD4f611hDMg6vgwyvg50cg/lS+d1ezfEke7FmfNfvOsG7/mUIIWCl1qXT4RHEXHGqdnH3wD2tUzrpP4d2WsPxtSEnM166Gto0ivHQIj3yziXlbj5Kcmuvs1EopD9E+epXRyZ3WFba75kFYFPR6FpoOyfYK2+ys3nuax77dzIHTF4muWIqejaoysmttnQtfqUKgffSqYPYuhvnPwPEtENEW+v4XItu5tGlqmoPf/jzB+4v/YvuRONrVDmfaiPZ6sZVSbuYXiT6dtuh9lCMNNk2HX1+AC8fgsuusOXQq1HJ5F5+v3M///biNm9pEcl2rmrSILK8zZirlJprolfskXbBuWr7iXesuV+1HW9Milyyf56YOh+GVeX/y8bK9GANlSgTx3s0t6dGwigcCV6po00Sv3O/cEetGJxu/hFLh1pTIrW6HwLxvT3gmPpkNB2J56vstVCxTgjkPdtauHKUKSBO9KjxHN8G8p+DAcqhyGfR7Gep0c2nTL1Yf4OnvtxJWMphmNcPoWLcig1vV1JO1Sl0Cv0j0ejLWjxkDO2ZZV9iePQiNBlrj8SvWzXWzNIdh+pqDbImJY/3BWPacuECrqPJ8d18nDwWuVNHhF4k+nbbo/VhKIqx6H5a+CY4Uq/++61gILefS5uN/280bv+xi6dgeRFXUu1wplR9+cWWsKgKCQ60Tsw9ugGY3WCds32sF6z+zRu3kYUCz6gQGCDd/soqLyakeCFip4kcTvXKPstXg2g9g5CIIrwuzH4SJ3WH/77luVqdyGd6/uSUxsQn86+uNxF1M8Uy8ShUjmuiVe9VsBXfNgyGT4OIZmDIAZtwGsQdy3KRf0+rc3jGahTtO0POtJbz403YcDu93KSpVVGiiV+4nAs2uhwfWQvenYPcCGN8Wfn3eGpOfjeeuacrM0R2JrliKT5bvY9amIx4OWqmiS0/GqsIXdxgWjoMtM6BMNWv+nOZDISBrO8PhMFz9/nL2nYxnTM/69GhYhQZVy+iYe6Wy4RejbnR4ZTFzaC3MexwOr4caraDfKxDVPkuxTYfO8sg3m9hzwmr91yxfkquaV2ds34YE6/1qlfqbXyT6dNqiL0YcDtjyjXXDk/NHoen10Ps5CIvIUvRYXCKLd55g7tZjLNl1kgm3tqJf0+peCFop36SJXvm25HhrzvsV7wJizYnf6SEIyTqWPiXNQcvnF5Cc6uC7+66gaU29UY1SoOPola8LKQ1XPm2dsG3YH5a8Yp2w3fqtddWtk+DAAP5vYGNKBAdw9fjlvDbvT+ISdBimUq7SFr3yDQdWwtzH4NhmiO4E/V+Fas0yFDkTn8xLP+/g2w0xlA0NokOdivRqXIWb2kZ5KWilvEu7bpT/caTBhqnWMMzEs9D6DujxDJSumKHY1sNxfLR0L38cjOXw2QR+vL8TzSPynjJZqaJGE73yXwmxsPgVWPMxlChrTYfc5q4s0yGfS0yh55tLqFg6hDdvvJzLamjfvSpetI9e+a+SFayum3t/h+qXw9yx8FEX2Lc0Q7FyocG8dG1T9p6M56p3lzP2m034QsNFKV+jLXrl24yBP3+C+U9Z0yE3uQb6vAjl/+mXP30hif/M2sbPm49SPSyU+lXLMrRtJL2bVNVx96pI81rXjYg0Bh4CKgG/GmM+zGsbTfQqTykJsGI8LHsTMPZwzIf/Ho6Z5jB8tmI/m2LO8vue05y6kERUeClu6xjNLe2jKRmi96lVRY9bE72ITAYGAieMMU2dlvcD3gECgU+MMa84rQsAphpjbs1r/5rolcviYmDBf6xhmOUioM8L1k3LnaZISElzsHjnST5cvIcNB89Ss3xJHuvXkGta1PRi4Eq5n7v76KcA/TK9QCDwPtAfaAIME5Em9rqrgZ+BOfmIWam8hUXA9ZPhjjlWX/7MO2HKQDi29e8iwYEB9G5Sle/u68S0Ee0JChQe+mojT3+/hZ3HznsxeKW8w+WuGxGpBfyU3qIXkY7AOGNMX/v5kwDGmJedtvnZGHNVDvsbBYwCiIqKan3gQM7T2CqVLUcarJ9i3bA88Sy0vhOufMa6cbmThOQ0Hvt2M3O3HCXVYejWoDJPX9WYBlXLeidupdzE7X302ST664F+xpi77efDgfbATGAwUALYbIx5P699a9eNKpCLZ6zhmGs/sYZjXvmMlfQzDcc8eT6JKSv2MW3VQeISUrisRjmeHtCYK+pV8lLgShWM14ZXGmMWG2MeNMbck1eSF5FBIjIxLi7O3WGo4qRUOAx4DUYvh+rNYc6j8FFX2LcsQ7HKZUswtm8jfnukG0/2b8SJ80nc/uka3l64i3OJOqWCKroKkugPA5FOzyPsZS4zxsw2xowKC9MLXZQbVG0Ct82CG6dC0nn4bKB1d6uzBzMUq1imBPd0q8v8h7vStlY4by/czYB3lrHnhPbfq6KpIF03QcAuoCdWgl8L3GyM2ebyi+t89KqwpCTA7+/C8v9hDcd8ONvZMY0xLNh+nIe+2khCShpdG1Smf9NqXNuipg7JVD7P3cMrpwPdscbGHweeNcZMEpEBwNtYwysnG2NeupRgtY9eFZqzh2DB/8G27yEs0hqO2eTaDMMxwZr7fsqK/fzwx2GOnUukVVR5Xr/hcupWLuOlwJXKm1/MdaMteuUx+5fD3Mfh+Fao1cW6u1W1plmKpTkM326I4Znvt5Kc5qBbg8o81Ks+raIqeCFopXLnF4k+nbbolUekpcKGKfZwzDhrorQeT2cZjglw4nwi01cf4rOV+zkTn8w1LWow5sr61KuiLXzlOzTRK5WTi2dg0X9h3SQIDftnOGZA1j75uIspvPvbbj5fdYDUNAdv3diCa1vqFbbKN/hFoteuG+VVx7Za3TkHlls3Oen/OkR3zLZoTOxF7v/yD7YejuO6ljV5vF8jKpct4eGAlcrILxJ9Om3RK68xxjpR+8szcO4wNLvRull5uRpZip5LTOGtX3bx5eqDGAyP92vEiM61kUwndpXyFE30SuVHcjwse8u6WXlAMHQbCx3ug6Csrfadx87z3zk7WLLrJDe0juCBK+sRXbG0F4JWxZ1fJHrtulE+58xemPcU7JoL4XWtG6DU752lmMNheOjrjfy8+QgAgy6vwVs3tiAwQFv3ynP8ItGn0xa98jm7F1j992f+ggb9oN/LEF4nS7HDZxN44tvNLNt9imrlQhneMZqhbSOpWEb771Xh00SvVEGlJsGqD2DJ6+BIgSvGQJdHICRjN02awzBny1G+WnuQ3/ecpmRwIHd0qsVDPesTGqxX16rCo4leKXc5d9S62cmWGVCupn2zk8FZrq4F+PPYOd5esJt5244RVjKYV4c0o1/T6l4IWhUHfnFzcJ29Uvm
FctVhyMdw5zzr4qqZd8Fng+B41mmdGlUrx4ThrXlnaAsqlArmwekb+XZ9jBeCVuof2qJXKj8cabD+U/vq2nPQ9m7o8aR1t6tMDp25yL1frGfr4XO0qxXO4/0b0jo661W4Sl0qv2jRK+V3AgKt5D5mA7S+A9Z+DO+1tu505UjLUDQyvBTf39eJJ/s3Ysexc9w+eS3Ldp/0StiqeNMWvVIFcXSTNTrn4Eqo3gIGvAGRbbMUO3I2gaETV3HwzEUujwijz2XVGNG5tp6sVQWiLXqlPKH65XDnXBj8MZw/BpN6wff3wvnjGYrVKF+Snx/szNi+DUGE1+fvZOB7y1m19zS+0NhSRZteMKWUuySdh6Wvw8oPICgUuj8B7e+BwOAsRedtPcrYbzZzPimVWhVLMbhVBMM7RFOhdIgXAlf+SodXKuUtp3bDvCdgz0Ko1NC6urZujyzFYuOTmb/tGDPWHWLDwbOUDQ3ioZ71uatTbQL0ClvlAk30SnmTMbBrnpXwY/dD40HQ5yWoEJ1t8Y2HzvLWgl0s3XWSFpHlmXBra6qFhXo2ZuV3NNEr5QtSEmHle9aEacZh3bu288MQXDJLUWMMM9YdYtziBtaBAAAWoUlEQVSs7YQGB/D20JZ0a1DZC0Erf6GJXilfEhdjTYW87XsIi4K+L1mt/Gyurt19/Dz3fL6evafi6dGwMrddUYuu9SvrhGkqC030Svmifctg7mNwYjvU6Q79X4PKDbMUS0hOY8KSv5i6cj+xF1MoXyqY4R2iubd7XUqFBHk6auWjNNEr5avSUq3bGC56yZoHv/1o6PY4hJbLUjQhOY25W48yZ8sxFu44Ts3yJRl/c0ta6s3KFX6S6HV4pSrW4k/Br8/Bhs+hdGXoNQ4uHwYB2V/esnz3KR766g/OXExmRKfajOlZn7CSWYduquLDLxJ9Om3Rq2Lt8AaYMxYOr4OItlZ3Ts1W2RY9fi6Rl37ewaxNRygVEsitHaK5q1NtHaFTTGmiV8qfOBywaTosfNZq6bcaDj2fhdKVsi2+OeYs7/66h4U7jhMSGMD1bawLrhpXz9r9o4ouTfRK+aPEOFjyGqyeAMGlocdT1iRqgdmfgN11/DwfLdnL93/E4DAwrF0kz1/TlOBAnd2kONBEr5Q/O7nTGp2zdzFUaWJ159TukmPx0xeSeH3+Tr5ae4i6lUvz/DVN6VQv+18DqujQRK+UvzMGdsyG+U9D3EG47Dro8yKEReS4yYy1h3h21jYSUtJoVK0sT1/VmC719aKrokoTvVJFRUoC/P4OLP8fSAB0+Td0HAPB2Z+AjbuYwgdL9jB5+T5S0gzdG1Zm3KDLqFWpdLbllf/SRK9UURN7AH552mrlV6gF/V6BBv2yvboWIC4hhdfn/8m0VQcRgZFd6vB4v0Z6hW0RooleqaLqr0XWzU5O7YR6vaDfq1CpXs7FT17gsZmbWX8glibVy/HR8NZEhpfyYMCqsHg10YvItcBVQDlgkjHml9zKa6JXKp/SUmD1R7D4FUhNhI73QdexUKJsjpt8smwvL/68A4A+Tary9FWNia6o3Tn+zO13mBKRySJyQkS2ZlreT0R2isgeEXkCwBjzgzFmJDAauCm/wSul8hAYDFc8AGPWQ/MbrT788W1h8wzrJG427u5Sh7kPdaFPk6r8sv043V5fzLRVBzwcuPIGl1v0ItIVuABMNcY0tZcFAruA3kAMsBYYZozZbq9/E/jCGLMht31ri16pAjq0FuaOhSN/QFRHazhm9eY5Fl+x5xSPfLOJo3GJdGtQmXeGtqB8Kb27lb9xe4veGLMUOJNpcTtgjzFmrzEmGfgKuEYsrwJzc0ryIjJKRNaJyLqTJ0+6GoZSKjuRbeHu32DQu3BqF0zsBj/9Gy5m/pO1XFGvEose7U7HOhVZsuskrV5YwMtzdpCQnObhwJUnFPTyuZrAIafnMfayMUAv4HoRGZ3dhsaYicaYNsaYNpUr6zhfpQosIABa325157QdCes/hfdaw7rJ4MiawEODA5k+qgOf3dWO6mEl+WjpXhr/Zx5vzN+pNywvYgrlOmljzLvGmNbGmNHGmAk5lRORQSIyMS4urjDCUKp4KlkBBrwG9yyDKo3hp3/Bxz3g4Opsi3drUJnfn7iS94a1pERQAOMX7aHr64tYfyDWw4GrwlLQRH8YiHR6HmEvc4kxZrYxZlRYWFgBw1BKZVGtKdzxMwyZBBdOwuQ+8P1oOH882+KDLq/Bhv/rzaiudTh0JoEhH67g319vJD4p1cOBK3fL1/BKEakF/OR0MjYI62RsT6wEvxa42RizzcX96Xz0SnlC0gVY9gasGA9BodD9CWh/jzV6JxsxsRe574sNbI6xfm33vawq7w5rSYmgQE9GrfLg9nH0IjId6A5UAo4DzxpjJonIAOBtIBCYbIx5Kb/B6qgbpTzk9F/WxVZ7FkClhtD/VajbI8fiX689yJPfbcFhp4kRnWvzWL+GmvB9hF4Zq5TKnjGwax7MewJi90Pjq62blZePyqG4YdqqA/zfj//8UJ96Vzu6NtBBFN7mF4leu26U8qKURFj5Hix903re+V/Q6UEILplt8dQ0B09+t4Vv1scA1kncibe11ta9F/lFok+nLXqlvOjsIfjlGdj+A5SPhn4vQ8MBOU6Wtv5ALEM+XPH388/uakc3bd17hdsvmCoMOrxSKR9QPhJu/AxumwXBpeCrm2HaEDiV/a/s1tEV2PfyAIa0subFv33yGh6ZsQmHw/uNRpU9bdErpf6RlgJrPobFL1vz4OcxWdqqvacZOnHV388/vbMtPRpW8VS0xZ523SilLt2FE7DwOdg4DcpWh94vQLPrs+3OSUxJY/S09SzeaU1l0qxmGBNva031sOz7+pX7aKJXShVchsnSrrCuuK3WLNuiW2LiuG3yamIvpgAwtm9D7u+R8zz5quD8ItHrqBul/IDDAX98Dr8+Bwmx0GYE9HgKSoVnW/yzFft5dpY1FLNm+ZLMeqATFcuU8GTExYZfJPp02qJXyg8kxMKi/8LaTyC0PPR6FloOh4CswytPXUii7/+Wcjo+GdCROYXFL0bdKKX8SMkKMOB1uGcpVG4Esx+Cj6+0uncyqVSmBOue6cXwDtGANTLn6vHLOZeY4umoFZrolVL5Va0Z3DnHniztOEzqBT/cZ53AdSIivHBtU+Y93AWAzTFxNB/3C2/9spPUNIc3Ii+2tI9eKXXpki7A0tdh5fvWFbXdn4B2o7JMlpaa5uD1+Tv5aOleAMqWCOLXR7tRpWyoN6IuMrSPXinlOaf2wLzHYc9Cq1un/2tQp1uWYqcvJNH/nWWcOJ8EwIe3tKJ/s+qejrbI0D56pZTnVKoHt8yEodOtC62mXg0zbrOmV3BSsUwJVj/Vk3u61QHg3i828O8ZG0nRrpxCpS16pZR7pSTCindh2VvW8y6PwBVjIDhjN43znDlhJYNZ/Gh3KpTWG5Tnh7bolVLeERwK3R6DB9ZAgz6w6EX4oD38OceaItnWOroCm8f1oVKZEOISUmj5wgIW7zyRy47VpdJJzZRShaN8FNw4FW77EQJLwF
fD4IsbrP58W7nQYFY92ZMb21gTpN3x6VrGfqMTpLmbdt0opQpfWgqsmQiLXobUROh4vz1ZWpm/i2yJiWPQ+OUAtKsdzlcjOxAQkP1UycqiXTdKKd8RGGwl9zHrodkN8PvbML4tbJn5d3dOs4gwVj/VE4A1+85Q56k5rNt/xptRFxma6JVSnlO2Klz3IYxYAGUqw7cjYMpVcGwrAFXLhbJlXB861LHm0bl+wkreWajX2BSUJnqllOdFtoORi2Dg23BiB3zUBeaMhYRYyoYG8+XdHfjgllYA/G/hLsbN2pbHDlVuNNErpbwjIBDa3Gl157S5y5os7b3WsP4zAjAMaFaduQ9Z0ydMWbGfNi8u4PDZBC8H7Z800SulvKtUOFz1JoxaApUawOwH4ZMrIWYdjauXY/njPWgTXYFTF5Lp9Mpv/HEwFl8YROJPdHilUso3VG8Od86FwR/DuaPwSU/44X4igi8w456OXN/aGoJ53Qcr+HDJXySmpHk5YP+hwyuVUr4n6TwseQ1WfWhNltbjKRytR/Dj1hP86+tNANzcPoqxfRoW66tpdXilUsp/lSgLfV6A+1ZCRBuY9wQBE7tyXfm9zHu4C9EVS/Hl6oPc8/l6jsZpv31eNNErpXxXpfpw63dw0xeQEg+fDaLRsgf5emgkVzaqwpr9Z+j86iJiYi96O1KfpoleKeXbRKDxQLh/DXR/CnbOpdrULrwX8SuPXhlNmsPQ+dVFzN1y1NuR+ixN9Eop/xBcEro/biX8ej0pvfxl7t9xK59ecZrAAOHpH7byyIxN3o7SJ2miV0r5lwrRcNM0GP49EhhMjw1j+LXaeJqXPMUPGw/zxvydnLFvSq4smuiVUv6p7pUw+nfo8yK1Lmxm8sUxPBH8NZMWbWX6moN6I3Inbk/0IlJHRCaJyEx371sppTIICrFuajJmHQFNhzBSfuC3Eo+yY8EUer6xWC+ssrmU6EVksoicEJGtmZb3E5GdIrJHRJ4AMMbsNcaMKIxglVIqW2WrweCP4K75lKtYjfEh7/Fu0n944/PvWbNPZ8B0tUU/BejnvEBEAoH3gf5AE2CYiDRxa3RKKZUfUR0o/cByDnR4kSaBh/jXXyOI//ERSDjr7ci8yqVEb4xZCmT+WmwH7LFb8MnAV8A1bo5PKaXyJyCQ6H5jCHtsMwtL9aNb7PecfrUZW38aD47ieRPygvTR1wScb/EeA9QUkYoiMgFoKSJP5rSxiIwSkXUisu7kyZMFCEMppbJRKpyyQ95j0mWfss9RjabrniZl4pWYmOI33UqQu3dojDkNjHah3ERgIlhz3bg7DqWU6lSvEp3qXUebnaXpnLCIp45+SZVPekLL4dDzWevmJ8VAQVr0h4FIp+cR9jKX6eyVSilP+Oi2NrS/9l76O95ieZVhmE3TMe+1glUTIC3V2+EVuoIk+rVAfRGpLSIhwFBgVn52YIyZbYwZFRYWVoAwlFIqd62jwxnWLorQ0hW49eAgeiW8zF8hDWHe4/BRV9i/3NshFipXh1dOB1YCDUUkRkRGGGNSgQeA+cAOYIYxJl/3+9IWvVLKk16/vjlP9m9ESoX6PFPmeesK26Tz1n1rv7kT4vLVKeE3dD56pVSxc8ena1h/IJa2tcJpXCmIsaXnwe9vgwRA10eh4wMQVMLbYebJL+aj1xa9UsobBjStTq2Kpdl2JI73lx8htas9WVrdK+HX5+GDDrDrF2+H6TZeTfTaR6+U8oYb20Yye0xn7u5cB4CLKWmY8lGYm6ZZ899LIHx5A3x5E5zZ6+VoC04nNVNKFVulS1gjzJuP+4XaT87h3mkboF5PuHcF9H7BOkn7fnv49QVIjvdytJfO7ePo80NEBgGD6tWr580wlFLFVP+m1Yi9mExKmoN5W4+x49g5a0VQCHR6EJrdAAufhWVvwKavoO+L0ORa62YofkS7bpRSxVaF0iHc36MeD/dqQPOIMBJT0jIWKFcdBk+EO+dBqQrwzR3w2SA4scMr8V4q7bpRSikgNDiQUxeS6f/OMvq/s4xbPln1T+KP7gijlsBVb8KxLfBhJ5j3JCT6x0ASHXWjlFLA1ZfXoGejKkRUKElwoPD7ntPExCb8UyAgENreDWM2QKvbYNWH8F5r+GOaz0+WpuPolVIqk3lbjzF62nrmPNiFJjXKZV/oyEaYMxZi1kDNNjDgdajZyqNx+sU4eqWU8kUlgqzUmJSalnOhGi3grvlw7QQ4exA+vhJmjYH4Ux6K0nVeHXWjlFK+KD3R3/HpWoIDrcf9m1bjhWubZiwYEAAthkGjq2DJq7B6Amz/EXo8A23ugkDfSLHaR6+UUpm0iCrPqK51uKp5dfpcVpVSIYGs2ns65w1Cy0Hfl6zx9zVawtyxMLEb7P/dc0HnQvvolVIqDw999QebDp1l8dgeeRc2BnbMgvlPQ9whaHo99HkBytVwe1zaR6+UUm4SFBBASpqLjWIRaHKNNXdOt8dhx2x4rw0s/x+kJhVuoDnQRK+UUnkICRJS0vI5hDKkFPR4Cu5fDXW6w8Jx8EFH2L2wECLMnW+cKVBKKR8WHBjAyQtJdHntt7+XCcLDveozuFVE7huH14ZhX1oJft7j8MUQaDgA+v7XWucBOteNUkrlYUirCC4kpYJT782crUdZs+9M3ok+Xf1eUHslrPoAlrxmTZbW6SHo/C+r9V+IvJrojTGzgdlt2rQZ6c04lFIqN5dHluetyBYZlq3ed4ZURz4HswSFQOeHofmNsOA/sPQ1KFUROox2Y7TZvGyh7l0ppYqowAAhLb+JPl25GjDkE2g70hqOWcg00Sul1CUICpD8t+gzi2rvnmDyoKNulFLqElgtet+ezCydJnqllLoEgQHi+th6L9OuG6WUugRBgcLqvacZ8uGKLOtaRpbnmYFNvBBV9nSuG6WUugQ3tomkeUR5SgYHZvh35GwCM9Yd8nZ4GehcN0op5UbPzd7GzHUxbHmub6G/ls51o5RSXhAoQpoPNKCdaaJXSik3CggQHJrolVKq6AoQoaDD691NE71SSrlRgIDDxzK9JnqllHIjq0WviV4ppYosq48efGFEYzpN9Eop5UYBYv3vQ3ne/VfGikhp4AMgGVhsjPnC3a+hlFK+KlCsTO8whgDEy9FYXGrRi8hkETkhIlszLe8nIjtFZI+IPGEvHgzMNMaMBK52c7xKKeXTAuwmvS+NpXe1RT8FGA9MTV8gIoHA+0BvIAZYKyKzgAhgi10szW2RKqWUHwiwW/Rjv9lMUEDeLfqb2kbSvk7FQo3JpURvjFkqIrUyLW4H7DHG7AUQka+Aa7CSfgSwkVx+MYjIKGAUQFRUVH7jVkopn3R5ZBi1K5Xmj0OxLpXv2bhqIUdUsD76moDzzD0xQHvgXWC8iFwFzM5pY2PMRGAiWHPdFCAOpZTyGVfUrcSiR7t7O4wM3H4y1hgTD9zpSlm9ObhSShW+ggyvPAxEOj2PsJe5zBgz2xgzKiwsrABhKKWUyk1BEv1aoL6I1BaREGAoMCs/O9D56JVSqvC5OrxyOrASaCgiMSIywhiTCjwAzAd2ADOMMdvy8+Lao
ldKqcLn6qibYTksnwPMudQX1z56pZQqfF6dAkFb9EopVfh0rhullCri9ObgSilVxPnEzcFFJA7YnWlxGBCXw3Pnx5WAU24OKfNrF7R8buuzW+fKMq0P1+oD3F8n7q6P3Mq4ujw/z7U+ik59RBtjKuf5CsYYr/8DJua1zPl5psfrPBFPQcrntt6V9671cen1URh14u76yK2Mq8vz81zro+jUh6v/fKWPPrupEjIvm53LOnfL7/7zKp/belfee3bLtD5yfu5v9ZFbGVeX5/e5O2l9FGzfBakPl/hE101BiMg6Y0wbb8fhK7Q+stI6yUjrI6PiUB++0qIviIneDsDHaH1kpXWSkdZHRkW+Pvy+Ra+UUip3RaFFr5RSKhea6JVSqojTRK+UUkVckUv0IlJaRD4TkY9F5BZvx+NtIlJHRCaJyExvx+ILRORa+9j4WkT6eDsebxORxiIyQURmisi93o7HV9h5ZJ2IDPR2LO7gF4leRCaLyAkR2ZppeT8R2Skie0TkCXvxYGCmMWYkcLXHg/WA/NSHMWavMWaEdyL1jHzWxw/2sTEauMkb8Ra2fNbHDmPMaOBGoJM34vWEfOYQgMeBGZ6NsvD4RaIHpgD9nBeISCDwPtAfaAIME5EmWHe6Sr+XbZoHY/SkKbheH8XBFPJfH8/Y64uiKeSjPkTkauBnCjDluB+Ygot1IiK9ge3ACU8HWVj8ItEbY5YCZzItbgfssVusycBXwDVYNymPsMv4xfvLr3zWR5GXn/oQy6vAXGPMBk/H6gn5PT6MMbOMMf2BItvVmc866Q50AG4GRoqI3+cRt98c3INq8k/LHawE3x54FxgvIldR+JfC+5Js60NEKgIvAS1F5EljzMteic7zcjo+xgC9gDARqWeMmeCN4Lwgp+OjO1Z3ZwmKdos+O9nWiTHmAQARuQM4ZYxxeCE2t/LnRJ8tY0w8cKe34/AVxpjTWP3RCjDGvIvVGFCAMWYxsNjLYfgkY8wUb8fgLv78k+QwEOn0PMJeVlxpfWSk9ZGR1kdWxaZO/DnRrwXqi0htEQkBhgKzvByTN2l9ZKT1kZHWR1bFpk78ItGLyHRgJdBQRGJEZIQxJhV4AJgP7ABmGGO2eTNOT9H6yEjrIyOtj6yKe53opGZKKVXE+UWLXiml1KXTRK+UUkWcJnqllCriNNErpVQRp4leKaWKOE30SilVxGmiV0qpIk4TvVJKFXGa6JVSqoj7f0T90kKCB+39AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "M = COUNTS['the']\n", "yscale('log'); xscale('log'); title('Frequency of n-th most frequent word and 1/n line.')\n", "plot([c for (w, c) in COUNTS.most_common()])\n", "plot([M/i for i in range(1, len(COUNTS)+1)]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(3) Task: Spelling Correction\n", "========\n", "\n", "Given a word *w*, find the most likely correction *c* = `correct(`*w*`)`.\n", "\n", "**Approach:** Try all candidate words *c* that are known words that are near *w*. Choose the most likely one.\n", "\n", "How to balance *near* and *likely*?\n", "\n", "For now, in a trivial way: always prefer nearer, but when there is a tie on nearness, use the word with the highest `WORDS` count. Measure nearness by *edit distance*: the minimum number of deletions, transpositions, insertions, or replacements of characters. By trial and error, we determine that going out to edit distance 2 will give us reasonable results. Then we can define `correct(`*w*`)`:\n", " \n", " \n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def correct(word):\n", " \"Find the best spelling correction for this word.\"\n", " # Prefer edit distance 0, then 1, then 2; otherwise default to word itself.\n", " candidates = (known(edits0(word)) or \n", " known(edits1(word)) or \n", " known(edits2(word)) or \n", " [word])\n", " return max(candidates, key=COUNTS.get)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The functions `known` and `edits0` are easy; and `edits2` is easy if we assume we have `edits1`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def known(words):\n", " \"Return the subset of words that are actually in the dictionary.\"\n", " return {w for w in words if w in COUNTS}\n", "\n", "def edits0(word): \n", " \"Return all strings that are zero edits away from word (i.e., just word itself).\"\n", " return {word}\n", "\n", "def edits2(word):\n", " \"Return all strings that are two edits away from this word.\"\n", " return {e2 for e1 in edits1(word) for e2 in edits1(e1)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for `edits1(word)`: the set of candidate words that are one edit away. For example, given `\"wird\"`, this would include `\"weird\"` (inserting an `e`) and `\"word\"` (replacing a `i` with a `o`), and also `\"iwrd\"` (transposing `w` and `i`; then `known` can be used to filter this out of the set of final candidates). How could we get them? One way is to *split* the original word in all possible places, each split forming a *pair* of words, `(a, b)`, before and after the place, and at each place, either delete, transpose, replace, or insert a letter:\n", "\n", "\n", "
| pairs: | Ø+wird | w+ird | wi+rd | wir+d | wird+Ø | Notes: (a, b) pair |\n", 
    "|---|---|---|---|---|---|---|\n", 
    "| deletions: | Ø+ird | w+rd | wi+d | wir+Ø | | Delete first char of b |\n", 
    "| transpositions: | Ø+iwrd | w+rid | wi+dr | | | Swap first two chars of b |\n", 
    "| replacements: | Ø+?ird | w+?rd | wi+?d | wir+? | | Replace char at start of b |\n", 
    "| insertions: | Ø+?+wird | w+?+ird | wi+?+rd | wir+?+d | wird+?+Ø | Insert char between a and b |\n", 
    "
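\nFor a word of length *n*, these operations produce *n* deletions, *n*-1 transpositions, 26*n* replacements, and 26(*n*+1) insertions, or 54*n*+25 candidate strings in all (fewer once the `set` in `edits1` below removes duplicates).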
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def edits1(word):\n", " \"Return all strings that are one edit away from this word.\"\n", " pairs = splits(word)\n", " deletes = [a+b[1:] for (a, b) in pairs if b]\n", " transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]\n", " replaces = [a+c+b[1:] for (a, b) in pairs for c in alphabet if b]\n", " inserts = [a+c+b for (a, b) in pairs for c in alphabet]\n", " return set(deletes + transposes + replaces + inserts)\n", "\n", "def splits(word):\n", " \"Return a list of all possible (first, rest) pairs that comprise word.\"\n", " return [(word[:i], word[i:]) \n", " for i in range(len(word)+1)]\n", "\n", "alphabet = 'abcdefghijklmnopqrstuvwxyz'" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('', 'wird'), ('w', 'ird'), ('wi', 'rd'), ('wir', 'd'), ('wird', '')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits('wird')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'wird'}\n" ] } ], "source": [ "print(edits0('wird'))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'wir', 'widr', 'waird', 'yird', 'twird', 'wgird', 'wiyrd', 'ird', 'wierd', 'wirf', 'vird', 'word', 'wiro', 'wfird', 'widrd', 'xwird', 'wire', 'dird', 'wirda', 'wirdq', 'hird', 'wprd', 'bird', 'wijd', 'wirdm', 'wizd', 'nwird', 'wirl', 'wigd', 'wirdr', 'wiri', 'wirdx', 'wiqd', 'wxird', 'wlird', 'ewird', 'wirid', 'wrird', 'iwird', 'wiwrd', 'nird', 'wirld', 'wrd', 'wirzd', 'gird', 'wrrd', 'wqrd', 'wirqd', 'wicd', 'wbird', 'wicrd', 'wircd', 'wijrd', 'wikrd', 'wixd', 'cwird', 'wifrd', 'wirdu', 'pwird', 'wied', 'wsird', 'wiid', 'sird', 'owird', 'whird', 'wizrd', 'oird', 'wiprd', 'wirdd', 'iird', 'wirdh', 'wiurd', 'wirwd', 'wpird', 'uird', 'wirh', 'wtird', 'wimd', 'fird', 'wind', 'wiud', 'wirtd', 'wirm', 'eird', 'gwird', 'wirdg', 'zwird', 'wiod', 'ward', 'wirq', 'wirj', 'wirhd', 'jird', 'wirrd', 'wirpd', 'wirn', 'wiru', 'wiird', 'wirs', 'wzrd', 'mird', 'wirg', 'wirv', 'wirp', 'wlrd', 'wdrd', 'wuird', 'wirdf', 'wirad', 'wdird', 'wkrd', 'wigrd', 'kwird', 'wikd', 'wvrd', 'wiard', 'wiord', 'wixrd', 'wirdj', 'wirw', 'witrd', 'wirdt', 'cird', 'wfrd', 'whrd', 'wzird', 'wiry', 'wqird', 'wirz', 'dwird', 'wirb', 'wvird', 'wirdc', 'xird', 'wirk', 'wyird', 'bwird', 'qwird', 'wirod', 'wirdv', 'wirgd', 'wtrd', 'wirdk', 'rird', 'wurd', 'uwird', 'wirmd', 'wisrd', 'wirfd', 'werd', 'wirdb', 'hwird', 'wmird', 'wcrd', 'wira', 'wiqrd', 'wirsd', 'wirde', 'wirds', 'fwird', 'wnird', 'wwird', 'wrid', 'wmrd', 'wiyd', 'wbrd', 'wirr', 'rwird', 'wwrd', 'qird', 'wsrd', 'wisd', 'wifd', 'wirt', 'wirvd', 'zird', 'wirc', 'winrd', 'wgrd', 'wilrd', 'weird', 'wjird', 'wirdw', 'iwrd', 'tird', 'wirdn', 'awird', 'wirdo', 'woird', 'wirdi', 'wirdy', 'wirkd', 'wihd', 'wivd', 'wcird', 'wnrd', 'mwird', 'wihrd', 'wivrd', 'vwird', 'wirud', 'wipd', 'wirxd', 'wirnd', 'wiad', 'witd', 'wibd', 'wirjd', 'wirbd', 'wjrd', 'wirdl', 'wkird', 'widd', 'wid', 'kird', 'wird', 'wibrd', 'wirx', 'pird', 'wimrd', 'wired', 'wiwd', 'wxrd', 'wirdp', 'wiryd', 'jwird', 'swird', 'lwird', 'ywird', 'lird', 'aird', 'wild', 'wyrd', 'wirdz'}\n" ] } ], "source": [ "print(edits1('wird'))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"24254\n" ] } ], "source": [ "print(len(edits2('wird')))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map(correct, tokens('Speling errurs in somethink. Whutever; unusuel misteakes everyware?'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we make the output prettier than that?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def correct_text(text):\n", " \"Correct all the words within a text, returning the corrected text.\"\n", " return re.sub('[a-zA-Z]+', correct_match, text)\n", "\n", "def correct_match(match):\n", " \"Spell-correct word in match, and preserve proper upper/lower/title case.\"\n", " word = match.group()\n", " return case_of(word)(correct(word.lower()))\n", "\n", "def case_of(text):\n", " \"Return the case-function appropriate for text: upper, lower, title, or just str.\"\n", " return (str.upper if text.isupper() else\n", " str.lower if text.islower() else\n", " str.title if text.istitle() else\n", " str)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map(case_of, ['UPPER', 'lower', 'Title', 'CamelCase'])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Spelling Errors IN something. Whatever; unusual mistakes?'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correct_text('Speling Errurs IN somethink. Whutever; unusuel misteakes?')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Audience says: tumbler ...'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correct_text('Audiance sayzs: tumblr ...')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far so good. You can probably think of a dozen ways to make this better. Here's one: in the text \"three, too, one, blastoff!\" we might want to correct \"too\" with \"two\", even though \"too\" is in the dictionary. We can do better if we look at a *sequence* of words, not just an individual word one at a time. But how can we choose the best corrections of a sequence? The ad-hoc approach worked pretty well for single words, but now we could use some real theory ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(4) Models: Word and Sequence Probabilities\n", "===\n", "\n", "If we have a bag of words, what's the probability of picking a particular word out of the bag? We'll denote that probability as $P(w)$. To create the function `P` that computes this probability, we define a function, `pdist`, that takes as input a `Counter` (that is, a bag of words) and returns a function that acts as a probability distribution over all possible words. In a probability distribution the probability of each word is between 0 and 1, and the sum of the probabilities is 1." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def pdist(counter):\n", " \"Make a probability distribution, given evidence from a Counter.\"\n", " N = sum(counter.values())\n", " return lambda x: counter[x]/N\n", "\n", "Pword = pdist(COUNTS)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.07240666434449033 the\n", "0.008842968103249388 is\n", "0.0008215075749693518 most\n", "0.0002596615352601365 common\n", "0.0002696137195383996 word\n", "0.019949605757790978 in\n", "0.00019090098933759167 english\n" ] } ], "source": [ "# Print probabilities of some words\n", "for w in tokens('\"The\" is most common word in English'):\n", " print(Pword(w), w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, what is the probability of a *sequence* of words? Use the definition of a joint probability:\n", "\n", "$P(w_1 \\ldots w_n) = P(w_1) \\times P(w_2 \\mid w_1) \\times P(w_3 \\mid w_1 w_2) \\ldots \\times \\ldots P(w_n \\mid w_1 \\ldots w_{n-1})$\n", "\n", "In the bag of words model, each word is drawn from the bag *independently* of the others. So $P(w_2 \\mid w_1) = P(w_2)$, and we have:\n", " \n", "$P(w_1 \\ldots w_n) = P(w_1) \\times P(w_2) \\times P(w_3) \\ldots \\times \\ldots P(w_n)$\n", "\n", " \n", " \n", "\n", "Now clearly this model is wrong; the probability of a sequence depends on the order of the words. But, as the statistician George Box said, *All models are wrong, but some are useful.* The bag of words model, wrong as it is, has many useful applications.\n", " \n", "How can we compute $P(w_1 \\ldots w_n)$? We'll use a different function name, `Pwords`, rather than `P`, and we compute the product of the individual probabilities:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def Pwords(words):\n", " \"Probability of words, assuming each word is independent of others.\"\n", " return product(Pword(w) for w in words)\n", "\n", "def product(nums):\n", " \"Multiply the numbers together. (Like `sum`, but with multiplication.)\"\n", " result = 1\n", " for x in nums:\n", " result *= x\n", " return result" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.983396332800731e-11 this is a test\n", "8.637472023018802e-16 this is a unusual test\n", "0.0 this is a neverbeforeseen test\n" ] } ], "source": [ "tests = ['this is a test', \n", " 'this is a unusual test',\n", " 'this is a neverbeforeseen test']\n", "\n", "for test in tests:\n", " print(Pwords(tokens(test)), test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes—it seems wrong to give a probability of 0 to the last one; it should just be very small. We'll come back to that later. The other probabilities seem reasonable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(5) Task: Word Segmentation\n", "====\n", "\n", "**Task**: *given a sequence of characters with no spaces separating words, recover the sequence of words.*\n", " \n", "\n", "Why? 
Languages with no word delimiters: [不带空格的词](http://translate.google.com/#auto/en/%E4%B8%8D%E5%B8%A6%E7%A9%BA%E6%A0%BC%E7%9A%84%E8%AF%8D)\n", "\n", "In English, sub-genres with no word delimiters ([spelling errors](https://www.google.com/search?q=wordstogether), [URLs](http://speedofart.com)).\n", "\n", "**Approach 1:** Enumerate all candidate segementations and choose the one with highest Pwords\n", "\n", "Problem: how many segmentations are there for an *n*-character text?\n", "\n", "**Approach 2:** Make one segmentation, into a first word and remaining characters. If we assume words are independent \n", "then we can maximize the probability of the first word adjoined to the best segmentation of the remaining characters.\n", " \n", " assert segment('choosespain') == ['choose', 'spain']\n", "\n", " segment('choosespain') ==\n", " max(Pwords(['c'] + segment('hoosespain')),\n", " Pwords(['ch'] + segment('oosespain')),\n", " Pwords(['cho'] + segment('osespain')),\n", " Pwords(['choo'] + segment('sespain')),\n", " ...\n", " Pwords(['choosespain'] + segment('')))\n", " \n", " \n", " \n", "To make this somewhat efficient, we need to avoid re-computing the segmentations of the remaining characters. This can be done explicitly by *dynamic programming* or implicitly with *memoization*. Also, we shouldn't consider all possible lengths for the first word; we can impose a maximum length. What should it be? A little more than the longest word seen so far." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def memo(f):\n", " \"Memoize function f, whose args must all be hashable.\"\n", " cache = {}\n", " def fmemo(*args):\n", " if args not in cache:\n", " cache[args] = f(*args)\n", " return cache[args]\n", " fmemo.cache = cache\n", " return fmemo" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "18" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(len(w) for w in COUNTS)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def splits(text, start=0, L=20):\n", " \"Return a list of all (first, rest) pairs; start <= len(first) <= L.\"\n", " return [(text[:i], text[i:]) \n", " for i in range(start, min(len(text), L)+1)]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('', 'word'), ('w', 'ord'), ('wo', 'rd'), ('wor', 'd'), ('word', '')]\n", "[('r', 'eallylongtext'), ('re', 'allylongtext'), ('rea', 'llylongtext'), ('real', 'lylongtext')]\n" ] } ], "source": [ "print(splits('word'))\n", "print(splits('reallylongtext', 1, 4))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "@memo\n", "def segment(text):\n", " \"Return a list of words that is the most probable segmentation of text.\"\n", " if not text: \n", " return []\n", " else:\n", " candidates = ([first] + segment(rest) \n", " for (first, rest) in splits(text, 1))\n", " return max(candidates, key=Pwords)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['choose', 'spain']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('choosespain')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['speed', 'of', 'art']" ] }, "execution_count": 35, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "segment('speedofart')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "decl = ('wheninthecourseofhumaneventsitbecomesnecessaryforonepeople' +\n", " 'todissolvethepoliticalbandswhichhaveconnectedthemwithanother' +\n", " 'andtoassumeamongthepowersoftheearththeseparateandequalstation' +\n", " 'towhichthelawsofnatureandofnaturesgodentitlethem')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "when in the course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth the separate and equal station to which the laws of nature and of natures god entitle them\n" ] } ], "source": [ "print(' '.join(segment(decl)))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.6043381425711275e-141" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pwords(segment(decl))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.2991253445993077e-281" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pwords(segment(decl+decl))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Pwords(segment(decl+decl+decl))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a problem. We'll come back to it later." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['small', 'and', 'insignificant']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('smallandinsignificant')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['large', 'and', 'insignificant']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment('largeandinsignificant')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.111418791681202e-10\n", "1.0662753919897733e-11\n" ] } ], "source": [ "print(Pwords(['large', 'and', 'insignificant']))\n", "print(Pwords(['large', 'and', 'in', 'significant']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary:\n", " \n", "- Looks pretty good!\n", "- The bag-of-words assumption is a limitation.\n", "- Recomputing Pwords on each recursive call is somewhat inefficient.\n", "- Numeric underflow for texts longer than 100 or so words; we'll need to use logarithms, or other tricks.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# (6) Data: Mo' Data, Mo' Better" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's move up from millions to *billions and billions* of words. Once we have that amount of data, we can start to look at two word sequences, without them being too sparse. I happen to have data files available in the format of `\"word \\t count\"`, and bigram data in the form of `\"word1 word2 \\t count\"`. 
Let's arrange to read them in:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def load_counts(filename, sep='\\t'):\n", " \"\"\"Return a Counter initialized from key-value pairs, \n", " one on each line of filename.\"\"\"\n", " C = Counter()\n", " for line in open(filename):\n", " key, count = line.split(sep)\n", " C[key] = int(count)\n", " return C" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "COUNTS1 = load_counts('../data/text/count_1w.txt')\n", "COUNTS2 = load_counts('../data/text/count_2w.txt')\n", "\n", "P1w = pdist(COUNTS1)\n", "P2w = pdist(COUNTS2)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "333333 588.124220187\n", "286358 225.955251755\n" ] } ], "source": [ "print(len(COUNTS1), sum(COUNTS1.values())/1e9)\n", "print(len(COUNTS2), sum(COUNTS2.values())/1e9)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('of the', 2766332391),\n", " ('in the', 1628795324),\n", " ('to the', 1139248999),\n", " ('on the', 800328815),\n", " ('for the', 692874802),\n", " ('and the', 629726893),\n", " ('to be', 505148997),\n", " ('is a', 476718990),\n", " ('with the', 461331348),\n", " ('from the', 428303219),\n", " ('by the', 417106045),\n", " ('at the', 416201497),\n", " ('of a', 387060526),\n", " ('in a', 364730082),\n", " ('will be', 356175009),\n", " ('that the', 333393891),\n", " ('do not', 326267941),\n", " ('is the', 306482559),\n", " ('to a', 279146624),\n", " ('is not', 276753375),\n", " ('for a', 274112498),\n", " ('with a', 271525283),\n", " ('as a', 270401798),\n", " (' and', 261891475),\n", " ('of this', 258707741),\n", " (' the', 258483382),\n", " ('it is', 245002494),\n", " ('can be', 230215143),\n", " ('If you', 210252670),\n", " ('has been', 196769958)]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "COUNTS2.most_common(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(7) Theory and Practice: Segmentation With Bigram Data\n", "===\n", "\n", "A less-wrong approximation:\n", " \n", "$P(w_1 \\ldots w_n) = P(w_1 \\mid start) \\times P(w_2 \\mid w_1) \\times P(w_3 \\mid w_2) \\ldots \\times \\ldots P(w_n \\mid w_{n-1})$\n", "\n", "This is called the *bigram* model, and is equivalent to taking a text, cutting it up into slips of paper with two words on them, and having multiple bags, and putting each slip into a bag labeled with the first word on the slip. Then, to generate language, we choose the first word from the original single bag of words, and chose all subsequent words from the bag with the label of the previously-chosen word. To determine the probability of a word sequence, we multiply together the conditional probabilities of each word given the previous word. 
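\n\nAs an aside, the generative half of that metaphor is easy to sketch. The hypothetical `sample2` below (named by analogy with `sample` above) rescans all of `COUNTS2` on every step, so it is slow and purely illustrative:\n\n    def sample2(n=10, prev='the'):\n        \"Sample n words; draw each one from the bag labeled with the previous word.\"\n        out = []\n        for _ in range(n):\n            pairs = [(bigram.split()[1], c) for (bigram, c) in COUNTS2.items()\n                     if bigram.startswith(prev + ' ')]\n            if pairs:\n                words, counts = zip(*pairs)\n                prev = random.choices(words, weights=counts)[0]\n            else:                                # no observed continuation: back off to a unigram draw\n                prev = random.choice(list(COUNTS1))\n            out.append(prev)\n        return ' '.join(out)\n\n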
We'll do this with a function, `cPword` for \"conditional probability of a word.\"\n", "\n", "$P(w_n \\mid w_{n-1}) = P(w_{n-1}w_n) / P(w_{n-1}) $" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "def Pwords2(words, prev=''):\n", " \"The probability of a sequence of words, using bigram data, given prev word.\"\n", " return product(cPword(w, (prev if (i == 0) else words[i-1]) )\n", " for (i, w) in enumerate(words))\n", "\n", "P = P1w # Use the big dictionary for the probability of a word\n", "\n", "def cPword(word, prev):\n", " \"Conditional probability of word, given previous word.\"\n", " bigram = prev + ' ' + word\n", " if P2w(bigram) > 0 and P(prev) > 0:\n", " return P2w(bigram) / P(prev)\n", " else: # Average the back-off value and zero.\n", " return P(word) / 2" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.983396332800731e-11\n", "6.413676294377262e-08\n", "1.1802860036709024e-11\n" ] } ], "source": [ "print(Pwords(tokens('this is a test')))\n", "print(Pwords2(tokens('this is a test')))\n", "print(Pwords2(tokens('is test a this')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make `segment2`, we copy `segment`, and make sure to pass around the previous token, and to evaluate probabilities with `Pwords2` instead of `Pwords`." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "@memo \n", "def segment2(text, prev=''): \n", " \"Return best segmentation of text; use bigram data.\" \n", " if not text: \n", " return []\n", " else:\n", " candidates = ([first] + segment2(rest, first) \n", " for (first, rest) in splits(text, 1))\n", " return max(candidates, key=lambda words: Pwords2(words, prev))" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['choose', 'spain']\n", "['speed', 'of', 'art']\n", "['small', 'and', 'in', 'significant']\n", "['large', 'and', 'in', 'significant']\n" ] } ], "source": [ "print(segment2('choosespain'))\n", "print(segment2('speedofart'))\n", "print(segment2('smallandinsignificant'))\n", "print(segment2('largeandinsignificant'))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['far', 'out', 'in', 'the', 'un', 'chart', 'ed', 'back', 'waters', 'of', 'the', 'un', 'fashionable', 'end', 'of', 'the', 'western', 'spiral', 'arm', 'of', 'the', 'galaxy', 'lies', 'a', 'small', 'un', 'regarded', 'yellow', 'sun']\n", "['far', 'out', 'in', 'the', 'uncharted', 'backwaters', 'of', 'the', 'unfashionable', 'end', 'of', 'the', 'western', 'spiral', 'arm', 'of', 'the', 'galaxy', 'lies', 'a', 'small', 'un', 'regarded', 'yellow', 'sun']\n" ] } ], "source": [ "adams = ('faroutintheunchartedbackwatersoftheunfashionableendofthewesternspiral' +\n", " 'armofthegalaxyliesasmallunregardedyellowsun')\n", "print(segment(adams))\n", "print(segment2(adams))" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P1w('unregarded')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['a', 'dry', 'bare', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sit', 'down', 'on', 'or', 'to', 'eat']\n", 
"['a', 'dry', 'bare', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sit', 'down', 'on', 'or', 'to', 'eat']\n" ] } ], "source": [ "tolkien = 'adrybaresandyholewithnothinginittositdownonortoeat'\n", "print(segment(tolkien))\n", "print(segment2(tolkien))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion? Bigram model is a little better, but not much. Hundreds of billions of words still not enough. (Why not trillions?) Could be made more efficient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(8) Theory: Evaluation\n", "===\n", "\n", "So far, we've got an intuitive feel for how this all works. But we don't have any solid metrics that quantify the results. Without metrics, we can't say if we are doing well, nor if a change is an improvement. In general,\n", "when developing a program that relies on data to help make\n", "predictions, it is good practice to divide your data into three sets:\n", "
    \n", "
  1. Training set: the data used to create our spelling\n", " model; this was the big.txt file.\n", "
  2. Development set: a set of input/output pairs that we can\n", " use to rank the performance of our program as we are developing it.\n", "
  3. Test set: another set of input/output pairs that we use\n", " to rank our program after we are done developing it. The\n", " development set can't be used for this purpose—once the\n", " programmer has looked at the development set, it is tainted, because\n", " the programmer might modify the program just to pass the development\n", " test. That's why we need a separate test set that is only looked at\n", " after development is done.\n", "
\n", "\n", "For this program, the training data is the word frequency counts, the development set is the examples like `\"choosespain\"` that we have been playing with, and now we need a test set." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "def test_segmenter(segmenter, tests):\n", " \"Try segmenter on tests; report failures; return fraction correct.\"\n", " return sum([test_one_segment(segmenter, test) \n", " for test in tests]), len(tests)\n", "\n", "def test_one_segment(segmenter, test):\n", " words = tokens(test)\n", " result = segmenter(cat(words))\n", " correct = (result == words)\n", " if not correct:\n", " print('expected', words)\n", " print(' got', result)\n", " return correct\n", "\n", "cat = ''.join\n", "\n", "proverbs = (\"\"\"A little knowledge is a dangerous thing\n", " A man who is his own lawyer has a fool for his client\n", " All work and no play makes Jack a dull boy\n", " Better to remain silent and be thought a fool that to speak and remove all doubt;\n", " Do unto others as you would have them do to you\n", " Early to bed and early to rise, makes a man healthy, wealthy and wise\n", " Fools rush in where angels fear to tread\n", " Genius is one percent inspiration, ninety-nine percent perspiration\n", " If you lie down with dogs, you will get up with fleas\n", " Lightning never strikes twice in the same place\n", " Power corrupts; absolute power corrupts absolutely\n", " Here today, gone tomorrow\n", " See no evil, hear no evil, speak no evil\n", " Sticks and stones may break my bones, but words will never hurt me\n", " Take care of the pence and the pounds will take care of themselves\n", " Take care of the sense and the sounds will take care of themselves\n", " The bigger they are, the harder they fall\n", " The grass is always greener on the other side of the fence\n", " The more things change, the more they stay the same\n", " Those who do not learn from history are doomed to repeat it\"\"\"\n", " .splitlines())" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "expected ['power', 'corrupts', 'absolute', 'power', 'corrupts', 'absolutely']\n", " got ['power', 'corrupt', 's', 'absolute', 'power', 'corrupt', 's', 'absolutely']\n", "expected ['the', 'grass', 'is', 'always', 'greener', 'on', 'the', 'other', 'side', 'of', 'the', 'fence']\n", " got ['the', 'grass', 'is', 'always', 'green', 'er', 'on', 'the', 'other', 'side', 'of', 'the', 'fence']\n" ] }, { "data": { "text/plain": [ "(18, 20)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_segmenter(segment, proverbs)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20, 20)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_segmenter(segment2, proverbs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This confirms that both segmenters are very good, and that `segment2` is slightly better. There is much more that can be done in terms of the variety of tests, and in measuring statistical significance." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(9) Theory and Practice: Smoothing\n", "======\n", "\n", "Let's go back to a test we did before, and add some more test cases:\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.983396332800731e-11 this is a test\n", "8.637472023018802e-16 this is a unusual test\n", "0.0 this is a nongovernmental test\n", "0.0 this is a neverbeforeseen test\n", "0.0 this is a zqbhjhsyefvvjqc test\n" ] } ], "source": [ "tests = ['this is a test', \n", " 'this is a unusual test',\n", " 'this is a nongovernmental test',\n", " 'this is a neverbeforeseen test',\n", " 'this is a zqbhjhsyefvvjqc test']\n", "\n", "for test in tests:\n", " print(Pwords(tokens(test)), test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The issue here is the finality of a probability of zero. Out of the three 15-letter words, it turns out that \"nongovernmental\" is in the dictionary, but if it hadn't been, if somehow our corpus of words had missed it, then the probability of that whole phrase would have been zero. It seems that is too strict; there must be some \"real\" words that are not in our dictionary, so we shouldn't give them probability zero. There is also a question of likelihood of being a \"real\" word. It does seem that \"neverbeforeseen\" is more English-like than \"zqbhjhsyefvvjqc\", and so perhaps should have a higher probability.\n", "\n", "We can address this by assigning a non-zero probability to words that are not in the dictionary. This is even more important when it comes to multi-word phrases (such as bigrams), because it is more likely that a legitimate one will appear that has not been observed before.\n", "\n", "We can think of our model as being overly spiky; it has a spike of probability mass wherever a word or phrase occurs in the corpus. What we would like to do is *smooth* over those spikes so that we get a model that does not depend on the details of our corpus. The process of \"fixing\" the model is called *smoothing*.\n", "\n", "For example, Laplace was asked what's the probability of the sun rising tomorrow. From data that it has risen $n/n$ times for the last *n* days, the maximum liklihood estimator is 1. But Laplace wanted to balance the data with the possibility that tomorrow, either it will rise or it won't, so he came up with $(n + 1) / (n + 2)$.\n", "\n", "\n", " \n", " \n", "\n", "
What we know is little, and what we are ignorant of is immense.
— Pierre Simon Laplace, 1749-1827" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "def pdist_additive_smoothed(counter, c=1):\n", "    \"\"\"The probability of word, given evidence from the counter.\n", "    Add c to the count for each item, plus the 'unknown' item.\"\"\"\n", "    N = sum(counter.values())           # Amount of evidence\n", "    Nplus = N + c * (len(counter) + 1)  # Evidence plus fake observations\n", "    return lambda word: (counter[word] + c) / Nplus \n", "\n", "P1w = pdist_additive_smoothed(COUNTS1)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.7003201005861308e-12" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "P1w('neverbeforeseen')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But now there's a problem ... we now have previously-unseen words with non-zero probabilities. And maybe $10^{-12}$ is about right for words that are observed in text: that is, if I'm *reading* a new text, the probability that the next word is unknown might be around $10^{-12}$. But if I'm *manufacturing* 20-letter sequences at random, the probability that one will be a word is much, much lower than $10^{-12}$. \n", "\n", "Look what happens:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['this',\n", " 'is',\n", " 'a',\n", " 'test',\n", " 'of',\n", " 'segment',\n", " 'at',\n", " 'i',\n", " 'on',\n", " 'of',\n", " 'along',\n", " 'sequence',\n", " 'of',\n", " 'words']" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment.cache.clear()\n", "segment('thisisatestofsegmentationofalongsequenceofwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two problems:\n", " \n", "First, we don't have a clear model of the unknown words. We just say \"unknown\" but\n", "we don't distinguish likely unknown from unlikely unknown. For example, is an 8-character unknown more likely than a 20-character unknown?\n", "\n", "Second, we don't take into account evidence from *parts* of the unknown. For example, \n", "\"unglobulate\" versus \"zxfkogultae\".\n", "\n", "For our next approach, *Good-Turing* smoothing re-estimates the probability of zero-count words, based on the probability of one-count words (and can also re-estimate for higher-number counts, but that is less interesting).\n", "\n", "
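" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In its simplest form, the Good-Turing idea says that the total probability mass of all *unseen* words can be estimated as $N_1/N$, where $N_1$ is the number of word types that were seen exactly once and $N$ is the total number of tokens. Here is a minimal sketch of that estimate, using the `COUNTS` Counter from earlier (the hack further below refines the idea by also conditioning on word length):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simplest Good-Turing estimate: P(all unseen words) is approximately N1 / N.\n", "N1 = sum(1 for w in COUNTS if COUNTS[w] == 1)   # number of word types seen exactly once\n", "N = sum(COUNTS.values())                        # total number of tokens\n", "N1 / N" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "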
I. J. Good (1916 - 2009)             Alan Turing (1912 - 1954)\n", "\n", "So, how many one-count words are there in `COUNTS`? (There aren't any in `COUNTS1`.) And what are their word lengths? Let's find out:\n" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(7, 1357),\n", " (8, 1356),\n", " (9, 1175),\n", " (6, 1113),\n", " (10, 938),\n", " (5, 747),\n", " (11, 627),\n", " (12, 398),\n", " (4, 368),\n", " (13, 215),\n", " (3, 159),\n", " (14, 112),\n", " (2, 51),\n", " (15, 37),\n", " (16, 10),\n", " (17, 7)]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "singletons = (w for w in COUNTS if COUNTS[w] == 1)\n", "\n", "lengths = list(map(len, singletons))  # a list (not an iterator), so we can reuse it below\n", "\n", "Counter(lengths).most_common()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0012277376423275445" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1357 / sum(COUNTS.values())" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "from matplotlib.pyplot import hist  # hist was not imported in the preliminaries\n", "\n", "hist(lengths, bins=len(set(lengths)));" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "def pdist_good_turing_hack(counter, onecounter, base=1/26., prior=1e-8):\n", "    \"\"\"The probability of word, given evidence from the counter.\n", "    For unknown words, look at the one-counts from onecounter, based on length.\n", "    This gets ideas from Good-Turing, but doesn't implement all of it.\n", "    prior is an additional factor to make unknowns less likely.\n", "    base is how much we attenuate probability for each letter beyond longest.\"\"\"\n", "    N = sum(counter.values())\n", "    N2 = sum(onecounter.values())\n", "    lengths = map(len, [w for w in onecounter if onecounter[w] == 1])\n", "    ones = Counter(lengths)\n", "    longest = max(ones)\n", "    return (lambda word: \n", "            counter[word] / N if (word in counter) \n", "            else prior * (ones[len(word)] / N2 or \n", "                          ones[longest] / N2 * base ** (len(word)-longest)))\n", "\n", "# Redefine P1w\n", "P1w = pdist_good_turing_hack(COUNTS1, COUNTS)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['this',\n", " 'is',\n", " 'a',\n", " 'test',\n", " 'of',\n", " 'segment',\n", " 'at',\n", " 'i',\n", " 'on',\n", " 'of',\n", " 'a',\n", " 'very',\n", " 'long',\n", " 'sequence',\n", " 'of',\n", " 'words']" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segment.cache.clear()\n", "segment('thisisatestofsegmentationofaverylongsequenceofwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was somewhat unsatisfactory. We really had to crank up the prior, specifically because the process of running `segment` generates so many non-word candidates (and also because there will be fewer unknowns with respect to the billion-word `WORDS1` than with respect to the million-word `WORDS`). It would be better to separate out the prior from the word distribution, so that the same distribution could be used for multiple tasks, not just for this one.\n", "\n", "Now let's think for a short while about smoothing **bigram** counts. Specifically, what if we haven't seen a bigram sequence, but we've seen both words individually? 
For example, to evaluate P(\"Greenland\") in the phrase \"turn left at Greenland\", we might have three pieces of evidence:\n", "\n", " P(\"Greenland\")\n", " P(\"Greenland\" | \"at\")\n", " P(\"Greenland\" | \"left\", \"at\")\n", " \n", "Presumably, the first would have a relatively large count, and thus large reliability, while the second and third would have decreasing counts and reliability. With *interpolation smoothing* we combine all three pieces of evidence, with a linear combination:\n", " \n", "$P(w_3 \\mid w_1w_2) = c_1 P(w_3) + c_2 P(w_3 \\mid w_2) + c_3 P(w_3 \\mid w_1w_2)$\n", "\n", "How do we choose $c_1, c_2, c_3$? By experiment: train on training data, maximize $c$ values on development data, then evaluate on test data.\n", " \n", "However, when we do this, we are saying, with probability $c_1$, that a word can appear anywhere, regardless of previous words. But some words are more free to do that than other words. Consider two words with similar probability:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.733146236612908e-05\n", "7.724949668890416e-05\n" ] } ], "source": [ "print(P1w('francisco'))\n", "print(P1w('individuals'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "They have similar unigram probabilities but differ in their freedom to be the second word of a bigram:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['San francisco', 'san francisco']\n" ] } ], "source": [ "print([bigram for bigram in COUNTS2 if bigram.endswith('francisco')])" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[' individuals', 'For individuals', 'These individuals', 'about individuals', 'affected individuals', 'all individuals', 'among individuals', 'and individuals', 'are individuals', 'as individuals', 'between individuals', 'both individuals', 'by individuals', 'certain individuals', 'different individuals', 'few individuals', 'following individuals', 'for individuals', 'from individuals', 'healthy individuals', 'help individuals', 'in individuals', 'income individuals', 'infected individuals', 'interested individuals', 'many individuals', 'minded individuals', 'more individuals', 'of individuals', 'on individuals', 'or individuals', 'other individuals', 'private individuals', 'qualified individuals', 'some individuals', 'such individuals', 'that individuals', 'the individuals', 'these individuals', 'those individuals', 'to individuals', 'two individuals', 'where individuals', 'which individuals', 'with individuals']\n" ] } ], "source": [ "print([bigram for bigram in COUNTS2 if bigram.endswith('individuals')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intuitively, words that appear in many bigrams before are more likely to appear in a new, previously unseen bigram. In *Kneser-Ney* smoothing (Reinhard Kneser, Hermann Ney) we multiply the bigram counts by this ratio. But I won't implement that here, because The Count never covered it.\n", "\n", "(10) One More Task: Secret Codes\n", "===\n", "\n", "Let's tackle one more task: decoding secret codes. We'll start with the simplest of codes, a rotation cipher, sometimes called a shift cipher or a Caesar cipher (because this was state-of-the-art crypotgraphy in 100 BC). 
First, a method to encode:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "def rot(msg, n=13): \n", " \"Encode a message with a rotation (Caesar) cipher.\" \n", " return encode(msg, alphabet[n:]+alphabet[:n])\n", "\n", "def encode(msg, key): \n", " \"Encode a message with a substitution cipher.\" \n", " table = str.maketrans(upperlower(alphabet), upperlower(key))\n", " return msg.translate(table) \n", "\n", "def upperlower(text): return text.upper() + text.lower() " ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Uijt jt b tfdsfu nfttbhf.'" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rot('This is a secret message.', 1)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Guvf vf n frperg zrffntr.'" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rot('This is a secret message.', 13)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a secret message.'" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rot(rot('This is a secret message.'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now decoding is easy: try all 26 candidates, and find the one with the maximum Pwords:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "def decode_rot(secret):\n", " \"Decode a secret message that has been encoded with a rotation cipher.\"\n", " candidates = [rot(secret, i) for i in range(len(alphabet))]\n", " return max(candidates, key=lambda msg: Pwords(tokens(msg)))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nyf befnj kyv rejnvi?\n", "Who knows the answer?\n" ] } ], "source": [ "msg = 'Who knows the answer?'\n", "secret = rot(msg, 17)\n", "\n", "print(secret)\n", "print(decode_rot(secret))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make it a tiny bit harder. When the secret message contains separate words, it is too easy to decode by guessing that the one-letter words are most likely \"I\" or \"a\". So what if the encode routine mushed all the letters together:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "def encode(msg, key): \n", " \"Encode a message with a substitution cipher; remove non-letters.\" \n", " msg = cat(tokens(msg)) ## Change here\n", " table = str.maketrans(upperlower(alphabet), upperlower(key))\n", " return msg.translate(table) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can decode by segmenting. 
We change candidates to be a list of segmentations, and still choose the candidate with the best Pwords: " ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "def decode_rot(secret):\n", " \"\"\"Decode a secret message that has been encoded with a rotation cipher,\n", " and which has had all the non-letters squeezed out.\"\"\"\n", " candidates = [segment(rot(secret, i)) for i in range(len(alphabet))]\n", " return max(candidates, key=lambda msg: Pwords(msg))" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pahdghplmaxtglpxkmablmbfxtgrhgxunxeexk\n", "['who', 'knows', 'the', 'answer', 'this', 'time', 'anyone', 'bu', 'e', 'll', 'er']\n" ] } ], "source": [ "msg = 'Who knows the answer this time? Anyone? Bueller?'\n", "secret = rot(msg, 19)\n", "\n", "print(secret)\n", "print(decode_rot(secret))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-123 p ah d g h p l max t g l p x km a bl mb f x t gr h g x un x e ex k\n", "-132 q b i eh i q m n by u hm q y l n b cm n c g y u h s i h y vo y ff y l\n", "-128 r c j f i jr no c z v in r z mo c d nod h z v it j i z w p z g g z m\n", "-115 sd k g j k so p da w j os an p de op e i a w j u k j ax q ah h an\n", "-144 t el h k l t p q e b x k p t b o q e f p q f j b x k v l k by r b ii b o\n", "-145 u f m il m u q r f cy l qu c p r f g q r g k cy l wm l c z s c j j c p\n", "-141 v g n j m n v r sg d z mr v d q sg h r s h l d z m x nm dat d k k d q\n", "-39 who knows the answer this time anyone bu e ll er\n", "-119 xi p l op x tu if b o t x f s u i j tu j n f b oz p of c v f mm f s\n", "-151 y j q m p q y u v j g c p u y g t v j ku v k o g c pa q pg d w g n n g t\n", "-151 z k r n q r z v w k h d q v z h u w k l v w l p h d q br q he x ho oh u\n", "-97 also r saw x lie r w a iv x l m w x m q i er c s r if y i pp iv\n", "-146 b mt p st b x y m j f s x b j w y m n x y n r j f sd t s j g z j q q j w\n", "-137 c n u q tu cy z n k g ty c k x z no y z os k g t eut k ha k r r k x\n", "-124 do v r u v d z a o l h u z d ly a op z apt l h u f v u li bl s sly\n", "-121 e p w s v we ab pm iv a em z b p q ab qu m iv g w v m j cm t tm z\n", "-142 f q x t w x f bc q n j w b fn a c q r bc r v n j wh x w n k d n u un a\n", "-119 gr y u x y g c dr ok x c go b dr s c d s w ok xi y x o leo v vo b\n", "-119 h s z vy z h des ply d h p c est de t x ply j z y pm f p w w p c\n", "-137 it a w z a i e ft q m ze i q d f tu e f u y q m z ka z q n g q xx q d\n", "-130 j u b x ab j f g u r na f jr eg u v f g v z r n alba r oh r y y re\n", "-125 k v cy bc k g h v sob g k s f h v w g h was o b m c b s p is z z s f\n", "-130 l w d z c d l hi w t p ch l t g i w x h ix b t p c nd c t q j t a at g\n", "-123 m x e a dem i j x u q dim u h j x y i j y c u q doe du r ku b bu h\n", "-124 ny f be fn j k y v re j n vi k y z j k z d v rep fe vs l v c c vi\n", "-134 oz g cf go k l z w s f k o w j l z ak la e w s f q g f w tm w d d w j\n" ] } ], "source": [ "candidates = [segment(rot(secret, i)) for i in range(len(alphabet))]\n", "\n", "for c in candidates:\n", " print(int(log10(Pwords(c))), ' '.join(c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about a general substitution cipher? The problem is that there are 26! substitution ciphers, and we can't enumerate all of them. We would need to search through this space. 
Initially, make some guess at a substitution, then swap two letters; if that looks better, keep going, and if not, try something else. This approach solves most substitution cipher problems, although it can take a few minutes on a message of 100 words or so.\n", "\n", "(∞ and beyond) Where To Go Next\n", "===\n", "\n", "What to do next? Here are some options:\n", " \n", "- **Spelling correction**: Use bigram or trigram context; make a model of spelling errors/edit distance; go beyond edit distance 2; make it more efficient\n", "- **Evaluation**: Make a serious test suite; search for best parameters (e.g. $c_1, c_2, c_3$)\n", "- **Smoothing**: Implement Kneser-Ney and/or Interpolation; do letter *n*-gram-based smoothing\n", "- **Secret Codes**: Implement a search over substitution ciphers\n", "- **Classification**: Given a corpus of texts, each with a classification label, write a classifier that will take a new text and return a label. Examples: spam/no-spam; favorable/unfavorable; what author am I most like; reading level.\n", "- **Clustering**: Group data by similarity. Find synonyms/related words.\n", "- **Parsing**: Representing nested structures rather than linear sequences of words. Relations between parts of the structure. Implicit missing bits. Inducing a grammar.\n", "- **Meaning**: What semantic relations are meant by the syntactic relations?\n", "- **Translation**: Using examples to transform one language into another.\n", "- **Question Answering**: Using examples to transform a question into an answer, either by retrieving a passage, or by synthesizing one.\n", "- **Speech**: Dealing with analog audio signals rather than discrete sequences of characters." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'congratulations'" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correct('cpgratulations')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }