diff --git a/ipynb/Portmantout.ipynb b/ipynb/Portmantout.ipynb index 49fc5c9..4e97467 100644 --- a/ipynb/Portmantout.ipynb +++ b/ipynb/Portmantout.ipynb @@ -4,18 +4,41 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "
Peter Norvig
\n", + "\n", "# Portmantout Words\n", "\n", - "A *portmanteau* is a word that squishes together two overlapping words, like *math* + *athlete* = *mathlete*. You can make up your own, like *eskimo* + *kimono* = *eskimono*. Inspired by Darius Bacon, I covered this as a programming exercise in my 2012 [Udacity course](https://www.udacity.com/course/design-of-computer-programs--cs212). Recently I was re-inspired by [Tom Murphy VII](http://www.cs.cmu.edu/~tom7), who added a new twist: **[portmantout words](http://www.cs.cmu.edu/~tom7/portmantout/)**, which are defined as:\n", + "A [***portmanteau***](https://en.wikipedia.org/wiki/Portmanteau) is a word that squishes together two words, like *math* + *athlete* = *mathlete*, or *tutankhamenability*, the property of being amenable to seeing the Egyptian exhibit. Inspired by [**Darius Bacon**](http://wry.me/), I covered this as a programming exercise in my 2012 [Udacity course](https://www.udacity.com/course/design-of-computer-programs--cs212). Recently I was re-inspired by [**Tom Murphy VII**](http://tom7.org/), who added a new twist: [***portmantout words***](http://www.cs.cmu.edu/~tom7/portmantout/) (*tout* from the French for *all*), which are defined as:\n", "\n", - "> Given a set of words, W, a portmantout of W is a string, S, such that:\n", - "* Each word in W is a substring of S.\n", - "* S is formed by joining together an ordered list, L, of words from W (possibly with repeats).\n", - "* For each word in L, some ending letter(s) must match the starting letters of the next word.\n", - "* When L is joined into S, those overlapping letters appear only once.\n", - "* The shorter the string S, the better.\n", + "> A **portmantout** of a set of words $W$ is a string $S$ such that:\n", + "* Every word in $W$ is a **substring** of $S$.\n", + "* The words **overlap**: each word (except the first) must start at an index that is between the beginning and end of another word.\n", + "* **Nothing else** is in $S$: every letter 
in $S$ comes from the overlapping words. (But a word may be repeated any number of times.)\n", "\n", - "To make sure we understand the definition, I'll define the function `S = portman(L).` To find the overlap between two words, `portman` considers each suffix of the previous word and takes the longest suffix that starts the next word; we drop that from the word. (I'll also define functions to list the prefixes and suffixes of a word.)" + "Although not part of the definition, the goal is to get as short an $S$ as possible, and to do it for a $W$ of 100,000 words or so. Developing a program to do that is the goal of this notebook. My program (also available as [portman.py](portman.py)) helped me discover:\n", + "\n", + "- **preferendumdums**: a political commentary portmantout of {prefer, referendum, dumdums}\n", + "- **fortyphonshore**: a dire weather report portmantout of {forty, typhons, onshore}\n", + "- **allegestionstage**: a brutal theatre critic portmantout of {alleges, egestions, onstage}\n", + "- **skymanipulablearsplittingler**: a nerve-damaging aviator portmantout of {skyman, manipulable, blears, earsplitting, tinglers}\n", + "- **edinburgherselflesslylyricize**: a Scottish music review portmantout of {edinburgh, burghers, herself, selflessly, slyly, lyricize}\n", + "\n", + "\n", + "\n", + "# Program Design\n", + "\n", + "I originally thought I would define a major function, `S = portman(W)`, to generate the portmantout string, and a minor function, `is_portman(W, S)`, to verify the result. But I found the verification process was difficult. For example, given `S = '...helloworld...'`, I would reject that as non-overlapping if I parsed it as `'hello'` + `'world'`, but I would accept it if parsed as `'hell'` + `'low'` + `'world'`. It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended, but discarded the information. 
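The parsing ambiguity can be seen in a minimal sketch (the helper `overlaps` is illustrative only, not part of the notebook's code): both parses cover the same substring `'helloworld'`, but only one of them has every adjacent pair of words overlapping.

```python
def overlaps(a: str, b: str) -> bool:
    """True if some non-empty proper suffix of a equals a prefix of b."""
    return any(b.startswith(a[i:]) for i in range(1, len(a)))

# Two parses of the same substring 'helloworld':
assert not overlaps('hello', 'world')   # parse 1 fails the overlap rule
assert overlaps('hell', 'low')          # parse 2: 'hell' and 'low' share 'l'
assert overlaps('low', 'world')         # ...and 'low' and 'world' share 'w'
```

Only a verifier that knows which parse was intended can decide; hence the proof-based interface described next.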
\n", + "\n", + "Therefore, I decided to change the interface: I'll have one function that takes $W$ as input and returns what I call a **portmantout proof**, $P$. I can gain insight by examining $P$, and I can pass $P$ to a second function that can easily generate the string $S$ while verifying the proof. I decided on the following calling and [naming](https://en.wikipedia.org/wiki/Natalie_Portman) conventions:\n", + "\n", + " P = natalie(W) # Generate a portmantout proof P from a set of words W\n", + " S = portman(P, W) # Verify that the proof is valid and compute the string S\n", + "\n", + "or in other words:\n", + "\n", + " S = portman(natalie(W), W) # Generate a portmantout of W\n", + "\n", + "The proof $P$ is in the form of an ordered list, `[(overlap, word),...]` where each `word` is a member of $W$ and each `overlap` is an integer saying how many letters in the word overlap with the previous word (this should be 0 for the first word and positive for subsequent words). For example:" ] }, { @@ -24,23 +47,25 @@ "metadata": {}, "outputs": [], "source": [ - "def portman(words):\n", - " \"Join a sequence of words, eliminating from each word the overlap with previous word.\"\n", - " prev = words[0]\n", - " result = [prev]\n", - " for word in words[1:]:\n", - " overlap = next(filter(word.startswith, suffixes(prev)))\n", - " result.append(word[len(overlap):])\n", - " prev = word\n", - " return ''.join(result)\n", + "Words = set \n", + "Proof = list\n", "\n", - "def prefixes(word) -> list:\n", - " \"All non-empty prefixes of word, shortest first.\"\n", - " return [word[:i+1] for i in range(len(word))]\n", + "W1: Words = {'anarchy', 'eskimo', 'grammarian', 'kimono', 'monogram'}\n", + "S1: str = 'eskimonogrammarianarchy'\n", + "P1: Proof = [(0, 'eskimo'),\n", + " (4, 'kimono'),\n", + " (4, 'monogram'),\n", + " (4, 'grammarian'),\n", + " (2 , 'anarchy')]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# portman\n", "\n", - "def suffixes(word) 
-> list:\n", - " \"All non-empty suffixes of word, longest first.\"\n", - " return [word[i:] for i in range(len(word))]" + "The function `portman(P, W)` takes a proof $P$ and a set of words $W$ and generates the portmantout string $S$ while verifying that the proof is correct (or raising an `AssertionError` if it is not). Assertions are appropriate because I'm thinking of this as one part of my program verifying the internal logic of another part. If I were running a service to verify other people's proofs, I would not use `assert` statements." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "# An example:\n", - "W = {'on', 'one', 'two', 'won'}\n", - "S = 'twone'\n", - "L = ['two',\n", - " 'won',\n", - " 'on',\n", - " 'one']\n", - "\n", - "assert portman(L) == S\n", - "assert all(w in S for w in W)\n", - "assert set(L) == W\n", - "assert portman(['eskimo', 'kimono', 'monolith']) == 'eskimonolith'\n", - "assert prefixes('word') == ['w', 'wo', 'wor', 'word']\n", - "assert suffixes('word') == ['word', 'ord', 'rd', 'd']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Note that `portman(['one', 'two'])` would raise an error, because there is no overlap from `'one'` to `'two'`.\n", - "\n", - "# Problem-Solving Strategy\n", - "\n", - "I watched part of Tom Murphy's entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI), but then I stopped because I wanted to solve the problem at least partially on my own. Here's my strategy:\n", - "\n", - "1. To start with, I'll need a list of words. Murphy provides [108,709 words](wordlist.asc) that I'll read that into a `Dictionary` data structure.\n", - "2. This is clearly an NP-hard problem, so a greedy solution will suffice.\n", - "3. In my [traveling salesperson notebook](TSP.ipynb) I show a greedy nearest neighbor solution: start at one city and continually go to the nearest unvisited city until all cities are visited. 
We can do a similar thing here: start at one word, and repeatedly go to the previously-unused word that maximizes overlap, until all words are used. \n", - "4. However, this strategy has a problem: at some point it is likely that *none* of the remaining unused words overlap the current word. In that case, we'll need to re-use a word; I'll try to choose one that is short and that *bridges* to one of the remaining unused words.\n", - "\n", - "\n", - "# Loading the Dictionary\n", - "\n", - "My `Dictionary` data structure contains three things:\n", - "- A set of words.\n", - "- A set of what I call *unique words*: the words that are not contained within other words. For example, \"on\" is contained within \"one\", so \"on\" would be in the set of words, but not the set of unique words.\n", - "- Above I said I wanted to find \"previously-unused word that maximizes overlap.\" To facilitate that, I'll store a mapping from an overlap to a list of the words that have that overlap as a prefix, for example `{'o': ['on', 'one', ...], ...}`" + "from collections import defaultdict, Counter" ] }, { @@ -95,29 +83,19 @@ "metadata": {}, "outputs": [], "source": [ - "from collections import defaultdict, Counter\n", - "from functools import lru_cache\n", - "\n", - "class Dictionary:\n", - " \"A collection of words, with prefixes and unique words precomputed.\"\n", - " def __init__(self, stream):\n", - " self.words = {w.strip().lower() for w in stream}\n", - " self.uniq = self.words - subwords(self.words)\n", - " self.pre = multimap((p, w) for w in self.words for p in prefixes(w))\n", - " \n", - "def subwords(words) -> set:\n", - " \"All words that are hiding inside any of these words.\"\n", - " return {sub for word in words\n", - " for i in range(len(word))\n", - " for sub in prefixes(word[i:])\n", - " if sub is not word and sub in words}\n", - "\n", - "def multimap(pairs) -> dict:\n", - " \"Given (key, val) pairs, make a dict of {key: [val,...]}.\"\n", - " result = 
defaultdict(list)\n", - " for key, val in pairs:\n", - " result[key].append(val)\n", - " return result" + "def portman(P: Proof, W: Words) -> str:\n", + " \"\"\"Compute the portmantout string S from the proof P; verify that it covers W.\"\"\"\n", + " S = []\n", + " prev_word = ''\n", + " for (overlap, word) in P:\n", + " assert word in W, f'nothing else is allowed in S: {word}'\n", + " left, right = word[:overlap], word[overlap:] # Split word into two parts\n", + " assert overlap >= 0 and left == prev_word[-overlap:], f'the words must overlap: {prev_word, word}'\n", + " S.append(right)\n", + " prev_word = word\n", + " S = ''.join(S)\n", + " assert all(w in S for w in W), 'each word in W must be a substring of S'\n", + " return S" ] }, { @@ -128,17 +106,7 @@ { "data": { "text/plain": [ - "({'on', 'one', 'two', 'won'},\n", - " {'one', 'two', 'won'},\n", - " {'w': ['won'],\n", - " 'wo': ['won'],\n", - " 'won': ['won'],\n", - " 'o': ['one', 'on'],\n", - " 'on': ['one', 'on'],\n", - " 'one': ['one'],\n", - " 't': ['two'],\n", - " 'tw': ['two'],\n", - " 'two': ['two']})" + "'eskimonogrammarianarchy'" ] }, "execution_count": 4, @@ -147,105 +115,83 @@ } ], "source": [ - "# A small example:\n", - "W = {'on', 'one', 'two', 'won'}\n", - "d = Dictionary(W)\n", - "d.words, d.uniq, dict(d.pre)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "That looks right. 
Now I can load the big dictionary, call it `D`, and explore it a bit:" + "portman(P1, W1)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "'helloworld'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "D = Dictionary(open('wordlist.asc'))" + "portman([(0, 'hell'), (1, 'low'), (1, 'world')], {'hell', 'low', 'world'})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Subwords\n", + "\n", + "I want to introduce the concept of **subwords**. The following set of words `W2` has 17 more words than `W1`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(108709,\n", - " 64389,\n", - " 249623,\n", - " ['somewhats',\n", - " 'somewhen',\n", - " 'somewise',\n", - " 'someway',\n", - " 'somewhere',\n", - " 'someways',\n", - " 'somewhat'])" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "len(D.words), len(D.uniq), len(D.pre), D.pre['somew']" + "W2 = {'anarchy', 'eskimo', 'grammarian', 'kimono', 'monogram', \n", + " 'a', 'am', 'an', 'arc', 'arch', 'aria', 'gram', 'grammar', \n", + " 'i', 'mar', 'mono', 'narc', 'no', 'on', 'ram', 'ski', 'skim'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Finding Sequences of Overlapping Words\n", - "\n", - "The function `natalie` will take a Dictionary as input and produce a list of overlapping words, e.g.\n", - "\n", - " d = Dictionary({'on', 'one', 'two', 'won'})\n", - " natalie(d) ⇒ ['two', 'won', 'on', 'one']\n", - " \n", - "and we will solve the whole problem as follows:\n", - "\n", - " portman(natalie(d)) ⇒ 'twone'\n", - " \n", - "Within `natalie`, we repeatedly add words to a `result` list until we have used up all the unique `words` from the dictionary. 
On each iteration we either add a `new_word` (thus decreasing the size of the remaining `words`), or we re-use a `repeated_word`, choosing one that will bridge to a word that we have not used yet. " + "But our old `P1` still works as a portmantout proof of `W2`, yielding the same string:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "'eskimonogrammarianarchy'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "def natalie(d: Dictionary) -> list:\n", - " \"Return a list of words that cover the dictionary.\"\n", - " words = set(d.uniq) # all the words we need to cover\n", - " result = [words.pop()] # a list of overlapping words\n", - " firsts = Counter(word[0] for word in words) # Count of first letters of words\n", - " while words:\n", - " prev = result[-1]\n", - " word = (new_word(d, prev, words) or repeated_word(d, prev, firsts))\n", - " result.append(word)\n", - " if word in words:\n", - " words.remove(word)\n", - " B = word[0]\n", - " firsts[B] -= 1\n", - " if not firsts[B]: del firsts[B]\n", - " return result" + "portman(P1, W2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Selecting the Next Word\n", + "This works because the 17 new words in `W2` are all **subwords** of the first five words. If a superword like `'monogram'` is included in the proof $P$ and thus in the string $S$, then subwords like `'on'`, `'no'`, and `'gram'` are automatically included, without having to explicitly list them in $P$. \n", "\n", - "Selecting a `new_word` is easy: consider the suffixes of the previous word, longest suffix first, and if that suffix is a prefix of an unused word, then take it." 
+ "We can compute the subwords (and from that, nonsubwords) of a set of words as follows:" ] }, { @@ -254,34 +200,45 @@ "metadata": {}, "outputs": [], "source": [ - "def new_word(d: Dictionary, prev: str, words: set) -> str or None:\n", - " \"Find an overlapping word to follow the previous word (or None).\"\n", - " for suf in suffixes(prev):\n", - " if suf in d.pre:\n", - " for word in d.pre[suf]:\n", - " if word in words:\n", - " return word" + "def subwords(W: Words) -> Words:\n", + " \"\"\"All the words in W that are subparts of some other word.\"\"\"\n", + " wordparts = (subparts(w) & W for w in W)\n", + " return set().union(*wordparts)\n", + "\n", + "def subparts(word) -> set:\n", + " \"\"\"All non-empty proper substrings of this word\"\"\"\n", + " return {word[i:j] \n", + " for i in range(len(word)) \n", + " for j in range(i + 1, len(word) + (i > 0))}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now suppose we reach a situation where the previous word was `'one'`, and the only remaining unused word is `'two'`. Since there is no overlap, `new_word` will fail, but we can find the shortest previously-used word that will `bridge` from the `'e'` at the end of `'one'` to the `'t'` at the start of `'two'`:" + "(*Python trivia:* the `(i > 0)` in `subparts` means that for a four-letter word like `'skim'`, we include subparts `word[i:4]` except when `i == 0`, thus including `'kim'`, `'im'`, and `'m'`, but not `'skim'`. 
In Python it is considered good style to have a Boolean expression like `(i > 0)` automatically converted to an integer `0` or `1`.)\n", + "\n", + "(*English trivia:* I use the clumsy term \"nonsubwords\" rather than \"superwords\", because there are a couple dozen words, like \"cozy\" and \"pugs\" and \"july,\" that are not subwords of any other words but are also not superwords: they have no subwords.)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'i', 'im', 'k', 'ki', 'kim', 'm', 's', 'sk', 'ski'}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "@lru_cache(1000)\n", - "def bridge(d, A, B) -> str:\n", - " \"Find a word that bridges A to B: it starts with A and ends with B.\"\n", - " return shortest(word for word in d.pre[A] if word.endswith(B))\n", - "\n", - "def shortest(items): return min(items, key=len, default=None)" + "subparts('skim') # The subparts (non-empty proper substrings) of a word" ] }, { @@ -292,7 +249,7 @@ { "data": { "text/plain": [ - "'eat'" + "{'anarchy', 'eskimo', 'grammarian', 'kimono', 'monogram'}" ] }, "execution_count": 10, @@ -301,71 +258,69 @@ } ], "source": [ - "bridge(D, 'e', 't')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "But in general, there will be several unusued words that we might bridge to; we will greedily choose the shortest bridge to any of the unused words. That sounds expensive when there are thousands of unused words, so we will summarize them with a counter, `firsts`, that gives for each letter the number of unused words that start with that letter (so, if the unused words are `{'two', 'three', 'four'}`, then `firsts` will be `{'t': 2, 'f': 1}`).\n", - "\n", - "Furthermore, it might be that *no* word can bridge directly to an unused word. 
In that case we can take two steps, first bridging from `A` to an intermediate letter `L`, and then bridging from `L` to a letter `B` that starts some unused word. The function `repeated_word` handles all these cases:" + "W2 - subwords(W2) # These nonsubwords must be in the proof" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'a',\n", + " 'am',\n", + " 'an',\n", + " 'arc',\n", + " 'arch',\n", + " 'aria',\n", + " 'gram',\n", + " 'grammar',\n", + " 'i',\n", + " 'mar',\n", + " 'mono',\n", + " 'narc',\n", + " 'no',\n", + " 'on',\n", + " 'ram',\n", + " 'ski',\n", + " 'skim'}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "def repeated_word(d: Dictionary, prev: str, firsts: Counter) -> str:\n", - " \"Find a previously-used word that will bridge to one of the letters in firsts.\"\n", - " A = prev[-1]\n", - " word = shortest(bridge(d, A, B) for B in firsts if bridge(d, A, B))\n", - " if word:\n", - " return word\n", - " else:\n", - " candidates = [[bridge(d, A, L), bridge(d, L, B)] \n", - " for L in alphabet for B in firsts\n", - " if A != L != B and bridge(d, A, L) and bridge(d, L, B)]\n", - " return min(candidates, key=lambda seq: sum(map(len, seq)))[0]\n", - "\n", - "alphabet = 'abcdefghijklmnopqrstuvwxyz'" + "subwords(W2) # These subwords don't need to appear in a proof" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "For example, suppose the previous word is `'one'`, and the only remaining unused word is `'quit'`. There is no word that bridges from `'e'` to `'q'`. So we will have to get there in two steps. 
Here's the first step:" + "Now that we have the notion of subwords, we can modify `portman` to be a bit more efficient and concise: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'elhi'" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "repeated_word(D, 'one', {'q': 1})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And here's the second:" + "def portman(P: Proof, W: Words) -> str:\n", + " \"\"\"Compute the portmantout string S from the proof P; verify that it covers W.\"\"\"\n", + " assert (W - subwords(W)) <= set(w for _, w in P) <= W, \"all the words in W and nothing else\"\n", + " S = []\n", + " prev_word = ''\n", + " for (overlap, word) in P:\n", + " left, right = word[:overlap], word[overlap:] # Split word into two parts\n", + " assert overlap >= 0 and left == prev_word[-overlap:], f'the words must overlap: {prev_word, word}'\n", + " S.append(right)\n", + " prev_word = word\n", + " return ''.join(S)" ] }, { @@ -376,7 +331,7 @@ { "data": { "text/plain": [ - "'iraq'" + "'eskimonogrammarianarchy'" ] }, "execution_count": 13, @@ -385,63 +340,41 @@ } ], "source": [ - "repeated_word(D, 'elhi', {'q': 1})" + "portman(P1, W2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# A Solution\n", + "(*Python trivia:* if `X, Y` and `Z` are sets, `X <= Y <= Z` means \"is `X` a subset of `Y` and `Y` a subset of `Z`?\" We use the notation here to say that the set of words in $P$ must contain all the nonsubwords and can only contain words from $W$.)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Set of 108,709 Words\n", "\n", - "We're ready to solve the problem!" 
+ "I will fetch the set of words that Tom Murphy used, and explore it a bit:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 23.2 s, sys: 50.7 ms, total: 23.2 s\n", - "Wall time: 23.3 s\n" - ] - } - ], + "outputs": [], "source": [ - "%%time \n", - "L = natalie(D)\n", - "S = portman(L)" + "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(101796, 64389, 575150)" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "len(L), len(D.uniq), len(S)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here we see the start of the list:" + "W = set(open('wordlist.asc').read().split())" ] }, { @@ -452,26 +385,7 @@ { "data": { "text/plain": [ - "['hyperactivity',\n", - " 'typewriting',\n", - " 'tingling',\n", - " 'lingeries',\n", - " 'escarping',\n", - " 'carpings',\n", - " 'stropped',\n", - " 'peddles',\n", - " 'lesbians',\n", - " 'answers',\n", - " 'ersatzes',\n", - " 'zestiest',\n", - " 'establisher',\n", - " 'sherlocks',\n", - " 'locksmiths',\n", - " 'swankily',\n", - " 'lyricizing',\n", - " 'zingers',\n", - " 'erstwhile',\n", - " 'whiled']" + "'W has 108,709 words (44,320 subwords and 64,389 nonsubwords)'" ] }, "execution_count": 16, @@ -480,14 +394,10 @@ } ], "source": [ - "L[:20]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And the start of the string:" + "N = len(W)\n", + "sub = len(subwords(W))\n", + "\n", + "f'W has {N:,d} words ({sub:,d} subwords and {N-sub:,d} nonsubwords)'" ] }, { @@ -498,7 +408,7 @@ { "data": { "text/plain": [ - 
"'hyperactivitypewritinglingeriescarpingstroppeddlesbianswersatzestiestablisherlocksmithswankilyricizingerstwhileditorializingedifiestashedableachesapeakednessayistsunamicabilitieschewalsocializedshopliftsarsaparillasersheiksignalmanacsimonizedelweissestinescapablymphomassacrestingstowagesturalliescudosserscoundrellyinglycosideswipingrassessorshipmateshipwaysidestepsonshipshapelessnessentiallylstreptococcuspidalliersnorterservilitieschewinglessonedictallyhoedownstairsicknessesamesospheredityranniseismologyratessellatediouslyesteryearshotspursierrantrystsarinasmuchnessayeditorialistlesslyerbassoonistsarismsectarianismelteriescargotsardomsilhouettinglesseesaweddingsulfurizedematadorsallyingrowthsplurgesturingtailskidskinsmendeleviumbrageoushereditarilynxesculentsunamissilerysipelasticizeducationshorelinesmandragoraphobiaxialitympanumskullsleighingelessonslaughtsktskingedgewaysinewspeakshepherdingingivaerologistsaritzassagaislesionstagehandsomesthesisesquicentenniallymphocytestifyingestantalizationiumsketchingsubfamiliescapistsetsessilencesariannulightingstungstensilesiamesestinassimilableakishkestrelsewhereforestedhorseshoestringsideswipestilencesspitsawtoothpickstretchablenchedarwinistsimmeshinglersophsuperlativelysianticipatoryxescarpmentsktskedgesturescuestashingleditorializestsarevnasturtiumsavagedlyceesquiringboltsaristshamrockshenaniganswerabilitypographerselfedscoopsfullerythematologicalumnymphsilkweedinessentialleliciteddyingscreechymicsouvenirstargazersalutessellationspeedwaysquintingeingeniousnessayersequesterselysiumteenthronementsalivateddiesesquicentennialslumberserksadistsubclassessablearingdovestingsaguarosebaywoodsiestashesitatediousnessayingsluicewaywardlyristsumpsychrotherapiestradiologistshowingspansyllabicspikeletspoonerismswazilandownershipsidetrackingshipshottingliestablishmentsolarismsturdynamistickinessayscarrierspearmenianschlussreineducabilitypifyinguinalterablyricizeducativehiclessoningrainingloriouslynessesquipedalianestheticsextsoddynamism
se'" + "True" ] }, "execution_count": 17, @@ -507,14 +417,7 @@ } ], "source": [ - "S[:2000]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you want to see the whole string, I'll write it to the file [natalie.txt](natalie.txt)." + "'eskimo' in W" ] }, { @@ -525,7 +428,7 @@ { "data": { "text/plain": [ - "575150" + "False" ] }, "execution_count": 18, @@ -533,6 +436,1116 @@ "output_type": "execute_result" } ], + "source": [ + "'waldo' in W # Where's Waldo? Not in W" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'antidisestablishmentarianism'" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "max(W, key=len)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Counter({1: 2,\n", + " 2: 25,\n", + " 3: 500,\n", + " 4: 2920,\n", + " 5: 6826,\n", + " 6: 11454,\n", + " 7: 16852,\n", + " 8: 19445,\n", + " 9: 16684,\n", + " 10: 11876,\n", + " 11: 8372,\n", + " 12: 5811,\n", + " 13: 3676,\n", + " 14: 2101,\n", + " 15: 1159,\n", + " 16: 583,\n", + " 17: 229,\n", + " 18: 107,\n", + " 19: 39,\n", + " 20: 29,\n", + " 21: 11,\n", + " 22: 4,\n", + " 23: 2,\n", + " 25: 1,\n", + " 28: 1})" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Counter(sorted(map(len, W))) # word lengths" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('s', 12075),\n", + " ('c', 10287),\n", + " ('p', 8414),\n", + " ('r', 6772),\n", + " ('d', 6673),\n", + " ('a', 6333),\n", + " ('b', 6205),\n", + " ('m', 5757),\n", + " ('t', 5490),\n", + " ('f', 4691),\n", + " ('e', 4460),\n", + " ('i', 4354),\n", + " ('h', 3884),\n", + " ('g', 3576),\n", + " ('l', 3331),\n", + " ('u', 3302),\n", + " ('o', 2935),\n", + " ('w', 
2697),\n", + " ('n', 2453),\n", + " ('v', 1799),\n", + " ('j', 1039),\n", + " ('k', 960),\n", + " ('q', 559),\n", + " ('y', 338),\n", + " ('z', 260),\n", + " ('x', 65)]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Counter(w[0] for w in W).most_common() # first letters" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('s', 35647),\n", + " ('d', 11139),\n", + " ('e', 10701),\n", + " ('y', 10071),\n", + " ('g', 9125),\n", + " ('r', 7643),\n", + " ('t', 5990),\n", + " ('n', 5261),\n", + " ('l', 3314),\n", + " ('c', 1819),\n", + " ('m', 1600),\n", + " ('a', 1398),\n", + " ('h', 1268),\n", + " ('k', 920),\n", + " ('p', 699),\n", + " ('o', 665),\n", + " ('i', 343),\n", + " ('w', 306),\n", + " ('x', 243),\n", + " ('f', 240),\n", + " ('b', 158),\n", + " ('u', 88),\n", + " ('z', 51),\n", + " ('v', 15),\n", + " ('j', 3),\n", + " ('q', 2)]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Counter(w[-1] for w in W).most_common() # last letters" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('e', 108278),\n", + " ('s', 83059),\n", + " ('i', 81747),\n", + " ('a', 70762),\n", + " ('r', 68864),\n", + " ('n', 64756),\n", + " ('t', 61999),\n", + " ('o', 57205),\n", + " ('l', 50145),\n", + " ('c', 37734),\n", + " ('d', 34192),\n", + " ('u', 31126),\n", + " ('g', 26858),\n", + " ('p', 26294),\n", + " ('m', 25538),\n", + " ('h', 20654),\n", + " ('b', 18169),\n", + " ('y', 15752),\n", + " ('f', 12757),\n", + " ('v', 9670),\n", + " ('k', 8109),\n", + " ('w', 7971),\n", + " ('z', 4083),\n", + " ('x', 2666),\n", + " ('j', 1746),\n", + " ('q', 1689)]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Counter(L for w in W for L in 
w).most_common() # all letters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# natalie\n", + "\n", + "The function `natalie` generates a portmantout proof for a set of words. The rough outline is:\n", + "\n", + " def natalie(W: Words) -> Proof:\n", + " precompute some data structures to make things more efficient\n", + " P = a proof, initially with just the first word, that we will build up\n", + " while there are nonsubwords that have not been used:\n", + " for (overlap, word) in (unused or bridging words that overlap):\n", + " append (overlap, word) to P and update data structures\n", + " return P\n", + " \n", + "There are two choices of how to pick a word to add to `P`:\n", + "- The function `unused_word` finds a word that has not been used yet that has a maximal overlap with the previous word. If there is such a word, we will always use it, and never consider reverting that choice. That's called a **greedy** approach, and it typically leads to solutions that are not optimal (the resulting $S$ is not the shortest possible) but are computationally feasible. It seems like finding a shortest $S$ is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find an optimal solution in a reasonable amount of time. So I'm happy with the greedy, suboptimal approach.\n", + "- The function `bridging_words` is called only when there is no way to add an unused word to the previous word. `bridging_words` returns a one- or two-word sequence (called a **bridge**) that will bring us to a place where we can again consume an unused word on the following iteration. \n", + "\n", + "`natalie` keeps track of the following data structures (which we will explain in more detail below):\n", + "- `unused: Words`: a set of the unused nonsubwords in $W$. When a word is added to $P$, it is removed from `unused`.\n", + "- `P: Proof`: e.g. 
`[(0, 'eskimo'), (4, 'kimono'),...]`, the proof that we are building up.\n", + "- `startswith: dict`: e.g. `startswith['kimo'] = ['kimono',...]` is a list of words that start with `'kimo'`. \n", + "- `firsts: Counter`: e.g. `firsts['a'] == 3528` is the number of unused words that start with the letter `a`.\n", + "- `bridges: dict`: e.g. `bridges['a' + 'q'] == ('airaq', [(1, 'air'), (2, 'iraq')])`, a description of a way to bridge from a word that ends in `'a'` to one that begins with `'q'`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ + "def natalie(W: Words, start=None) -> Proof:\n", + " \"\"\"Return a portmantout Proof for the set of words W (or [] on failure).\"\"\"\n", + " prev_word = start or next(iter(W))\n", + " unused = W - (subwords(W) | {prev_word})\n", + " P: Proof = [(0, prev_word)] # The emerging Proof \n", + " startswith = compute_startswith(unused) # startswith['th'] = words that start with 'th'\n", + " firsts = Counter(word[0] for word in unused) # Count of first letters of words\n", + " bridges = compute_bridges(W) # Words that bridge from 'a' to 'b'\n", + " while unused:\n", + " for (overlap, word) in (unused_word(prev_word, startswith, unused) or\n", + " bridging_words(prev_word, firsts, bridges)):\n", + " if word not in W:\n", + " return [] # Fail\n", + " P.append((overlap, word))\n", + " if word in unused:\n", + " unused.remove(word)\n", + " firsts -= Counter(word[0])\n", + " prev_word = word\n", + " return P" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "(*Python trivia:* I say `firsts -= Counter(word[0])` rather than `firsts[word[0]] -= 1` because when the count for a letter reaches zero, the former deletes the letter from the Counter, whereas the latter just sets it to zero.)\n", + "\n", + "How do we know that `unused` will eventually be empty, so that the `while unused` loop can terminate? 
On each iteration we either use `unused_word`, which reduces the size of `unused`, or we use `bridging_words`, which doesn't. But after `bridging_words` adds the bridging word(s), we are guaranteed to be able to use an `unused` word on the next iteration (at least for the word set $W$; for smaller word sets `bridging_words` might return a non-word, which causes `natalie` to return the empty proof, indicating failure).\n",
+    "\n",
+    "The most important parts of `natalie` are the two functions `unused_word` and `bridging_words`, which decide what words will be added next to the emerging proof. Both return a list of the form `[(overlap, word),...]` that defines a word or sequence of words to add to the proof, along with the number of letters by which each word overlaps the previous word. We will discuss the two functions in turn."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# unused_word and compute_startswith\n",
+    "\n",
+    "To select an `unused_word`, consider the suffixes of the previous word, longest suffix first, and if a suffix is a prefix of an unused word, then take that word. `unused_word` returns either a list of length one, `[(overlap, word)]`, or the empty list, `[]`, if no overlapping word can be found. 
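\n",
+    "\n",
+    "For example (a sketch; which word is taken also depends on what is still unused), with previous word `'eskimo'` the suffixes are tried longest first:\n",
+    "\n",
+    "    suffixes('eskimo') == ['skimo', 'kimo', 'imo', 'mo', 'o']\n",
+    "\n",
+    "and `'kimo'` is the first suffix that is a prefix of some unused word, such as `'kimono'`. 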
" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "def unused_word(prev_word: str, startswith: dict, unused: Words) -> [(int, str),...]:\n", + " \"\"\"Return a [(overlap, word)] pair to follow `prev_word`, or [].\"\"\"\n", + " return next(([(len(suf), word)]\n", + " for suf in suffixes(prev_word) if suf in startswith\n", + " for word in startswith[suf] if word in unused), [])\n", + "\n", + "def suffixes(word) -> list:\n", + " \"\"\"All non-empty proper suffixes of word, longest first.\"\"\"\n", + " return [word[i:] for i in range(1, len(word))]\n", + "\n", + "def prefixes(word) -> list:\n", + " \"\"\"All non-empty proper prefixes of word, shortest first.\"\"\"\n", + " return [word[:i] for i in range(1, len(word))]\n", + "\n", + "def multimap(pairs) -> dict:\n", + " \"\"\"Given (key, val) pairs, make a dict of {key: [val,...]}.\"\"\"\n", + " result = defaultdict(list)\n", + " for key, val in pairs:\n", + " result[key].append(val)\n", + " return result\n", + "\n", + "def compute_startswith(W) -> dict: \n", + " \"\"\"A mapping of a prefix to a list of the words that start with it.\"\"\"\n", + " return multimap((pre, w) for w in W for pre in prefixes(w))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " So when the previous word is `'eskimo'`, a call to `unused_word` finds a word with a four-letter overlap; no other word overlaps more:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(4, 'kimono')]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "startswith = compute_startswith(W)\n", + "\n", + "unused_word('eskimo', startswith, W)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some examples of usage:" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + 
"data": { + "text/plain": [ + "{'eskimo': [(4, 'kimono')],\n", + " 'kimono': [(4, 'monomers')],\n", + " 'shoehorns': [(5, 'hornswoggling')],\n", + " 'elephant': [(5, 'phantasm')],\n", + " 'phantasies': [(4, 'siesta')],\n", + " 'dachshund': [(4, 'hundredfold')],\n", + " 'vicars': [(4, 'carsick')],\n", + " 'flimsiest': [(5, 'siesta')],\n", + " 'siesta': [(4, 'establisher')],\n", + " 'dilettantism': [(6, 'antismog')],\n", + " 'antismog': [(4, 'smoggier')],\n", + " 'seascape': [(5, 'scapes')],\n", + " 'snark': [(4, 'narks')],\n", + " 'referendum': [(3, 'dumpcart')]}" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{w: unused_word(w, startswith, W) \n", + " for w in '''eskimo kimono shoehorns elephant phantasies dachshund vicars \n", + " flimsiest siesta dilettantism antismog seascape snark referendum'''.split()}" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['kimono', 'kimonos', 'kimonoed']" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "startswith['kimo']" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['imono', 'mono', 'ono', 'no', 'o']" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "suffixes('kimono')" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['k', 'ki', 'kim', 'kimo', 'kimon']" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prefixes('kimono')" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'uf': ['ufos', 'ufo'],\n", + " 'gj': ['gjetost', 'gjetosts'],\n", + " 'oj': 
['ojibwas', 'ojibwa'],\n",
+       " 'ry': ['rye'],\n",
+       " 'pf': ['pfennig', 'pfennigs'],\n",
+       " 'yc': ['ycleped', 'yclept'],\n",
+       " 'zl': ['zloty', 'zlotys'],\n",
+       " 'mc': ['mcdonald'],\n",
+       " 'ez': ['ezekiel'],\n",
+       " 'fj': ['fjord', 'fjords'],\n",
+       " 'tc': ['tchaikovsky'],\n",
+       " 'xm': ['xmases', 'xmas'],\n",
+       " 'ie': ['ieee'],\n",
+       " 'dn': ['dnieper'],\n",
+       " 'ud': ['udder', 'udders'],\n",
+       " 'sf': ['sforzato', 'sforzatos'],\n",
+       " 'aj': ['ajar'],\n",
+       " 'ym': ['ymca'],\n",
+       " 'vy': ['vyingly', 'vying'],\n",
+       " 'qo': ['qophs', 'qoph'],\n",
+       " 'hd': ['hdqrs'],\n",
+       " 'zw': ['zwieback', 'zwiebacks'],\n",
+       " 'jn': ['jnana', 'jnanas'],\n",
+       " 'dv': ['dvorak'],\n",
+       " 'bw': ['bwana', 'bwanas'],\n",
+       " 'fb': ['fbi'],\n",
+       " 'ct': ['ctrl']}"
+      ]
+     },
+     "execution_count": 31,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "{pre: startswith[pre] # Rare two-letter prefixes\n",
+    " for pre in startswith if len(pre) == 2 and len(startswith[pre]) <= 2}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# bridging_words and compute_bridges\n",
+    "\n",
+    "Suppose we reach a situation where the previous word was `'one'`, and the only remaining unused words are `'two'`, `'three'`, and `'six'`. Since there is no possible overlap, `unused_word` will return the empty list:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[]"
+      ]
+     },
+     "execution_count": 32,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "unused_word('one', startswith, {'two', 'three', 'six'})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "However, we can still hope to find a previously used word that will **bridge** from the `'e'` at the end of `'one'` to the `'t'` or `'s'` at the start of one of the unused words. 
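\n",
+    "\n",
+    "For example (a sketch of the kind of word we hope to find), the previously seen word `'eat'` starts with `'e'` and ends with `'t'`:\n",
+    "\n",
+    "    (1, 'eat')   # 'one' + 'eat' overlap by one letter, giving '...oneat', which ends in 't'\n",
+    "\n",
+    "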
The function `compute_bridges` precomputes a table that we call `bridges`, and the function `bridging_words` finds the appropriate bridging word(s) in that table.\n",
+    "\n",
+    "The `bridging_words` function is simple: in this example we fetch `bridges['e' + 't']` and `bridges['e' + 's']`, decide which gives us the shortest bridge, and return the list `[(overlap, word),...]` that makes up that bridge. For `unused_word` the list was always of length zero or one; for `bridging_words` the list is always of length one or two."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To compute the bridges is a bit of work, but we only have to do it once. We want a table of `{A + B: (bridge, [(overlap, word),...])}` where `A` and `B` are the letters we want to bridge between, and `bridge` is either a single word or a portmanteau of two words. \n",
+    "\n",
+    "We start by selecting from $W$ all the short words, as well as words that end in an unusual letter. (For our 108,709 word set $W$, we selected the 10,273 words with length up to 5, plus 159 words that end in any of 'qujvz', the five rarest letters. For other word sets, you may have to tune these parameters.) We consider each of these individual words as a possible bridge between its first and last letters, keeping only the shortest. The following means that `'eat'` is a shortest possible word that starts with `'e'` and ends in `'t'`:\n",
+    "\n",
+    "    bridges['e' + 't'] == ('eat', [(1, 'eat')])\n",
+    "\n",
+    "Sometimes we can't bridge with one word: no word starts with `'a'` and ends with `'q'`. 
Thus, we also consider two-word bridges:\n",
+    "\n",
+    "    bridges['a' + 'q'] == ('airaq', [(1, 'air'), (2, 'iraq')])\n",
+    "\n",
+    "which means that the shortest possible portmanteau that starts with `'a'` and ends in `'q'` is `'airaq'`, and the first word in that portmanteau is `'air'` and the second `'iraq'`.\n",
+    "\n",
+    "Here's the code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def bridging_words(prev: str, firsts: Counter, bridges: dict) -> [(int, str),...]:\n",
+    "    \"\"\"Find a previously used word that will bridge to one of the letters in firsts.\"\"\"\n",
+    "    A = prev[-1] # Ending letter of previous word\n",
+    "    (bridge, pairs) = min((bridges[A + B] for B in firsts), key=bridgelen)\n",
+    "    return pairs\n",
+    "\n",
+    "def bridgelen(bridge) -> int: return len(bridge[0])\n",
+    "\n",
+    "def compute_bridges(W: Words, maxlen=5, endings=tuple('qujvz')) -> dict:\n",
+    "    \"\"\"A table of {A + B: (bridge, [(overlap, word),...])} entries that bridge letter A to letter B, \n",
+    "    e.g. {'e'+'t': ('eat', [(1, 'eat')]), 'a'+'q': ('airaq', [(1, 'air'), (2, 'iraq')])}.\"\"\"\n",
+    "    long = '?' 
* 29 # A default long \"word\"\n", + " bridges = {A + B: (long, [(1, long)]) for A in alphabet for B in alphabet}\n", + " shortwords = [w for w in W if len(w) <= maxlen or w.endswith(endings)]\n", + " startswith = compute_startswith(shortwords)\n", + " \n", + " def consider(bridge, *pairs):\n", + " \"\"\"Use this bridge if it is shorter than the previous bridges[AB].\"\"\"\n", + " AB = bridge[0] + bridge[-1]\n", + " bridges[AB] = min(bridges[AB], (bridge, [*pairs]), key=bridgelen)\n", + " \n", + " for w1 in shortwords:\n", + " consider(w1, (1, w1)) # One-word bridges\n", + " for suf in suffixes(w1): \n", + " for w2 in startswith[suf]:\n", + " bridge = w1 + w2[len(suf):]\n", + " consider(bridge, (1, w1), (len(suf), w2)) # Two-word bridges\n", + " return bridges\n", + "\n", + "alphabet = 'abcdefghijklmnopqrstuvwxyz'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is how `bridging_words` works in our example situation. First the necessary data structures `firsts` and `bridges`:" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 7.17 s, sys: 75 ms, total: 7.24 s\n", + "Wall time: 7.36 s\n" + ] + } + ], + "source": [ + "firsts = Counter(t=2, s=2) # Counts of the first letters of unused words in this situation\n", + "\n", + "%time bridges = compute_bridges(W)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(1, 'eat')]" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bridging_words('one', firsts, bridges)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That says that we should add the word `'eat'`, which overlaps one letter with `one`. After adding `'eat'`, we'll be set up to add one of the unused words on the next step. 
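\n",
+    "\n",
+    "For instance, one hypothetical continuation (whether `'two'` or `'three'` comes next depends on iteration order) would extend the proof so that $S$ contains `'...oneatwo...'`:\n",
+    "\n",
+    "    P.extend([(1, 'eat'), (1, 'two')])   # 'one' + 'eat' + 'two' overlap to 'oneatwo'\n",
+    "\n",
+    "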
Here's some of the calculations that got us to `'eat'`:" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('eat', [(1, 'eat')])" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bridges['e' + 't']" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('ells', [(1, 'ells')])" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bridges['e' + 's']" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('eat', [(1, 'eat')])" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "A = 'e'\n", + "min((bridges[A + B] for B in firsts), key=bridgelen)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some more examples of bridges:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('airaq', [(1, 'air'), (2, 'iraq')])" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bridges['a' + 'q']" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('you', [(1, 'you')])" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bridges['y' + 'u']" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('xericolloq', [(1, 'xeric'), (1, 'colloq')])" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "max(bridges.values(), key=bridgelen)" + ] + }, + 
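{
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that `bridgelen` measures the length of the whole bridge string, including the letters at each end that overlap the neighboring words, so a bridge of length $n$ typically adds only about $n - 2$ new letters to $S$. A sketch:\n",
+    "\n",
+    "    bridgelen(('eat', [(1, 'eat')])) == 3   # but 'eat' contributes just one new letter, the 'a'\n",
+    "\n",
+    "With that in mind, here is the distribution of bridge lengths:"
+   ]
+  },
+  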
{ + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(3, 261),\n", + " (4, 256),\n", + " (5, 69),\n", + " (6, 34),\n", + " (2, 25),\n", + " (7, 21),\n", + " (8, 7),\n", + " (1, 2),\n", + " (10, 1)]" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "Counter(map(bridgelen, bridges.values())).most_common() # How long are bridges?" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'bq': ('bocciraq', [(1, 'bocci'), (1, 'iraq')]),\n", + " 'gq': ('geniiraq', [(1, 'genii'), (1, 'iraq')]),\n", + " 'jq': ('jinniraq', [(1, 'jinni'), (1, 'iraq')]),\n", + " 'oq': ('obeliraq', [(1, 'obeli'), (1, 'iraq')]),\n", + " 'qq': ('quasiraq', [(1, 'quasi'), (1, 'iraq')]),\n", + " 'vj': ('vetchajj', [(1, 'vetch'), (1, 'hajj')]),\n", + " 'xj': ('xericonj', [(1, 'xeric'), (1, 'conj')]),\n", + " 'xq': ('xericolloq', [(1, 'xeric'), (1, 'colloq')])}" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{AB: bridges[AB] for AB in bridges if bridgelen(bridges[AB]) >= 8} # Longest bridges" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Note 1:* My `compute_bridges` only does one-letter-to-one-letter bridges. It just seemed simpler to only fill in a 26×26 `bridges` table, and only maintain 26 entries in `firsts`. But that means sometimes we get a long bridge when we could have found a shorter one with more work. Here's an example where the previous word is `cogito` and there's only one word left, `'question'`, which starts with `'q'`. 
We end up with the bridge `'obeliraq'`, adding 6 letters between the `'o'` and `'q'`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[(1, 'obeli'), (1, 'iraq')]"
+      ]
+     },
+     "execution_count": 44,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bridging_words('cogito', {'q': 1}, bridges)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "But if we had arranged for multi-letter-to-multi-letter bridges, we could have noticed that the bridge `'toque'` adds *zero* letters between the `'to'` at the end of `'cogito'` and the `'que'` at the start of `'question'`. How could I find `'toque'` in this situation? One approach would be to precompute a `bridges` table with up to two letters on the left and three on the right. That would mean about $26^5 = 12$ million table entries (although most of them would be empty). Another approach would be to call `unused_word`, but give it a selection of words that have been used before, and then check whether it ends up in a place that allows us to make progress on the next move. On the one hand, this seems like a promising idea to explore. On the other hand, I did a quick measurement, and the average number of letters added per bridge under the current algorithm is just about 1, so there's not that much room to improve. Maybe this approach could reduce the number of letters in $S$ by 1%."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Note 2:* I should say that I started watching Tom Murphy's highly entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI), but then I paused it because I wanted the fun of solving the problem mostly on my own. 
After I finished the first version of my program, I returned to the video and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that \n", + "Murphy had an approach to bridges that was very similar to mine, but in one way clearly superior: In my original program I only considered two-word bridges when there was no one-word bridge, on the theory that one word is shorter than two. My program looked something like:\n", + "\n", + " (overlap, word) = unused_word(...) or one_word_bridge(...) or two_word_bridge(...)\n", + " \n", + "But Murphy showed that my theory was wrong: I had `bridges['w' + 'c'] = 'workaholic'`, but he had `'warc'`, a portmanteau of `'war'` and `'arc'`, which saves six letters over my single word. After seeing this, I shamelessly copied his approach, and now I too get a four-letter bridge for `'w' + 'c'` (sometimes `'warc'` and sometimes `'wet' + 'etc' = 'wetc'`), and my portmantout strings are about 0.5% shorter." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('wetc', [(1, 'wet'), (2, 'etc')])" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bridges['w' + 'c']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(*English trivia:* my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a **bridge**. Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. The different metaphors for what our programs are doing lead to different terminology for the same idea.)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Solution\n", + "\n", + "We're finally ready to solve problems! 
First the tiny `W2` word set, for which we must specify the starting word, or it will fail:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(0, 'eskimo'),\n", + " (4, 'kimono'),\n", + " (4, 'monogram'),\n", + " (4, 'grammarian'),\n", + " (2, 'anarchy')]" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "natalie(W2, start='eskimo')" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'eskimonogrammarianarchy'" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "portman(natalie(W2, start='eskimo'), W2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now the big `W` word set, which works fine when it selects a starting word on its own, but, following Tom Murphy and doing him one letter better, I will supply the starting word `portmanteaux`:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 24.4 s, sys: 89.4 ms, total: 24.5 s\n", + "Wall time: 24.6 s\n" + ] + }, + { + "data": { + "text/plain": [ + "570002" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "P = natalie(W, start='portmanteaux')\n", + "S = portman(P, W)\n", + "len(S)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A little over half a million letters long. 
Here is the start of the string $S$:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'portmanteauxiliariespritselfwardressestetsonshipsideswipingrassessorshipkeepergnestlikeneditorializationshorelinesmantillascarsonickeledematouslessoneditorializestfullynchessmanganousefulnessesquicentenniallylstonedifiersatzestygiantismshirtwaistlinesmenadsorptionosphericitywardenshipbuilderstwhiledgaroteddyingsinewsstandstilliestablisheriffdomicilingamsterdamnitrosemariestablishmentsaritzaspersersafaristocraciestimatorsionallyratelysianthologizedswampinessentiallyonnaisecularizershankingwoodwax'" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "S[:500]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is easier to grasp by looking at the first 50 items in $P$:" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(0, 'portmanteaux'),\n", + " (3, 'auxiliaries'),\n", + " (2, 'esprits'),\n", + " (3, 'itself'),\n", + " (4, 'selfward'),\n", + " (4, 'wardresses'),\n", + " (3, 'sestets'),\n", + " (5, 'stetsons'),\n", + " (4, 'sonships'),\n", + " (5, 'shipside'),\n", + " (4, 'sideswiping'),\n", + " (4, 'pingrasses'),\n", + " (5, 'assessorship'),\n", + " (4, 'shipkeeper'),\n", + " (4, 'epergnes'),\n", + " (3, 'nestlike'),\n", + " (4, 'likened'),\n", + " (2, 'editorializations'),\n", + " (3, 'onshore'),\n", + " (5, 'shorelines'),\n", + " (5, 'linesman'),\n", + " (3, 'mantillas'),\n", + " (3, 'lascars'),\n", + " (4, 'carson'),\n", + " (5, 'arsonic'),\n", + " (3, 'nickeled'),\n", + " (2, 'edematous'),\n", + " (4, 'tousles'),\n", + " (3, 'lessoned'),\n", + " (2, 'editorializes'),\n", + " (3, 'zestfully'),\n", + " (2, 'lynches'),\n", + " (4, 'chessman'),\n", + " (3, 'manganous'),\n", + " (2, 'usefulness'),\n", + " (7, 'fulnesses'),\n", + " (3, 
'sesquicentennially'),\n", + " (4, 'allyls'),\n", + " (1, 'stoned'),\n", + " (2, 'edifiers'),\n", + " (3, 'ersatzes'),\n", + " (3, 'zesty'),\n", + " (3, 'stygian'),\n", + " (4, 'giantisms'),\n", + " (1, 'shirtwaist'),\n", + " (5, 'waistlines'),\n", + " (5, 'linesmen'),\n", + " (3, 'menads'),\n", + " (3, 'adsorption'),\n", + " (3, 'ionospheric')]" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "P[:50]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I'll write the whole string to the file [natalie.txt](natalie.txt):" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "570002" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "open('natalie.txt', 'w').write(S)" ] @@ -541,75 +1554,92 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Next Steps\n", - "\n", - "Each time I run this, I get a slightly different result (there are no `random` calls, but each Python re-start gets a random hash seed, which results in a different iteration order over dicts and sets). I had originally planned to add some randomness and take the best of *k* trials (like I did in my [TSP](TSP.ipynb) notebook), but all the trials I have seen so far fall within a narrow range of 575,000 to 581,000 letters, so I think it is not worth the effort for what would probably be a 1% improvement at best. My string is 6% shorter than Murphy's solution with 611,820 letters. \n", - "\n", - "I'll stop here (but you should feel free to do some experimentation of your own). Some ideas:\n", - "\n", - "- One weakness of my approach is that it can get stuck. With `W = {'one', 'two'}`, if my algorithm chooses `'one'` as the starting word, it will fail. 
You could fix that by allowing new words to be added to the start of the list (hint: use a `dequeue`) as well as the end.\n", - "- Following my [TSP](TSP.ipynb) notebook, an alternative to maintaining a single list and adding the maximum-overlap word is to maintain multiple lists, and on each iteration merge the lists with maximum overlap.\n", - "- With the 108,709 it is always possible to bridge from any letter to any other letter in at most two steps. But for smaller dictionaries, that might not be the case. You could consider that.\n", - "- To minimize the length of S, we can do three things: get better overlap between the unique words, use fewer/shorter bridging words, or get better overlap of the bridging words. Can you think of strategies to achieve any of these?\n" + "We can create a report on how we did:" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "def report(S, W, P):\n", + " sub = subwords(W)\n", + " nonsub = W - sub\n", + " bridge = len(P) - len(nonsub)\n", + " def Len(words): return sum(map(len, words))\n", + " print(f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords; {len(sub):,d} subwords).')\n", + " print(f'P has {len(P):,d} words ({len(nonsub):,d} nonsubwords; {bridge:,d} bridge words).')\n", + " print(f'S has {len(S):,d} letters; W has {Len(W):,d}; nonsubs have {Len(nonsub):,d}.')\n", + " print(f'The average overlap in P is {(Len(w for _,w in P)-len(S))/(len(P)-1):.2f} letters.')\n", + " print(f'The compression ratio (W/S) is {Len(W)/len(S):.2f}.')" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "W has 108,709 words (64,389 nonsubwords; 44,320 subwords).\n", + "P has 105,009 words (64,389 nonsubwords; 40,620 bridge words).\n", + "S has 570,002 letters; W has 931,823; nonsubs have 595,805.\n", + "The average overlap in P is 1.34 letters.\n", + "The compression ratio 
(W/S) is 1.63.\n" + ] + } + ], + "source": [ + "report(S, W, P)" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "W has 22 words (5 nonsubwords; 17 subwords).\n", + "P has 5 words (5 nonsubwords; 0 bridge words).\n", + "S has 23 letters; W has 90; nonsubs have 37.\n", + "The average overlap in P is 3.50 letters.\n", + "The compression ratio (W/S) is 3.91.\n" + ] + } + ], + "source": [ + "report(S1, W2, P1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Unit Tests\n", + "# Next Steps\n", "\n", - "This gives some examples of how the functions are used, and some assurance that they are doing the right thing." - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'pass'" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def test(D):\n", - " W = {'on', 'one', 'two', 'won'}\n", - " S = 'twone'\n", - " L = ['two', 'won', 'on', 'one']\n", - " assert portman(L) == S\n", - " assert all(w in S for w in W)\n", - " assert set(L) == W\n", - " \n", - " assert prefixes('word') == ['w', 'wo', 'wor', 'word']\n", - " assert suffixes('word') == ['word', 'ord', 'rd', 'd']\n", - " assert subwords({'hello', 'world', 'he', 'hell', 'el', 'or'}) == (\n", - " {'el', 'he', 'hell', 'or'})\n", - " assert multimap([(1, 10), (2, 20), (3, 30), (2, 22), (3, 33)]) == (\n", - " {1: [10], 2: [20, 22], 3: [30, 33]})\n", - " \n", - " assert 'portmanteau' in D.words\n", - " assert 'port' in D.words\n", - " assert 'port' not in D.uniq\n", - " assert set(D.pre['hello']) == {'hello', 'helloed', 'helloes', 'helloing', 'hellos'}\n", - " \n", - " assert bridge(D, 'a', 'z') == 'abuzz'\n", - " assert bridge(D, 'a', 't') == 'at'\n", - " assert bridge(D, 'e', 't') == 'eat'\n", - " assert bridge(D, 'f', 'd') == 'fad'\n", - " \n", - " assert 
portman(['two', 'won', 'on', 'one']) == 'twone'\n", - " assert portman(['eskimo', 'kimono', 'monolith']) == 'eskimonolith'\n", + "Each time I restart this notebook, I get a slightly different result. (*Python trivia*: every time you start a new `python` instance, you get a slightly different [`hash`](https://docs.python.org/3.4/reference/datamodel.html#object.__hash__) function, which means that an iteration over a set may yield elements in a different order. This is to prevent a kind of denial of service attack on web servers, but for my program it means that the iteration over `unused` words is in a different order, so the results are different.) I had originally planned to use the **random restart** approach: add some randomness to the word selections and take the best of multiple trials. However, all the trials I have seen so far with my current program fall within a narrow range of 570,000 to 578,000 letters, so I think it is not worth the effort for what would probably be a 1% improvement at best. \n", "\n", - " return 'pass'\n", - " \n", - "test(D)" + "To compare my [program](portman.py) to [Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): I used a greedy approach that incrementally builds up a single long portmanteau, extending it via a bridge when necessary. Murphy first built a pool of smaller portmanteaux, then joined them all together. (I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb) where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments.) The two approaches are different, but it is not clear whether one is better than the other. \n", + "\n", + "In terms of implementation, mine is in Python and is concise (112 lines); Murphy's is in C++ and is verbose (1867 lines). 
Murphy's code does a lot of extra work that mine doesn't: generating diagrams, and running multiple threads in parallel to implement the random restart idea. It appears Murphy didn't quite have the complete concept of **subwords**: he did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'` in two different places. My guess is that when he adds `'spectaculars'` he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. Support for this view is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs\", but I computed that there are 44,320 such words; he found about half of them. I think those missing 20,001 words are the main reason why my shortest string ([570,002 letters](natalie570002.txt)) is about 40,000 letters shorter than his (611,820 letters).\n", + "\n", + "I'll stop here, but you should feel free to do some experimentation of your own. Some ideas:\n", + "\n", + "- With the set of 108,709 words it is always possible to bridge from any letter to any other letter in at most two steps. But for smaller word lists, that might not be the case. You could consider three-word bridges, and consider what to do when there is no bridging sequence at all (perhaps back up and remove a previously-placed word; perhaps use a beam search to keep several alternatives open at once; perhaps allow the addition of words to the start as well as the end of `P`).\n", + "- Use linguistic resources (such as [pretrained word embeddings](https://nlp.stanford.edu/projects/glove/)) to teach your program what words are related to each other. 
Encourage the program to place related words next to each other.\n", + "- Use linguistic resources (such as [NLTK](https://github.com/nltk/)) to teach your program where syllable breaks are in words. Encourage the program to make overlaps match syllables. (That's why \"preferendumdums\" sounds better than \"fortyphonshore\".)\n", + "- Here are some ideas to minimize the length of $S$. Can you implement these or think of your own?\n", + " - Get better overlap of bridging words, perhaps by implementing the multi-letter-to-multi-letter approach.\n", + " - Use fewer bridging words by planning ahead to avoid the need for them. Perhaps `compute_startswith` could sort the words in each key's bucket so that the \"difficult\" words (say, the ones that end in unusual letters) are encountered earlier in the program's execution, when there are more available words for them to connect to.\n", + " - You can't alter the number of nonsubwords, but you might be able to get better overlap between them. Perhaps make more informed choices of which ones to use when. For example, if there is an affix that is the prefix of only one word, and the suffix of only one other word, then probably those two words should overlap.\n", + " - Alter either Murphy's code or mine to combine his idea of joining particles with my idea of completely handling subwords.\n", + " - Use a completely different strategy. What can you come up with?" ] } ], @@ -629,7 +1659,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.0" + "version": "3.7.6" } }, "nbformat": 4,