From 04fab5aee0a8a1b1f990558f11ff1a5ef90a8961 Mon Sep 17 00:00:00 2001 From: Peter Norvig Date: Mon, 22 Jun 2020 16:58:58 -0700 Subject: [PATCH] Add files via upload --- ipynb/Portmantout.ipynb | 2067 +++++++++++++++++++-------------------- 1 file changed, 999 insertions(+), 1068 deletions(-) diff --git a/ipynb/Portmantout.ipynb b/ipynb/Portmantout.ipynb index 4e97467..e9d0e1b 100644 --- a/ipynb/Portmantout.ipynb +++ b/ipynb/Portmantout.ipynb @@ -4,41 +4,81 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "
Peter Norvig
\n", + "
Peter Norvig
Dec 2018
Updated Jun 2020
\n", "\n", "# Portmantout Words\n", "\n", - "A [***portmanteau***](https://en.wikipedia.org/wiki/Portmanteau) is a word that squishes together two words, like *math* + *athlete* = *mathlete*, or *tutankhamenability*, the property of being amenable to seeing the Egyptian exhibit. Inspired by [**Darius Bacon**](http://wry.me/), I covered this as a programming exercise in my 2012 [Udacity course](https://www.udacity.com/course/design-of-computer-programs--cs212). Recently I was re-inspired by [**Tom Murphy VII**](http://tom7.org/), who added a new twist: [***portmantout words***](http://www.cs.cmu.edu/~tom7/portmantout/) (*tout* from the French for *all*), which are defined as:\n", + "A [***portmanteau***](https://en.wikipedia.org/wiki/Portmanteau) is a word that squishes together two words, like *math* + *athlete* = *mathlete*. Inspired by [**Darius Bacon**](http://wry.me/), I covered this as a programming exercise in my 2012 [**Udacity course**](https://www.udacity.com/course/design-of-computer-programs--cs212). In 2018 I was re-inspired by [**Tom Murphy VII**](http://tom7.org/), who added a new twist: [***portmantout words***](http://www.cs.cmu.edu/~tom7/portmantout/) (*tout* from the French for *all*), which are defined as:\n", "\n", "> A **portmantout** of a set of words $W$ is a string $S$ such that:\n", "* Every word in $W$ is a **substring** of $S$.\n", - "* The words **overlap**: each word (except the first) must start at an index that is between the beginning and end of another word.\n", + "* The words **overlap**: every word (except the first) starts at an index that is equal to or before the end of another word.\n", "* **Nothing else** is in $S$: every letter in $S$ comes from the overlapping words. (But a word may be repeated any number of times.)\n", "\n", - "Although not part of the definition, the goal is to get as short an $S$ as possible, and to do it for a $W$ of 100,000 words or so. Developing a program to do that is the goal of this notebook. My program (also available as [portman.py](portman.py)) helped me discover:\n", + "Although not part of the definition, the goal is to get as short an $S$ as possible, and to do it for a set of about 100,000 words. Developing a program to do that is the goal of this notebook. My program (also available as [`portman.py`](portman.py)) helped me discover:\n", "\n", - "- **preferendumdums**: a political commentary portmantout of {prefer, referendum, dumdums}\n", - "- **fortyphonshore**: a dire weather report portmantout of {forty, typhons, onshore}; \n", - "- **allegestionstage**: a brutal theatre critic portmantout of {alleges, egestions, onstage}. \n", - "- **skymanipulablearsplittingler**: a nerve-damaging aviator portmantout of {skyman, manipulable, blears, earsplitting, tinglers}\n", - "- **edinburgherselflesslylyricize**: a Scottish music review portmantout of {edinburgh, burghers, herself, selflessly, slyly, lyricize}\n", + "- **preferendumdums**: a political commentary portmanteau of {prefer, referendum, dumdums}\n", + "- **fortyphonshore**: a dire weather report portmanteau of {forty, typhons, onshore}; \n", + "- **allegestionstage**: a brutal theatre critic portmanteau of {alleges, egestions, onstage}. 
\n", + "- **skymanipulablearsplittingler**: a nerve-damaging aviator portmanteau of {skyman, manipulable, blears, earsplitting, tinglers}\n", + "- **edinburgherselflesslylyricize**: a Scottish music review portmanteau of {edinburgh, burghers, herself, selflessly, slyly, lyricize}\n", + "- **impromptutankhamenability**: a portmanteau of {impromptu, tutankhamen, amenability}, indicating a willingness to see the Egyptian exhibit on the spur of the moment.\n", + "- **dashikimonogrammarianarchy**: a portmanteau of {dashiki, kimono, monogram, grammarian, anarchy}, describing the chaos that ensues when a linguist gets involved in choosing how to enscribe African/Japanese garb. \n", + "\n", + "# Problem-Solving Strategy\n", + "\n", + "Although I haven't proved it, my intuition is that finding a shortest $S$ is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find an optimal solution in a reasonable amount of time. A common approach to NP-hard problems is a **greedy algorithm**: make the locally best choice at each step, in the hope that the steps will fit together into a solution that is not too far from the globally best solution. \n", + "\n", + "Thus, my approach will be to build up a **path**, starting with one word, and then adding **steps** to the end of the evolving path, one at a time. Each step consists of a word from *W* that overlaps the end of the previous word by at least one letter. I will choose the step that seems to be the best choice at the time (the one with the **lowest cost**), and will never undo a step, even if the path seems to get stuck later on. I distinguish two types of steps:\n", + "\n", + "- **Unused word step**: using a word for the first time. (Once we use them all, we're done.)\n", + "- **Bridging word step**: if no remaining unused word overlaps the previous word, we need to do something to get back on track to consuming unused words. I call that something a **bridge**: a step that **repeats** a previously-used word and has a word ending that matches the start of some unused word. Repeating a word may add letters to the final length of $S$, and it doesn't get us closer to the requirement of using all the words, but it is sometimes necessary. Sometimes two words are required to build a bridge, but never more than two (at least with the 100,000 word set we will be dealing with). \n", + "\n", + "There's actually a third type of word, but it doesn't need a corresponding type of step: a **subword** is a word that is a substring of another word. If, say, `ajar` is in $W$, then we know we have to place it in some step along the path. But if `jar` is also in $W$, we don't need a step for it—whenever we place `ajar`, we have automatically placed `jar`. The program can save computation time save time by initializing the set of **unused words** to be the **nonsubwords** in $W$. (We are still free to use a subword as a **bridging word** if needed.)\n", + "\n", + "(*English trivia:* I use the clumsy term \"nonsubwords\" rather than \"superwords\", because there are a small number of words, like \"cozy\" and \"pugs\" and \"july,\" that are not subwords of any other words but are also not superwords: they have no subwords.)\n", + "\n", + "If we're going to be **greedy**, we need to know what we're being greedy about. I will always choose a step that minimizes the **number of excess letters** that the step adds. 
To be precise: I arbitrarily assume a baseline model in which all the words are concatenated with no repeated words and no overlap between them. (That's not a valid solution, but it is useful as a benchmark.) So if I add an unused word, and overlap it with the previous word by three letters, that is an excess of -3 (a negative excess is a positive thing): I've saved three letters over just concatenating the unused word. For a bridging word, the excess is the number of letters that do not overlap either the previous word or the next word. \n", + "\n", + "**Examples:** In each row of the table below, `'ajar'` is the previous word, but each row makes different assumptions about what unused words remain, and thus we get different choices for the step to take. The table shows the overlapping letters between the previous word and the step, and in the case of bridges, it shows the next unused word that the step is bridging to. In the final two columns we give the excess score and excess letters.\n", + "\n", + "|Previous|Step(s)|Overlap|Bridge to|Type of Step|Excess|Letters|\n", + "|--------|----|----|---|---|---|---|\n", + "| ajar|jarring|jar||unused word |-3| |\n", + "| ajar|arbitrary|ar||unused word |-2||\n", + "| ajar|argot|ar|goths|one-step bridge |0||\n", + "| ajar|arrow|ar|owlets| one-step bridge|1|r|\n", + "| ajar|rani+iraq|r|quizzed| two-step bridge|5 |anira | \n", + "\n", + "Let's go over the example steps one at a time:\n", + "- **jarring**: Here we assume `jarring` is an unused word, and it overlaps by 3 letters, giving it an excess cost of -3, which is the best possible (an overlap of 4 would mean `ajar` is a subword, and we already agreed to eliminate subwords).\n", + "- **arbitrary**: an unused word that overlaps by 2 letters, so it would only be chosen if there were no unused words with 3-letter overlap.\n", + "- **argot**: From here on down, we assume there are no unused words that overlap, which means that `argot` is a best-possible bridge, because it completely overlaps the `ar` in `ajar` and the `got` in an unused word, `goths`.\n", + "- **arrow**: overlaps the `ar` in `ajar` and the `ow` in an unused word, `owlets`, but that leaves one excess letter, `r`, for an excess cost of 1.\n", + "- **rani + iraq**: Suppose `quizzed` is the only remaining unused word. There is no single word that bridges from any suffix of `ajar` to any prefix of `quizzed`. But I have arranged to build two-word bridges for every combination of one letter on the left and one letter on the right (unless there was already a shorter one-word bridge for that combination). The bridge from `'r'` to `'q'` is `rani` followed by `iraq`, which has an excess score of 5 due to the letters `anira` not overlapping anything.\n", + "\n", + "We see that unused word steps always have a negative excess cost (that's good) while bridge steps always have a zero or positive excess cost; thus an unused word step is always better than a bridge step (according to this metric).\n", "\n", "\n", "\n", - "# Program Design\n", "\n", - "I originally thought I would define a major function, `S = portman(W)`, to generate the portmantout string, and a minor function, `is_portman(W, S)`, to verify the result. But I found the verification process was difficult. For example, given `S = '...helloworld...'` I would reject that as non-overlapping if I parsed it as `'hello'` + `'world'`, but I would accept it if parsed as `'hell'` + `'low'` + `'world'`. 
It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended, but discarded the information. \n",
    "\n",
    "Therefore, I decided to change the interface: I'll have one function that takes $W$ as input and returns what I call a **portmantout proof**, $P$. I can gain insight by examining $P$, and I can pass $P$ to a second function that can easily generate the string $S$ while verifying the proof. I decided on the following calling and [naming](https://en.wikipedia.org/wiki/Natalie_Portman) conventions:\n",
    "\n",
    "    P = natalie(W)    # Generate a portmantout proof P from a set of words W\n",
    "    S = portman(P, W) # Verify that the proof is valid and compute the string S\n",
    "\n",
    "or in other words:\n",
    "\n",
    "    S = portman(natalie(W), W)  # Generate a portmantout of W\n",
    "\n",
    "The proof $P$ is in the form of an ordered list, `[(overlap, word),...]` where each `word` is a member of $W$ and each `overlap` is an integer saying how many letters in the word overlap with the previous word (this should be 0 for the first word and positive for subsequent words). For example:"
    "# Data Type Implementation\n",
    "\n",
    "Here I describe how to implement the main data types in Python:\n",
    "\n",
    "- **Word**: a Python `str` (as are subparts of words, like suffixes or individual letters).\n",
    "- **Wordset**: a subclass of `set`, denoting a set of words.\n",
    "- **Path**: a Python `list` of steps.\n",
    "- **Step**: a tuple of `(overlap, word)`. The step adding `jarring` to `ajar` would be `(3, 'jarring')`, indicating an overlap of three letters. The first step in a path should have an overlap of 0; all others should have a positive integer overlap.\n",
    "- **Bridge**: a tuple of an excess cost followed by one or two steps, e.g. `(1, (2, 'arrow'))`.\n",
    "- **Bridges**: a precomputed and cached table mapping a suffix and a prefix to a bridge. For example:\n",
    "\n",
    "      W.bridges['ar']['ow'] == (1, (2, 'arrow'))\n",
    "      W.bridges['r']['q']   == (5, (1, 'rani'), (1, 'iraq'))\n",
    "\n",
    "(*Python trivia:* I implemented `Wordset` as a subclass of `set` so that I can add attributes like `W.bridges`. You can do that with a user-defined subclass, but not with a builtin class.)\n",
    "\n",
    "In the following code, data types are **Capitalized** and indexes into tuples are **UPPERCASE**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "Words = set \n",
    "Proof = list\n",
    "\n",
    "W1: Words = {'anarchy', 'eskimo', 'grammarian', 'kimono', 'monogram'}\n",
    "S1: str = 'eskimonogrammarianarchy'\n",
    "P1: Proof = [(0, 'eskimo'),\n",
    "             (4, 'kimono'),\n",
    "             (4, 'monogram'),\n",
    "             (4, 'grammarian'),\n",
    "             (2 , 'anarchy')]"
    "from collections import defaultdict, Counter\n",
    "from typing import List, Tuple, Set, Dict, Any\n",
    "\n",
    "Word = str\n",
    "class Wordset(set): \"\"\"A set of words.\"\"\"\n",
    "Step = Tuple[int, str]  # An (overlap, word) pair.\n",
    "OVERLAP, WORD = 0, 1    # Indexes of the two parts of a Step.\n",
    "Path = List[Step]       # A list of steps.\n",
    "Bridge = (int, Step,...) # An excess letter count and step(s), e.g. (1, (2, 'arrow')).\n",
    "EXCESS, STEPS = 0, slice(1, None) # Indexes of the two parts of a bridge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# portman\n",
    "\n",
    "The function `portman(P, W)` takes a proof $P$ and a set of words $W$ and generates the portmantout string $S$ while verifying that the proof is correct (or raising an `AssertionError` if it is not). 
Assertions are appropriate because I'm thinking of this as one part of my program verifying the internal logic of another part. If I was running a service to verify other people's proofs, I would not use `assert` statements." + "We can make Tom Murphy's 108,709-word list `\"wordlist.asc\"` into a `Wordset`, $W$:" ] }, { @@ -74,7 +114,32 @@ "metadata": {}, "outputs": [], "source": [ - "from collections import defaultdict, Counter" + "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc\n", + "\n", + "W = Wordset(open('wordlist.asc').read().split()) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Overall Program Design\n", + "\n", + "I thought I would define a major function, `portman`, to generate the portmantout string $S$ from the set of words $W$ according to the strategy outlined above, and a minor function, `is_portman`, to verify the result. But I found verification to be difficult. For example, given $S =$ `'...helloworld...'` I would reject that as non-overlapping if I parsed it as `'hello'` + `'world'`, but I would accept it if parsed as `'hell'` + `'low'` + `'world'`. It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended, but discarded the information. \n", + "\n", + "Therefore, I decided to change the interface: I'll have one function that takes $W$ as input and returns a path $P$, and a second function to generate the string $S$ from $P$. I decided on the following calling and [naming](https://en.wikipedia.org/wiki/Natalie_Portman) conventions:\n", + "\n", + " P = natalie(W) # Find a portmantout path P for a set of words W\n", + " is_portman(P, W) # Verify whether P is a valid path covering words W\n", + " S = portman(P) # Compute the string S from the path P\n", + "\n", + "Thus I can generate a string $S$ with:\n", + "\n", + " S = portman(natalie(W))\n", + " \n", + "# portman\n", + "\n", + "Here is the definition of `portman` and the result of `portman(P1)`, covering the word set `W1` (which has five superwords and 16 subwords):" ] }, { @@ -83,39 +148,26 @@ "metadata": {}, "outputs": [], "source": [ - "def portman(P: Proof, W: Words) -> str:\n", - " \"\"\"Compute the portmantout string S from the proof P; verify that it covers W.\"\"\"\n", - " S = []\n", - " prev_word = ''\n", - " for (overlap, word) in P:\n", - " assert word in W, f'nothing else is allowed in S: {word}'\n", - " left, right = word[:overlap], word[overlap:] # Split word into two parts\n", - " assert overlap >= 0 and left == prev_word[-overlap:], f'the words must overlap: {prev_word, word}'\n", - " S.append(right)\n", - " prev_word = word\n", - " S = ''.join(S)\n", - " assert all(w in S for w in W), 'each word in W must be a substring of S'\n", - " return S" + "def portman(P: Path) -> Word:\n", + " \"\"\"Compute the portmantout string S from the path P.\"\"\"\n", + " return ''.join(word[overlap:] for (overlap, word) in P)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'eskimonogrammarianarchy'" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "portman(P1, W1)" + "W1 = Wordset(('anarchy', 'dashiki', 'grammarian', 'kimono', 'monogram',\n", + " 'an', 'am', 'arc', 'arch', 'aria', 'as', 'ash', 'dash', \n", + " 'gram', 'grammar', 'i', 'mar', 'narc', 'no', 'on', 'ram'))\n", + "P1 = [(0, 'dashiki'),\n", + " (2, 'kimono'),\n", + " (4, 'monogram'),\n", + " 
(4, 'grammarian'),\n",
    "     (2, 'anarchy')]\n",
    "S1 = 'dashikimonogrammarianarchy'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'helloworld'"
       "'dashikimonogrammarianarchy'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "portman([(0, 'hell'), (1, 'low'), (1, 'world')], {'hell', 'low', 'world'})"
    "portman(P1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Subwords\n",
    "\n",
    "I want to introduce the concept of **subwords**. The following set of words `W2` has 17 more words than `W1`:"
    "# natalie\n",
    "\n",
    "As stated above, the approach is to start with one word (either given as an optional argument to `natalie` or chosen arbitrarily from the word set $W$), and then repeatedly add steps, each step being either an unused word or a bridging word. In order to make this process efficient, we arrange for `precompute(W)` to cache the following information in `W`:\n",
    " - `W.subwords`: a set of all the words that are contained within another word in `W`.\n",
    " - `W.bridges`: a dict where `W.bridges[suf][pre]` gives the best bridge between the two word affixes.\n",
    " - `W.unused`: initially the set of nonsubwords in `W`; when a word is used it is removed from the set.\n",
    " - `W.startswith`: a dict that maps from a prefix to all the unused words that start with the prefix. A word is removed from all the places it appears when it is used. Example: `W.startswith['somet'] == {'something', 'sometimes'}`.\n",
    " \n",
    "These structures are somewhat complicated, so don't be discouraged if you have to go over a line of code or a prose description word-by-word several times before you understand exactly how it works.\n",
    "\n",
    "After the precomputation, `natalie` loops until there are no more unused words. On each turn we call `unused_step`, which returns a list of one step if an unused word overlaps, or the empty list if it doesn't, in which case we call `bridging_steps`, which always returns a bridge of either one or two steps. We then append the step(s) to the path `P`, and call `used(W, word)` to mark that `word` has been used (reducing the size of `W.unused` and updating `W.startswith` if `word` was previously unused). \n",
    "\n",
    "It is important that every bridge leads to an unused word. That way we know the program will **always terminate**: if $N$ is the number of unused nonsubwords in $W$, then consider the quantity $(2N - (1$ `if` the last step overlaps an unused word `else` $0))$. Every iteration of the `while` loop decreases this quantity by at least 1; therefore the quantity will eventually be zero, and when it is zero, it must be that `W.unused` is empty and the loop terminates.\n",
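    "\n",
    "For example, with $N = 4$ and a last step that does not overlap any unused word, the quantity is $2N - 0 = 8$. A bridging step leaves $N$ unchanged but makes the last step overlap an unused word (the quantity drops to 7); the next step then consumes an unused word, giving $N = 3$ (the quantity drops to 6 or 5); and so on down to zero."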
]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "W2 = {'anarchy', 'eskimo', 'grammarian', 'kimono', 'monogram', \n",
    "      'a', 'am', 'an', 'arc', 'arch', 'aria', 'gram', 'grammar', \n",
    "      'i', 'mar', 'mono', 'narc', 'no', 'on', 'ram', 'ski', 'skim'}"
    "def natalie(W: Wordset, start=None) -> Path:\n",
    "    \"\"\"Return a portmantout path containing all words in W.\"\"\"\n",
    "    precompute(W)\n",
    "    word = start or first(W.unused)\n",
    "    used(W, word)\n",
    "    P = [(0, word)]\n",
    "    while W.unused:\n",
    "        steps = unused_step(W, word) or bridging_steps(W, word)\n",
    "        for (overlap, word) in steps:\n",
    "            P.append((overlap, word))\n",
    "            used(W, word)\n",
    "    return P"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But our old `P1` still works as a portmantout proof of `W2`, yielding the same string:"
    "`unused_step` considers every suffix of the previous word, longest suffix first. If a suffix starts any unused words, we choose the first such word as the step. Since we're going longest-suffix first, no other word could do better, and since unused steps have negative costs and bridges don't, the unused step will always be better than any bridge.\n",
    "\n",
    "`bridging_steps` also tries every suffix of the previous word, and for each one it looks in the `W.bridges[suf]` table to see what prefixes (of unused words) we can bridge to from this suffix. Consider all such `W.bridges[suf][pre]` entries that bridge to the prefix of an unused word (as maintained in `W.startswith[pre]`). Out of all such bridges, take one with the minimal excess cost, and return the one- or two-step sequence that makes up the bridge.\n",
    "\n",
    "So, both `unused_step` and `bridging_steps` return a list of steps: the former returns either zero or one steps; the latter either one or two steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'eskimonogrammarianarchy'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "portman(P1, W2)"
    "def unused_step(W: Wordset, prev_word: Word) -> List[Step]:\n",
    "    \"\"\"Return [(overlap, unused_word)] or [].\"\"\"\n",
    "    for suf in suffixes(prev_word):\n",
    "        for unused_word in W.startswith.get(suf, ()):\n",
    "            overlap = len(suf)\n",
    "            return [(overlap, unused_word)]\n",
    "    return []\n",
    "\n",
    "def bridging_steps(W: Wordset, prev_word: Word) -> List[Step]:\n",
    "    \"\"\"The steps from the shortest bridge that bridges \n",
    "    from a suffix of prev_word to a prefix of an unused word.\"\"\"\n",
    "    bridge = min(W.bridges[suf][pre] \n",
    "                 for suf in suffixes(prev_word) if suf in W.bridges\n",
    "                 for pre in W.bridges[suf] if W.startswith[pre])\n",
    "    return bridge[STEPS]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works because the 17 new words in `W2` are all **subwords** of the first five words. If a superword like `'monogram'` is included in the proof $P$ and thus in the string $S$, then subwords like `'on'`, `'no'`, and `'gram'` are automatically included, without having to explicitly list them in $P$. 
\n", + "(*Python trivia:* in `unused_step` I do `W.startswith.get(suf, ())`, not `W.startswith[suf]` because the dict in question is a `defaultdict(set)`, and if there is no entry there, I don't want to insert an empty set entry.)\n", "\n", - "We can compute the subwords (and from that, nonsubwords) of a set of words as follows:" + "**Failure is not an option**: what happens if we have a small word set that can't make a portmantout? First `unused_step` will fail to find an unused word, which is fine; then `bridging_steps` will fail to find a bridge, and will raise `ValueError: min() arg is an empty sequence`. You could catch that error and return, say, an empty path if you wanted to, but my intended use is for word sets where this can never happen." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# precompute etc.\n", + "\n", + "Here are a bunch of the subfunctions that make the code above work:" ] }, { @@ -200,13 +287,54 @@ "metadata": {}, "outputs": [], "source": [ - "def subwords(W: Words) -> Words:\n", - " \"\"\"All the words in W that are subparts of some other word.\"\"\"\n", - " wordparts = (subparts(w) & W for w in W)\n", - " return set().union(*wordparts)\n", + "def precompute(W):\n", + " \"\"\"Precompute and cache data structures for W. The .subwords and .bridges\n", + " data structures are static and only need to be computed once; .unused and\n", + " .startswith are dynamic and must be recomputed on each call to `natalie`.\"\"\"\n", + " if not hasattr(W, 'subwords') or not hasattr(W, 'bridges'): \n", + " W.subwords = subwords(W)\n", + " W.bridges = build_bridges(W)\n", + " W.unused = W - W.subwords\n", + " W.startswith = compute_startswith(W.unused)\n", + " \n", + "def used(W, word):\n", + " \"\"\"Remove word from `W.unused` and, for each prefix, from `W.startswith[pre]`.\"\"\"\n", + " assert word in W, f'used \"{word}\", which is not in the word set'\n", + " if word in W.unused:\n", + " W.unused.remove(word)\n", + " for pre in prefixes(word):\n", + " W.startswith[pre].remove(word)\n", + " if not W.startswith[pre]:\n", + " del W.startswith[pre]\n", + " \n", + "def first(iterable, default=None): return next(iter(iterable), default)\n", "\n", - "def subparts(word) -> set:\n", - " \"\"\"All non-empty proper substrings of this word\"\"\"\n", + "def multimap(pairs) -> Dict[Any, set]:\n", + " \"\"\"Given (key, val) pairs, make a dict of {key: {val,...}}.\"\"\"\n", + " result = defaultdict(set)\n", + " for key, val in pairs:\n", + " result[key].add(val)\n", + " return result\n", + "\n", + "def compute_startswith(words) -> Dict[str, Set[Word]]: \n", + " \"\"\"A dict mapping a prefix to all the words it starts:\n", + " {'somet': {'something', 'sometimes'},...}.\"\"\"\n", + " return multimap((pre, w) for w in words for pre in prefixes(w))\n", + "\n", + "def subwords(W: Wordset) -> Set[str]:\n", + " \"\"\"All the words in W that are subparts of some other word.\"\"\"\n", + " return {subword for w in W for subword in subparts(w) & W} \n", + " \n", + "def suffixes(word) -> List[str]:\n", + " \"\"\"All non-empty proper suffixes of word, longest first.\"\"\"\n", + " return [word[i:] for i in range(1, len(word))]\n", + "\n", + "def prefixes(word) -> List[str]:\n", + " \"\"\"All non-empty proper prefixes of word.\"\"\"\n", + " return [word[:i] for i in range(1, len(word))]\n", + "\n", + "def subparts(word) -> Set[str]:\n", + " \"\"\"All non-empty proper substrings of word\"\"\"\n", " return {word[i:j] \n", " for i in range(len(word)) \n", " for j in range(i + 1, 
len(word) + (i > 0))}" @@ -216,29 +344,43 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "(*Python trivia:* the `(i > 0)` in `subparts` means that for a four-letter word like `'skim'`, we include subparts `word[i:4]` except when `i == 0`, thus including `'kim'`, `'im'`, and `'m'`, but not `'skim'`. In Python it is considered good style to have a Boolean expression like `(i > 0)` automatically converted to an integer `0` or `1`.)\n", + "(*Python trivia:* the `(i > 0)` in the last line of `subparts` means that for a four-letter word like `'gram'`, we include subparts `word[i:4]` except when `i == 0`, thus including `'ram'`, `'am'`, and `'m'`, but not `'gram'`. In Python it is considered good style to have a Boolean expression like `(i > 0)` automatically converted to an integer `0` or `1`.)\n", "\n", - "(*English trivia:* I use the clumsy term \"nonsubwords\" rather than \"superwords\", because there are a couple dozen words, like \"cozy\" and \"pugs\" and \"july,\" that are not subwords of any other words but are also not superwords: they have no subwords.)" + "(*Math trivia:* \"Proper\" means \"not whole\". A proper subset is a subset that is not the whole set itself; a proper subpart of a word is a part (i.e. substring) of the word that is not the whole word itself.)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building Bridges\n", + "\n", + "The last piece of the program is the construction of the `W.bridges` table. Recall that we want `W.bridges[suf][pre]` to be a bridge between a suffix of the previous word and a prefix of an unused word, as in the examples:\n", + "\n", + " W.bridges['ar']['ow'] == (1, (2, 'arrow'))\n", + " W.bridges['ar']['c'] == (0, (2, 'arc'))\n", + " W.bridges['r']['q'] == (5, (1, 'rani'), (1, 'iraq'))\n", + " \n", + "We build all the bridges once and for all in `precompute`, and don't update them as words are used. Thus, `W.bridges['r']['q']` says \"if there are any unused words starting with `'q'`, you can use this bridge, but I'm not promising there are any.\" The caller (i.e. `bridging_steps`) is responsible for checking that `W.startswith['q']` contains unused word(s).\n", + " \n", + "Bridges should be short. We don't need to consider `antidisestablishmentarianism` as a possible bridge word. Instead, from our 108,709 word set $W$, we'll select the 10,273 words with length up to 5, plus 20 six-letter words that end in any of 'qujvz', the rarest letters. (For other word sets, you may have to tune these parameters.) I call these `shortwords`. I also compute a `shortstartswith` table for the `shortwords`, where, for example,\n", + "\n", + " shortstartswith['som'] == {'soma', 'somas', 'some'} # but not 'somebodies', 'somersaulting', ...\n", + " \n", + "To build one-word bridges, consider every shortword, and split it up in all possible ways into a prefix that will overlap the previous word, a suffix that will overlap the next word, and a count of zero or more excess letters in the middle that don't overlap anything. 
For example:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'i', 'im', 'k', 'ki', 'kim', 'm', 's', 'sk', 'ski'}" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "subparts('skim') # The subparts (non-empty proper substrings) of a word" + "def splits(word) -> List[Tuple[int, str, str]]: \n", + " \"\"\"A sequence of (excess, pre, suf) tuples.\"\"\"\n", + " return [(excess, word[:i], word[i+excess:])\n", + " for excess in range(len(word) - 1)\n", + " for i in range(1, len(word) - excess)]" ] }, { @@ -249,7 +391,16 @@ { "data": { "text/plain": [ - "{'anarchy', 'eskimo', 'grammarian', 'kimono', 'monogram'}" + "[(0, 'a', 'rrow'),\n", + " (0, 'ar', 'row'),\n", + " (0, 'arr', 'ow'),\n", + " (0, 'arro', 'w'),\n", + " (1, 'a', 'row'),\n", + " (1, 'ar', 'ow'),\n", + " (1, 'arr', 'w'),\n", + " (2, 'a', 'ow'),\n", + " (2, 'ar', 'w'),\n", + " (3, 'a', 'w')]" ] }, "execution_count": 10, @@ -258,50 +409,37 @@ } ], "source": [ - "W2 - subwords(W2) # These nonsubwords must be in the proof" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'a',\n", - " 'am',\n", - " 'an',\n", - " 'arc',\n", - " 'arch',\n", - " 'aria',\n", - " 'gram',\n", - " 'grammar',\n", - " 'i',\n", - " 'mar',\n", - " 'mono',\n", - " 'narc',\n", - " 'no',\n", - " 'on',\n", - " 'ram',\n", - " 'ski',\n", - " 'skim'}" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "subwords(W2) # These subwords don't need to appear in a proof" + "splits('arrow')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now that we have the notion of subwords, we can modify `portman` to be a bit more efficient and concise: " + "The first element of the list says that `'arrow'` can bridge from `'a'` to `'rrow'` with 0 excess letters; the last says it can bridge from `'a'` to `'w'` with 3 excess letters (which happen to be `'rro'`). We consider every possible split, and pass it on to `try_bridge`, which records the bridge in the table under `bridges[pre][suf]` unless there is already a shorter bridge stored there." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "def try_bridge(bridges, pre, suf, excess, word, step2=None):\n", + " \"\"\"Store a new bridge if it has less excess than the previous bridges[pre][suf].\"\"\"\n", + " if suf not in bridges[pre] or excess < bridges[pre][suf][EXCESS]:\n", + " bridge = (excess, (len(pre), word))\n", + " if step2: bridge += (step2,)\n", + " bridges[pre][suf] = bridge" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now for two-word bridges. I thought that if I allowed all possible two-word bridges the program would be slow because there would be so many of them, and most of them would be too long to be of any use. Thus, I decided to only use two-word bridges that bridge from the last letter in the previous word to the first letter in an unused word.\n", + "\n", + "We start out the same way, looking at every shortword. But this time we look at every suffix of each shortword, and see if the suffix starts another shortword. If it does, then we have a two-word bridge. 
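For example, the shortword `rani` has the suffix `i`, which starts the shortword `iraq`; that pair yields the bridge from `'r'` to `'q'` shown earlier, `(5, (1, 'rani'), (1, 'iraq'))`, with excess `4 + 4 - 1 - 2 == 5`. 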
Here's the complete `build_bridges` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def portman(P: Proof, W: Words) -> str:\n",
    "    \"\"\"Compute the portmantout string S from the proof P; verify that it covers W.\"\"\"\n",
    "    assert (W - subwords(W)) <= set(w for _, w in P) <= W, \"all the words in W and nothing else\"\n",
    "    S = []\n",
    "    prev_word = ''\n",
    "    for (overlap, word) in P:\n",
    "        left, right = word[:overlap], word[overlap:] # Split word into two parts\n",
    "        assert overlap >= 0 and left == prev_word[-overlap:], f'the words must overlap: {prev_word, word}'\n",
    "        S.append(right)\n",
    "        prev_word = word\n",
    "    return ''.join(S)"
    "def build_bridges(W: Wordset, maxlen=5, end='qujvz'):\n",
    "    \"\"\"A table of bridges[pre][suf] == (excess, (overlap, word)), e.g.\n",
    "    bridges['ar']['c'] == (0, (2, 'arc')).\"\"\"\n",
    "    bridges = defaultdict(dict)\n",
    "    shortwords = [w for w in W if len(w) <= maxlen + (w[-1] in end)]\n",
    "    shortstartswith = compute_startswith(shortwords)\n",
    "    # One-word bridges\n",
    "    for word in shortwords: \n",
    "        for excess, pre, suf in splits(word):\n",
    "            try_bridge(bridges, pre, suf, excess, word)\n",
    "    # Two-word bridges\n",
    "    for word1 in shortwords:\n",
    "        for suf in suffixes(word1): \n",
    "            for word2 in shortstartswith[suf]: \n",
    "                excess = len(word1) + len(word2) - len(suf) - 2\n",
    "                A, B = word1[0], word2[-1]\n",
    "                if A != B:\n",
    "                    step2 = (len(suf), word2)\n",
    "                    try_bridge(bridges, A, B, excess, word1, step2)\n",
    "    return bridges"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(*Note:* When we are *using* a bridge, we say `W.bridges[suf][pre]` but when we are *building* the bridge we say `W.bridges[pre][suf]`. It may seem confusing to write the code both ways, but that's because the very definition of **overlapping** is that the suffix of one word is the same as the prefix of the next, so it just depends how you are looking at it.)\n",
    "\n",
    "**Missing bridges:** How do we know if a word set will always generate a portmantout? If the word set has a bridge from every one-letter suffix to every other one-letter prefix, then it will always be able to bridge from any word to any unused word. (Of course, there are other conditions under which it can succeed.) 
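(Why does one-letter coverage suffice? The previous word always ends in some letter `A`. If an unused word starts with `A`, then `unused_step` already succeeds; otherwise every unused word starts with some letter `B` different from `A`, and `W.bridges[A][B]` supplies a bridge.) 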
The function `missing_bridges` tells us which of these one-letter-to-one-letter bridges are missing:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'eskimonogrammarianarchy'" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "portman(P1, W2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "(*Python trivia:* if `X, Y` and `Z` are sets, `X <= Y <= Z` means \"is `X` a subset of `Y` and `Y` a subset of `Z`?\" We use the notation here to say that the set of words in $P$ must contain all the nonsubwords and can only contain words from $W$.)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Set of 108,709 Words\n", + "def missing_bridges(W):\n", + " \"\"\"What one-letter-to-one-letter bridges are missing from W.bridges?\"\"\"\n", + " return {A + B for A in alphabet for B in alphabet \n", + " if A != B and B not in W.bridges[A]}\n", "\n", - "I will fetch the set of words that Tom Murphy used, and explore it a bit:" + "alphabet = 'abcdefghijklmnopqrstuvwxyz'" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "set()" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc" + "precompute(W)\n", + "missing_bridges(W)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Great! $W$ has no missing bridges. But the tiny word set $W1$ is missing 630 out of a possible 650 bridges:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "(630, 650)" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "W = set(open('wordlist.asc').read().split())" + "precompute(W1)\n", + "len(missing_bridges(W1)), 26 * 25" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Portmantout Solutions\n", + "\n", + "**Finally!** We're ready to make portmantouts. First for the small word list `W1`:" ] }, { @@ -385,7 +559,11 @@ { "data": { "text/plain": [ - "'W has 108,709 words (44,320 subwords and 64,389 nonsubwords)'" + "[(0, 'dashiki'),\n", + " (2, 'kimono'),\n", + " (4, 'monogram'),\n", + " (4, 'grammarian'),\n", + " (2, 'anarchy')]" ] }, "execution_count": 16, @@ -394,10 +572,7 @@ } ], "source": [ - "N = len(W)\n", - "sub = len(subwords(W))\n", - "\n", - "f'W has {N:,d} words ({sub:,d} subwords and {N-sub:,d} nonsubwords)'" + "natalie(W1, start='dashiki')" ] }, { @@ -408,7 +583,7 @@ { "data": { "text/plain": [ - "True" + "'dashikimonogrammarianarchy'" ] }, "execution_count": 17, @@ -417,7 +592,14 @@ } ], "source": [ - "'eskimo' in W" + "portman(natalie(W1, start='dashiki'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's make the portmantout and see how many steps and how many letters it is, and how long it takes:" ] }, { @@ -425,10 +607,18 @@ "execution_count": 18, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 9.1 s, sys: 151 ms, total: 9.25 s\n", + "Wall time: 9.56 s\n" + ] + }, { "data": { "text/plain": [ - "False" + "(103470, 553747)" ] }, "execution_count": 18, @@ -437,7 +627,16 @@ } ], "source": [ - "'waldo' in W # Where's Waldo? 
Not in W"
    "%time P = natalie(W)\n",
    "S = portman(P)\n",
    "len(P), len(S)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can specify the starting word, as Tom Murphy did:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 8.79 s, sys: 103 ms, total: 8.89 s\n",
      "Wall time: 9.06 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(103466, 553742)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%time P = natalie(W, start='portmanteau')\n",
    "S = portman(P)\n",
    "len(P), len(S)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I thought it might take 10 minutes, so under 10 seconds is super. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Making it Prettier\n",
    "\n",
    "Notice I haven't actually *looked* at the portmantout yet. I didn't want to dump half a million letters into an output cell. Instead, I'll define `report` to summarize the results and save the full string $S$ into [a file](natalie.txt). I introduce the notion of **safe bridges**, which means that `W.bridges` contains a bridge from every letter to every other letter. A word set that has that property can always find a solution; word sets that don't might or might not find a solution, depending on the order of step choices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def report(W, P, steps=50, letters=500, save='natalie.txt'):\n",
    "    S = portman(P)\n",
    "    sub = W.subwords\n",
    "    nonsub = W - sub\n",
    "    bridge = len(P) - len(nonsub)\n",
    "    missing = len(missing_bridges(W)) or \"no\"\n",
    "    valid = \"is\" if is_portman(P, W) else \"IS NOT\"\n",
    "    def L(words): return sum(map(len, words)) # Number of letters\n",
    "    print(f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords; {len(sub):,d} subwords).')\n",
    "    print(f'P has {len(P):,d} steps ({len(nonsub):,d} nonsubwords; {bridge:,d} bridge words).')\n",
    "    print(f'S has {len(S):,d} letters; W has {L(W):,d}; nonsubs have {L(nonsub):,d}.')\n",
    "    print(f'P has an average overlap of {(L(w for _,w in P)-len(S))/(len(P)-1):.2f} letters.')\n",
    "    print(f'S has a compression ratio (letters(W)/letters(S)) of {L(W)/len(S):.2f}.')\n",
    "    print(f'P (and thus S) {valid} a valid portmantout of W.')\n",
    "    print(f'W has {missing} missing one-letter-to-one-letter bridges.')\n",
    "    if save:\n",
    "        print(f'S saved as \"{save}\", {open(save, \"w\").write(S)} bytes.')\n",
    "    print(f'\nThe first and last {steps} steps are:\n')\n",
    "    for step in [*P[:steps], '... ...', *P[-steps:]]:\n",
    "        print(step)\n",
    "    print(f'\nThe first and last {letters} letters are:\n\n{S[:letters]} ... 
{S[-letters:]}')\n",
    "\n",
    "def is_portman(P: Path, W: Wordset) -> bool:\n",
    "    \"\"\"Verify that P forms a valid portmantout string for W.\"\"\"\n",
    "    all_words = (W - W.subwords) <= set(w for (_, w) in P) <= W\n",
    "    overlaps = all(overlap > 0 and P[i - 1][1][-overlap:] == word[:overlap]\n",
    "                   for i, (overlap, word) in enumerate(P[1:], 1))\n",
    "    return all_words and overlaps and P[0][OVERLAP] == 0 # first step has 0 overlap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(*Python trivia:* if `X, Y` and `Z` are sets, `X <= Y <= Z` means \"is `X` a subset of `Y` and `Y` a subset of `Z`?\" We use the notation here to say that the set of words in $P$ must contain all the nonsubwords and can only contain words from $W$.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "W has 108,709 words (64,389 nonsubwords; 44,320 subwords).\n",
      "P has 103,466 steps (64,389 nonsubwords; 39,077 bridge words).\n",
      "S has 553,742 letters; W has 931,823; nonsubs have 595,805.\n",
      "P has an average overlap of 1.65 letters.\n",
      "S has a compression ratio (letters(W)/letters(S)) of 1.68.\n",
      "P (and thus S) is a valid portmantout of W.\n",
      "W has no missing one-letter-to-one-letter bridges.\n",
      "S saved as \"natalie.txt\", 553742 bytes.\n",
      "\n",
      "The first and last 50 steps are:\n",
      "\n",
      "(0, 'portmanteau')\n",
      "(2, 'autographic')\n",
      "(7, 'graphicness')\n",
      "(3, 'essayists')\n",
      "(2, 'tsarisms')\n",
      "(1, 'skydiving')\n",
      "(3, 'ingenuously')\n",
      "(3, 'slyest')\n",
      "(4, 'yesterdays')\n",
      "(4, 'daysides')\n",
      "(5, 'sideswiping')\n",
      "(4, 'pingrasses')\n",
      "(5, 'assessee')\n",
      "(3, 'seeking')\n",
      "(4, 'kinged')\n",
      "(2, 'editresses')\n",
      "(3, 'sestets')\n",
      "(5, 'stetsons')\n",
      "(4, 'sonships')\n",
      "(5, 'shipshape')\n",
      "(5, 'shapelessness')\n",
      "(3, 'essayers')\n",
      "(3, 'erstwhile')\n",
      "(5, 'whiles')\n",
      "(3, 'lessening')\n",
      "(3, 'ingested')\n",
      "(4, 'stedhorses')\n",
      "(6, 'horseshoes')\n",
      "(5, 'shoestrings')\n",
      "(5, 'ringsides')\n",
      "(5, 'sideslipping')\n",
      "(3, 'ingrafting')\n",
      "(4, 'tingles')\n",
      "(3, 'lessees')\n",
      "(4, 'seesawing')\n",
      "(4, 'wingedly')\n",
      "(2, 'lyrically')\n",
      "(4, 'allyls')\n",
      "(1, 'syllabicate')\n",
      "(4, 'cateresses')\n",
      "(3, 'sessile')\n",
      "(4, 'silentness')\n",
      "(3, 'essays')\n",
      "(1, 'sheening')\n",
      "(3, 'ingratiating')\n",
      "(5, 'atingle')\n",
      "(6, 'tinglers')\n",
      "(3, 'ersatzes')\n",
      "(3, 'zestfulness')\n",
      "(7, 'fulnesses')\n",
      "... 
...\n", + "(1, 'quarrelled')\n", + "(1, 'deli')\n", + "(1, 'iraq')\n", + "(1, 'quakily')\n", + "(1, 'yoni')\n", + "(1, 'iraq')\n", + "(1, 'quaffing')\n", + "(1, 'gorki')\n", + "(1, 'iraq')\n", + "(1, 'quarts')\n", + "(1, 'sir')\n", + "(2, 'iraq')\n", + "(1, 'qaids')\n", + "(1, 'sir')\n", + "(2, 'iraq')\n", + "(1, 'quarantining')\n", + "(1, 'gorki')\n", + "(1, 'iraq')\n", + "(1, 'qatar')\n", + "(1, 'rani')\n", + "(1, 'iraq')\n", + "(1, 'quinquina')\n", + "(1, 'aqua')\n", + "(3, 'quakier')\n", + "(1, 'rani')\n", + "(1, 'iraq')\n", + "(1, 'quailing')\n", + "(1, 'gorki')\n", + "(1, 'iraq')\n", + "(1, 'quahogs')\n", + "(1, 'sir')\n", + "(2, 'iraq')\n", + "(1, 'quaaludes')\n", + "(1, 'sir')\n", + "(2, 'iraq')\n", + "(1, 'quakerism')\n", + "(1, 'maqui')\n", + "(3, 'quitclaimed')\n", + "(1, 'deli')\n", + "(1, 'iraq')\n", + "(1, 'quarries')\n", + "(1, 'sir')\n", + "(2, 'iraq')\n", + "(1, 'quinols')\n", + "(1, 'sir')\n", + "(2, 'iraq')\n", + "(1, 'quicklime')\n", + "(1, 'emir')\n", + "(2, 'iraq')\n", + "(1, 'quakingly')\n", + "\n", + "The first and last 500 letters are:\n", + "\n", + "portmanteautographicnessayistsarismskydivingenuouslyesterdaysideswipingrassesseekingeditressestetsonshipshapelessnessayerstwhilesseningestedhorseshoestringsideslippingraftinglesseesawingedlyricallylsyllabicateressessilentnessaysheeningratiatinglersatzestfulnessestercesareanschlussresistentorsionallyonnaiseminarsenidesolatediumsquirtingeditoriallysergichoroustaboutstationstagersiltiercelsiusurpingotsaristsaritzastrodometerselysiantitankersparestatingeingrowingspreadsheetsarsaparillasciviouslylyce ... iraquandaryoniraquotidianoiraqianaquaggiestaxiraquodsiraquandobeliraquondamaquiversiraquaffersiraquantumaquicklyoniraquaintlyoniraqurushairaquaverersiraquakiestaxiraquaversiraquizzingorkiraquarrellersiraquicksetsiraquickiesiraquintalsiraquackishnessiraquietismsiraquizzedeliraquailedeliraquarrelledeliraquakilyoniraquaffingorkiraquartsiraqaidsiraquarantiningorkiraqataraniraquinquinaquakieraniraquailingorkiraquahogsiraquaaludesiraquakerismaquitclaimedeliraquarriesiraquinolsiraquicklimemiraquakingly\n" + ] } ], "source": [ - "Counter(w[0] for w in W).most_common() # first letters" + "report(W, P)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exploring\n", + "\n", + "The program is complete, but there are still many interesting things to explore. \n", + "\n", + "**My first question**: is there an imbalance in starting and ending letters of words? That could lead to a need for many two-word bridges. We saw that the last 50 steps of $P$ all involved words that start with `q`, or bridges to them. 
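(Note: the `precompute(W)` call below refreshes `W.unused` and `W.startswith`, which were consumed by the run of `natalie` above; those two dynamic structures are recomputed on every call to `precompute`.)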
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('s', 35647),\n", - " ('d', 11139),\n", - " ('e', 10701),\n", - " ('y', 10071),\n", - " ('g', 9125),\n", - " ('r', 7643),\n", - " ('t', 5990),\n", - " ('n', 5261),\n", - " ('l', 3314),\n", - " ('c', 1819),\n", - " ('m', 1600),\n", - " ('a', 1398),\n", - " ('h', 1268),\n", - " ('k', 920),\n", - " ('p', 699),\n", - " ('o', 665),\n", - " ('i', 343),\n", - " ('w', 306),\n", - " ('x', 243),\n", - " ('f', 240),\n", - " ('b', 158),\n", - " ('u', 88),\n", - " ('z', 51),\n", - " ('v', 15),\n", - " ('j', 3),\n", - " ('q', 2)]" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "Counter(w[-1] for w in W).most_common() # last letters" + "precompute(W)" ] }, { @@ -602,32 +890,32 @@ { "data": { "text/plain": [ - "[('e', 108278),\n", - " ('s', 83059),\n", - " ('i', 81747),\n", - " ('a', 70762),\n", - " ('r', 68864),\n", - " ('n', 64756),\n", - " ('t', 61999),\n", - " ('o', 57205),\n", - " ('l', 50145),\n", - " ('c', 37734),\n", - " ('d', 34192),\n", - " ('u', 31126),\n", - " ('g', 26858),\n", - " ('p', 26294),\n", - " ('m', 25538),\n", - " ('h', 20654),\n", - " ('b', 18169),\n", - " ('y', 15752),\n", - " ('f', 12757),\n", - " ('v', 9670),\n", - " ('k', 8109),\n", - " ('w', 7971),\n", - " ('z', 4083),\n", - " ('x', 2666),\n", - " ('j', 1746),\n", - " ('q', 1689)]" + "[('s', 7388),\n", + " ('c', 5849),\n", + " ('p', 4977),\n", + " ('d', 4093),\n", + " ('r', 3811),\n", + " ('b', 3776),\n", + " ('a', 3528),\n", + " ('m', 3405),\n", + " ('t', 3097),\n", + " ('f', 2794),\n", + " ('i', 2771),\n", + " ('u', 2557),\n", + " ('e', 2470),\n", + " ('g', 2177),\n", + " ('h', 2169),\n", + " ('o', 1797),\n", + " ('l', 1634),\n", + " ('w', 1561),\n", + " ('n', 1542),\n", + " ('v', 1032),\n", + " ('j', 638),\n", + " ('k', 566),\n", + " ('q', 330),\n", + " ('y', 207),\n", + " ('z', 169),\n", + " ('x', 51)]" ] }, "execution_count": 23, @@ -636,121 +924,90 @@ } ], "source": [ - "Counter(L for w in W for L in w).most_common() # all letters" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# natalie\n", - "\n", - "The function `natalie` generates a portmantout proof for a set of words. The rough outline is:\n", - "\n", - " def natalie(W: Words) -> Proof:\n", - " precompute some data structures to make things more efficient\n", - " P = a proof, initially with just the first word, that we will build up\n", - " while there are nonsubwords that have not been used:\n", - " for (overlap, word) in (unused or bridging words that overlap):\n", - " append (overlap, word) to P and update data structures\n", - " return P\n", - " \n", - "There are two choices of how to pick a word to add to `P`:\n", - "- The function `unused_word` finds a word that has not been used yet that has a maximal overlap with the previous word. If there is such a word, we will always use it, and never consider reverting that choice. That's called a **greedy** approach, and it typically leads to solutions that are not optimal (the resulting $S$ is not the shortest possible) but are computationally feasible. It seems like finding a shortest $S$ is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find an optimal solution in a reasonable amount of time. So I'm happy with the greedy, suboptimal approach.\n", - "- The function `bridging_words` is called only when there is no way to add an unused word to the previous word. 
`bridging_words` returns a one- or two-word sequence (called a **bridge**) that will bring us to a place where we can again consume an unused word on the following iteration. \n", - "\n", - "`natalie` keeps track of the following data structures (which we will explain in more detail below):\n", - "- `unused: Words`: a set of the unused nonsubwords in $W$. When a word is added to $P$, it is removed from `unused`.\n", - "- `P: Proof`: e.g. `[(0, 'eskimo'), (4, 'kimono'),...]`, the proof that we are building up.\n", - "- `startswith: dict`: e.g. `startswith['kimo'] = ['kimono',...]` is a list of words that start with `'kimo'`. \n", - "- `firsts: Counter`: e.g. `firsts['a'] == 3528` is the number of unused words that start with the letter `a`.\n", - "- `bridges: dict`: e.g. `bridges['a' + 'q'] == ('airaq', [(1, 'air'), (2, 'iraq')])`, a description of a way to bridge from a word that ends in `'a'` to one that begins with `'q'`." + "Counter(w[0] for w in W.unused).most_common() # How many nonsubwords start with each letter?" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "[('s', 29056),\n", + " ('y', 8086),\n", + " ('d', 7520),\n", + " ('g', 6343),\n", + " ('e', 3215),\n", + " ('t', 2107),\n", + " ('r', 1994),\n", + " ('n', 1860),\n", + " ('l', 1182),\n", + " ('c', 908),\n", + " ('m', 657),\n", + " ('a', 384),\n", + " ('h', 351),\n", + " ('k', 157),\n", + " ('i', 128),\n", + " ('p', 123),\n", + " ('o', 113),\n", + " ('x', 68),\n", + " ('f', 51),\n", + " ('w', 42),\n", + " ('z', 21),\n", + " ('u', 11),\n", + " ('v', 6),\n", + " ('b', 6)]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "def natalie(W: Words, start=None) -> Proof:\n", - " \"\"\"Return a portmantout string for W, and a Proof for it.\"\"\"\n", - " prev_word = start or next(iter(W))\n", - " unused = W - (subwords(W) | {prev_word})\n", - " P: Proof = [(0, prev_word)] # The emerging Proof \n", - " startswith = compute_startswith(unused) # startswith['th'] = words that start with 'th'\n", - " firsts = Counter(word[0] for word in unused) # Count of first letters of words\n", - " bridges = compute_bridges(W) # Words that bridge from 'a' to 'b'\n", - " while unused:\n", - " for (overlap, word) in (unused_word(prev_word, startswith, unused) or\n", - " bridging_words(prev_word, firsts, bridges)):\n", - " if word not in W:\n", - " return [] # Fail\n", - " P.append((overlap, word))\n", - " if word in unused:\n", - " unused.remove(word)\n", - " firsts -= Counter(word[0])\n", - " prev_word = word\n", - " return P" + "Counter(w[-1] for w in W.unused).most_common() # How many nonsubwords end with each letter?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "(*Python trivia:* I say `firsts -= Counter(word[0])` rather than `first[word[0]] -= 1` because when the count for a letter reaches zero, the former deletes the letter from the Counter, whereas the latter just sets it to zero.)\n", + "Yes, there is a problem: there are 330 words that start with `q` and no nonsubwords that end in `q` (there are two subwords, `colloq` and `iraq`). There is also a problem with 29,056 nonsubwords ending in `s` and only 7,388 starting with `s`. 
But many words start with combinations like `as` or `es` or `ps`, whereas there are few chances for a `q` at the start of a word to match up with a `q` near the end of a word.\n", "\n", - "How do we know that `unused` will eventually be empty, so that the `while unused` loop can terminate? On each iteration we either use `unused_word`, which reduces the size of `unused`, or we use `bridging_words`, which doesn't. But after `bridging_words` adds the bridging word(s), we are guaranteed to be able to use an `unused` word on the next iteration (at least for the word set $W$; for smaller word sets `bridging_words` might return a non-word, which causes `natalie` to return the empty proof, indicating failure. \n", - "\n", - "The most important parts of `natalie` are the two functions `unused_word` and `bridging_words`, which decide what words will be added next to the emerging proof. Both return a list of the form `[(overlap, word),...]` that define a word or sequence of words to add to the proof, and the number of letters that each word overlaps the previous word. We will discuss the two functions in turn." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# unused_word and compute_startswith\n", - "\n", - "To select an `unused_word`, consider the suffixes of the previous word, longest suffix first, and if that suffix is a prefix of an unused word, then take it. `unused_word` returns either a list of length one, `[(overlap, word)]`, or the empty list, `[]` if no overlapping word can be found. " + "Here are all the words that have a `q` as one of their last three letters:" ] }, { "cell_type": "code", "execution_count": 25, - "metadata": {}, - "outputs": [], + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "('diplomatique faqir mystique relique mosque arabesque macaque maqui mozambique catafalque plaque brusque bisque unique obloquy manque perique applique claque boutique iraq grotesque cheque picaresque statuesque oblique opaque marque toque basque cinque obsequy aqua iraqis barque cinematheque critique odalisque albuquerque prosequi colloq tuque pulque pratique remarque baroque colloquy iraqi soliloquy technique burlesque placque discotheque hdqrs pique antique bosque cliquy semiopaque picturesque cirque romanesque torque yanqui ventriloquy casque clique physique masque bezique communique risque',\n", + " 72)" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "def unused_word(prev_word: str, startswith: dict, unused: Words) -> [(int, str),...]:\n", - " \"\"\"Return a [(overlap, word)] pair to follow `prev_word`, or [].\"\"\"\n", - " return next(([(len(suf), word)]\n", - " for suf in suffixes(prev_word) if suf in startswith\n", - " for word in startswith[suf] if word in unused), [])\n", - "\n", - "def suffixes(word) -> list:\n", - " \"\"\"All non-empty proper suffixes of word, longest first.\"\"\"\n", - " return [word[i:] for i in range(1, len(word))]\n", - "\n", - "def prefixes(word) -> list:\n", - " \"\"\"All non-empty proper prefixes of word, shortest first.\"\"\"\n", - " return [word[:i] for i in range(1, len(word))]\n", - "\n", - "def multimap(pairs) -> dict:\n", - " \"\"\"Given (key, val) pairs, make a dict of {key: [val,...]}.\"\"\"\n", - " result = defaultdict(list)\n", - " for key, val in pairs:\n", - " result[key].append(val)\n", - " return result\n", - "\n", - "def compute_startswith(W) -> dict: \n", - " \"\"\"A mapping of a prefix to a list of the words that 
start with it.\"\"\"\n", - " return multimap((pre, w) for w in W for pre in prefixes(w))" + "q3 = {w for w in W if 'q' in w[-3:]}\n", + "' '.join(q3), len(q3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - " So when the previous word is `'eskimo'`, a call to `unused_word` finds a word with a four-letter overlap; no other word overlaps more:" + "**My second question**: what are the most common steps in $P$? These will be bridge words. What do they have in common?" ] }, { @@ -761,7 +1018,31 @@ { "data": { "text/plain": [ - "[(4, 'kimono')]" + "[((1, 'so'), 2561),\n", + " ((1, 'sap'), 2536),\n", + " ((1, 'dab'), 2360),\n", + " ((1, 'sic'), 2223),\n", + " ((1, 'of'), 2039),\n", + " ((2, 'lyre'), 1643),\n", + " ((1, 'sun'), 1519),\n", + " ((1, 'sin'), 1400),\n", + " ((1, 'yam'), 867),\n", + " ((2, 'lye'), 734),\n", + " ((1, 'go'), 679),\n", + " ((1, 'yow'), 612),\n", + " ((1, 'spa'), 610),\n", + " ((1, 'econ'), 609),\n", + " ((1, 'gem'), 562),\n", + " ((1, 'gun'), 487),\n", + " ((1, 'yen'), 465),\n", + " ((3, 'erst'), 454),\n", + " ((2, 'type'), 447),\n", + " ((1, 'she'), 390),\n", + " ((1, 'you'), 371),\n", + " ((1, 'sex'), 324),\n", + " ((1, 'simp'), 317),\n", + " ((1, 'tv'), 312),\n", + " ((1, 'gal'), 297)]" ] }, "execution_count": 26, @@ -770,16 +1051,14 @@ } ], "source": [ - "startswith = compute_startswith(W)\n", - "\n", - "unused_word('eskimo', startswith, W)" + "Counter(P).most_common(25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Some examples of usage:" + "Even though `iraq` dominated the last 50 steps, that's not true of `P` overall. Instead, it looks like bridging away from `s` is a big concern (as expected by the letter counts). (Also, `lyre` and `lye` bridge from an adverb ending.) The following says that about 30% of all steps in `P` bridge from a suffix that contains `s`:" ] }, { @@ -790,20 +1069,7 @@ { "data": { "text/plain": [ - "{'eskimo': [(4, 'kimono')],\n", - " 'kimono': [(4, 'monomers')],\n", - " 'shoehorns': [(5, 'hornswoggling')],\n", - " 'elephant': [(5, 'phantasm')],\n", - " 'phantasies': [(4, 'siesta')],\n", - " 'dachshund': [(4, 'hundredfold')],\n", - " 'vicars': [(4, 'carsick')],\n", - " 'flimsiest': [(5, 'siesta')],\n", - " 'siesta': [(4, 'establisher')],\n", - " 'dilettantism': [(6, 'antismog')],\n", - " 'antismog': [(4, 'smoggier')],\n", - " 'seascape': [(5, 'scapes')],\n", - " 'snark': [(4, 'narks')],\n", - " 'referendum': [(3, 'dumpcart')]}" + "0.2971507548373379" ] }, "execution_count": 27, @@ -812,9 +1078,14 @@ } ], "source": [ - "{w: unused_word(w, startswith, W) \n", - " for w in '''eskimo kimono shoehorns elephant phantasies dachshund vicars \n", - " flimsiest siesta dilettantism antismog seascape snark referendum'''.split()}" + "sum('s' in w[:o] for o, w in P) / len(P)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**My third question:** What is the distribution of word lengths? What is the longest word? What is the distribution of letters?" 
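,
    "\n",
    "\n",
    "(One aggregate is quick to compute. As a sketch using the same `W.unused` set as in the cells above: `sum(map(len, W.unused)) / len(W.unused)` gives the mean word length, roughly 9.25 letters.)"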
] }, { @@ -825,7 +1096,28 @@ { "data": { "text/plain": [ - "['kimono', 'kimonos', 'kimonoed']" + "Counter({3: 2,\n", + " 4: 186,\n", + " 5: 1796,\n", + " 6: 4364,\n", + " 7: 8672,\n", + " 8: 11964,\n", + " 9: 11950,\n", + " 10: 8443,\n", + " 11: 6093,\n", + " 12: 4423,\n", + " 13: 2885,\n", + " 14: 1765,\n", + " 15: 1017,\n", + " 16: 469,\n", + " 17: 198,\n", + " 18: 91,\n", + " 19: 33,\n", + " 20: 22,\n", + " 21: 9,\n", + " 22: 4,\n", + " 23: 2,\n", + " 28: 1})" ] }, "execution_count": 28, @@ -834,7 +1126,7 @@ } ], "source": [ - "startswith['kimo']" + "Counter(sorted(map(len, W.unused))) # Counter of word lengths" ] }, { @@ -845,7 +1137,7 @@ { "data": { "text/plain": [ - "['imono', 'mono', 'ono', 'no', 'o']" + "'antidisestablishmentarianism'" ] }, "execution_count": 29, @@ -854,7 +1146,7 @@ } ], "source": [ - "suffixes('kimono')" + "max(W, key=len)" ] }, { @@ -865,7 +1157,32 @@ { "data": { "text/plain": [ - "['k', 'ki', 'kim', 'kimo', 'kimon']" + "[('e', 68038),\n", + " ('s', 60080),\n", + " ('i', 53340),\n", + " ('a', 43177),\n", + " ('n', 42145),\n", + " ('r', 41794),\n", + " ('t', 38093),\n", + " ('o', 35027),\n", + " ('l', 32356),\n", + " ('c', 23100),\n", + " ('d', 22448),\n", + " ('u', 19898),\n", + " ('g', 17815),\n", + " ('p', 16128),\n", + " ('m', 16062),\n", + " ('h', 12673),\n", + " ('y', 11889),\n", + " ('b', 11581),\n", + " ('f', 7885),\n", + " ('v', 5982),\n", + " ('k', 4892),\n", + " ('w', 4880),\n", + " ('z', 2703),\n", + " ('x', 1677),\n", + " ('j', 1076),\n", + " ('q', 1066)]" ] }, "execution_count": 30, @@ -874,7 +1191,14 @@ } ], "source": [ - "prefixes('kimono')" + "Counter(L for w in W.unused for L in w).most_common() # Counter of letters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**My fourth question**: How many bridges are there? How many excess letters do they have? What words do they use? " ] }, { @@ -885,33 +1209,7 @@ { "data": { "text/plain": [ - "{'uf': ['ufos', 'ufo'],\n", - " 'gj': ['gjetost', 'gjetosts'],\n", - " 'oj': ['ojibwas', 'ojibwa'],\n", - " 'ry': ['rye'],\n", - " 'pf': ['pfennig', 'pfennigs'],\n", - " 'yc': ['ycleped', 'yclept'],\n", - " 'zl': ['zloty', 'zlotys'],\n", - " 'mc': ['mcdonald'],\n", - " 'ez': ['ezekiel'],\n", - " 'fj': ['fjord', 'fjords'],\n", - " 'tc': ['tchaikovsky'],\n", - " 'xm': ['xmases', 'xmas'],\n", - " 'ie': ['ieee'],\n", - " 'dn': ['dnieper'],\n", - " 'ud': ['udder', 'udders'],\n", - " 'sf': ['sforzato', 'sforzatos'],\n", - " 'aj': ['ajar'],\n", - " 'ym': ['ymca'],\n", - " 'vy': ['vyingly', 'vying'],\n", - " 'qo': ['qophs', 'qoph'],\n", - " 'hd': ['hdqrs'],\n", - " 'zw': ['zwieback', 'zwiebacks'],\n", - " 'jn': ['jnana', 'jnanas'],\n", - " 'dv': ['dvorak'],\n", - " 'bw': ['bwana', 'bwanas'],\n", - " 'fb': ['fbi'],\n", - " 'ct': ['ctrl']}" + "56477" ] }, "execution_count": 31, @@ -920,17 +1218,9 @@ } ], "source": [ - "{pre: startswith[pre] # Rare two-letter prefixes\n", - " for pre in startswith if len(pre) == 2 and len(startswith[pre]) <= 2}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# bridging_words and compute_bridges\n", - "\n", - "Suppose we reach a situation where the previous word was `'one'`, and the only remaining unused words are `'two'`, `'three'`, and `'six'`. 
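(In code: `prev_word = 'one'` and `unused = {'two', 'three', 'six'}`, which is exactly the call tried below.) 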
Since there is no possible overlap, `unused_word` will return the empty list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "[]"
+       "[('s', 'laty', (0, (1, 'slaty'))),\n",
+       " ('p', 'ram', (0, (1, 'pram'))),\n",
+       " ('a', 'lpha', (0, (1, 'alpha'))),\n",
+       " ('d', 'ces', (1, (1, 'duces'))),\n",
+       " ('h', 'uts', (0, (1, 'huts'))),\n",
+       " ('ke', 'ab', (1, (2, 'kebab'))),\n",
+       " ('f', 'izz', (0, (1, 'fizz'))),\n",
+       " ('ho', 'wls', (0, (2, 'howls'))),\n",
+       " ('c', 'lo', (2, (1, 'cello'))),\n",
+       " ('g', 'ogo', (0, (1, 'gogo'))),\n",
+       " ('l', 'th', (1, (1, 'loth'))),\n",
+       " ('b', 'ola', (0, (1, 'bola'))),\n",
+       " ('ne', 'ro', (1, (2, 'negro'))),\n",
+       " ('riv', 'n', (1, (3, 'riven'))),\n",
+       " ('li', 'szt', (0, (2, 'liszt'))),\n",
+       " ('on', 'ces', (0, (2, 'onces'))),\n",
+       " ('na', 'l', (1, (2, 'nail'))),\n",
+       " ('ov', 'um', (0, (2, 'ovum'))),\n",
+       " ('br', 'ke', (1, (2, 'broke'))),\n",
+       " ('sti', 'le', (0, (3, 'stile'))),\n",
+       " ('ax', 'els', (0, (2, 'axels'))),\n",
+       " ('yea', 'n', (1, (3, 'yearn'))),\n",
+       " ('whel', 'p', (0, (4, 'whelp'))),\n",
+       " ('cabe', 'r', (0, (4, 'caber'))),\n",
+       " ('fal', 'ls', (0, (3, 'falls'))),\n",
+       " ('cza', 'r', (0, (3, 'czar'))),\n",
+       " ('snuc', 'k', (0, (4, 'snuck'))),\n",
+       " ('scen', 'e', (0, (4, 'scene'))),\n",
+       " ('apne', 'a', (0, (4, 'apnea')))]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "unused_word('one', startswith, {'two', 'three', 'six'})"
+    "B[::2000] # Sample every 2000th bridge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "However, we can still hope to find a previously-used word that will **bridge** from the `'e'` at the end of `'one'` to the `'t'` or `'s'` at the start of one of the unused words. The function `compute_bridges` precomputes a table that we call `bridges`, and the function `bridging_words` finds the appropriate bridging word(s) in that table.\n",
-    "\n",
-    "The `bridging_words` function is simple: in this example we fetch `bridges['e' + 't']` and `bridges['e' + 's']`, decide which gives us the shortest bridge, and return the list `[(overlap, word),...]` that makes up that bridge. For `unused_word` the list was always of length zero or one; for `bridging_words` the list is always of length one or two."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "To compute the bridges is a bit of work, but we only have to do it once. We want a table of `{A + B: (bridge, [(overlap, word),...])}` where `A` and `B` are the letters we want to bridge between, and `bridge` is either a single word or a portmanteau of two words. \n",
-    "\n",
-    "We start by selecting from $W$ all the short words, as well as words that end in an unusual letter. (For our 108,709 word set $W$, we selected the 10,273 words with length up to 5, plus 159 words that end in any of 'qujvz', the five rarest letters. For other word sets, you may have to tune these parameters.) We consider each of these individual words as a possible bridge between its first and last letters, keeping only the shortest. The following means that `'eat'` is a shortest possible word that starts with `'e'` and ends in `'t'`.\n",
-    "\n",
-    "    bridges['e' + 't'] == ('eat', [(1, 'eat')])\n",
-    "    \n",
-    "Sometimes we can't bridge with one word: no word starts with `'a'` and ends with `'q'`. 
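(A one-line check of that claim, as a sketch: `not any(w[0] == 'a' and w[-1] == 'q' for w in W)` comes out `True` for this word list.) 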
Thus, we also consider two-word bridges:\n",
-    "\n",
-    "    bridges['a' + 'q'] == ('airaq', [(1, 'air'), (2, 'iraq')])\n",
-    "    \n",
-    "which means that the shortest possible portmanteau that starts with `'a'` and ends in `'q'` is `'airaq'`, and the first word in that portmanteau is `'air'` and the second `'iraq'`.\n",
-    "\n",
-    "Here's the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({0: 37189, 1: 16708, 2: 2425, 3: 95, 4: 32, 5: 21, 6: 6, 8: 1})"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "def bridging_words(prev: str, firsts: Counter, bridges: dict) -> [(int, str),...]:\n",
-    "    \"\"\"Find a previously-used word that will bridge to one of the letters in firsts.\"\"\"\n",
-    "    A = prev[-1] # Ending letter of previous word\n",
-    "    (bridge, pairs) = min((bridges[A + B] for B in firsts), key=bridgelen)\n",
-    "    return pairs\n",
-    "\n",
-    "def bridgelen(bridge) -> int: return len(bridge[0])\n",
-    "    \n",
-    "def compute_bridges(W: Words, maxlen=5, endings=tuple('qujvz')) -> dict:\n",
-    "    \"\"\"A table of {A + B: (bridge, [(overlap, word),...])} pairs that bridge letter A to letter B, \n",
-    "    e.g. {'e'+'t': ('eat', [(1, 'eat')]), 'a'+'q': ('airaq', [(1, 'air'), (2, 'iraq')])}.\"\"\"\n",
-    "    long = '?' * 29 # A default long \"word\"\n",
-    "    bridges = {A + B: (long, [(1, long)]) for A in alphabet for B in alphabet}\n",
-    "    shortwords = [w for w in W if len(w) <= maxlen or w.endswith(endings)]\n",
-    "    startswith = compute_startswith(shortwords)\n",
-    "    \n",
-    "    def consider(bridge, *pairs):\n",
-    "        \"\"\"Use this bridge if it is shorter than the previous bridges[AB].\"\"\"\n",
-    "        AB = bridge[0] + bridge[-1]\n",
-    "        bridges[AB] = min(bridges[AB], (bridge, [*pairs]), key=bridgelen)\n",
-    "    \n",
-    "    for w1 in shortwords:\n",
-    "        consider(w1, (1, w1)) # One-word bridges\n",
-    "        for suf in suffixes(w1): \n",
-    "            for w2 in startswith[suf]:\n",
-    "                bridge = w1 + w2[len(suf):]\n",
-    "                consider(bridge, (1, w1), (len(suf), w2)) # Two-word bridges\n",
-    "    return bridges\n",
-    "\n",
-    "alphabet = 'abcdefghijklmnopqrstuvwxyz'"
+    "# Counter of bridge excess letters\n",
+    "Counter(x for (_, _, (x, *_)) in B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Here is how `bridging_words` works in our example situation. 
First the necessary data structures `firsts` and `bridges`:" + "# Counter of bridge excess letters\n", + "Counter(x for (_, _, (x, *_)) in B)" ] }, { @@ -1032,18 +1298,21 @@ "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 7.17 s, sys: 75 ms, total: 7.24 s\n", - "Wall time: 7.36 s\n" - ] + "data": { + "text/plain": [ + "0.3916638631655364" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "firsts = Counter(t=2, s=2) # Counts of the first letters of unused words in this situation\n", + "def average(counter):\n", + " return sum(x * counter[x] for x in counter) / sum(counter.values())\n", "\n", - "%time bridges = compute_bridges(W)" + "average(_) # Average excess across all bridges" ] }, { @@ -1054,7 +1323,7 @@ { "data": { "text/plain": [ - "[(1, 'eat')]" + "Counter({1: 56327, 2: 150})" ] }, "execution_count": 35, @@ -1063,25 +1332,64 @@ } ], "source": [ - "bridging_words('one', firsts, bridges)" + "# How many 1-step and 2-step bridges are there?\n", + "Counter(len(steps) for (_, _, (_, *steps)) in B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "That says that we should add the word `'eat'`, which overlaps one letter with `one`. After adding `'eat'`, we'll be set up to add one of the unused words on the next step. Here's some of the calculations that got us to `'eat'`:" + "**My fifth question**: What strange letter combinations are there? Let's look at two-letter suffixes or prefixes that only appear in one or two nonsubwords. " ] }, { "cell_type": "code", "execution_count": 36, - "metadata": {}, + "metadata": { + "scrolled": false + }, "outputs": [ { "data": { "text/plain": [ - "('eat', [(1, 'eat')])" + "{'jn': {'jnanas'},\n", + " 'dv': {'dvorak'},\n", + " 'if': {'iffiness'},\n", + " 'ym': {'ymca'},\n", + " 'kw': {'kwachas', 'kwashiorkor'},\n", + " 'fj': {'fjords'},\n", + " 'ek': {'ekistics'},\n", + " 'aj': {'ajar'},\n", + " 'xi': {'xiphoids', 'xiphosuran'},\n", + " 'sf': {'sforzatos'},\n", + " 'yc': {'ycleped', 'yclept'},\n", + " 'hd': {'hdqrs'},\n", + " 'dn': {'dnieper'},\n", + " 'ip': {'ipecacs'},\n", + " 'ee': {'eelgrasses', 'eelworm'},\n", + " 'qa': {'qaids', 'qatar'},\n", + " 'ie': {'ieee'},\n", + " 'oj': {'ojibwas'},\n", + " 'pf': {'pfennigs'},\n", + " 'wu': {'wurzel'},\n", + " 'uf': {'ufos'},\n", + " 'ik': {'ikebanas', 'ikons'},\n", + " 'tc': {'tchaikovsky'},\n", + " 'bw': {'bwanas'},\n", + " 'zw': {'zwiebacks'},\n", + " 'gj': {'gjetosts'},\n", + " 'iv': {'ivories', 'ivory'},\n", + " 'xm': {'xmases'},\n", + " 'zl': {'zlotys'},\n", + " 'll': {'llamas', 'llanos'},\n", + " 'ct': {'ctrl'},\n", + " 'qo': {'qophs'},\n", + " 'gw': {'gweducks', 'gweducs'},\n", + " 'ez': {'ezekiel'},\n", + " 'mc': {'mcdonald'},\n", + " 'ay': {'ayahs', 'ayatollahs'},\n", + " 'fb': {'fbi'}}" ] }, "execution_count": 36, @@ -1090,7 +1398,8 @@ } ], "source": [ - "bridges['e' + 't']" + "{pre: W.startswith[pre] # Rare two-letter prefixes\n", + " for pre in W.startswith if len(pre) == 2 and len(W.startswith[pre]) in (1, 2)}" ] }, { @@ -1101,7 +1410,73 @@ { "data": { "text/plain": [ - "('ells', [(1, 'ells')])" + "{'lm': {'stockholm', 'unhelm'},\n", + " 'yx': {'styx'},\n", + " 'ao': {'chiao', 'ciao'},\n", + " 'oe': {'monroe'},\n", + " 'oi': {'hanoi', 'polloi'},\n", + " 'tl': {'peyotl', 'shtetl'},\n", + " 'nx': {'bronx', 'meninx'},\n", + " 'rf': {'waldorf', 'windsurf'},\n", + " 'wa': {'kiowa', 'okinawa'},\n", + " 'lu': {'honolulu'},\n", + " 'ho': {'groucho'},\n", + " 'bm': {'ibm', 
'icbm'},\n",
    " 'vo': {'concavo'},\n",
    " 'zo': {'diazo', 'palazzo'},\n",
    " 'ud': {'aloud', 'overproud'},\n",
    " 'pa': {'tampa'},\n",
    " 'xo': {'convexo'},\n",
    " 'hr': {'kieselguhr'},\n",
    " 'hm': {'microhm'},\n",
    " 'ef': {'unicef'},\n",
    " 'rb': {'cowherb'},\n",
    " 'ji': {'fiji'},\n",
    " 'ep': {'asleep', 'shlep'},\n",
    " 'td': {'retd'},\n",
    " 'po': {'troppo'},\n",
    " 'gm': {'apophthegm'},\n",
    " 'ub': {'beelzebub'},\n",
    " 'ku': {'haiku'},\n",
    " 'hu': {'buchu'},\n",
    " 'xe': {'deluxe', 'maxixe'},\n",
    " 'gn': {'champaign'},\n",
    " 'ug': {'bedrug', 'sparkplug'},\n",
    " 'ec': {'filespec', 'quebec'},\n",
    " 'nu': {'vishnu'},\n",
    " 'ru': {'nehru'},\n",
    " 'mb': {'clomb', 'whitecomb'},\n",
    " 'ui': {'maqui', 'prosequi'},\n",
    " 'sr': {'ussr'},\n",
    " 'ln': {'lincoln'},\n",
    " 'xs': {'duplexs'},\n",
    " 'mp': {'prestamp'},\n",
    " 'ab': {'skylab'},\n",
    " 'hn': {'mendelssohn'},\n",
    " 'cd': {'recd'},\n",
    " 'uc': {'caoutchouc'},\n",
    " 'dt': {'rembrandt'},\n",
    " 'nc': {'dezinc', 'quidnunc'},\n",
    " 'sz': {'grosz'},\n",
    " 'we': {'zimbabwe'},\n",
    " 'ai': {'bonsai'},\n",
    " 'mt': {'daydreamt', 'undreamt'},\n",
    " 'zt': {'liszt'},\n",
    " 'ua': {'joshua'},\n",
    " 'aa': {'markkaa'},\n",
    " 'fa': {'khalifa'},\n",
    " 'ob': {'blowjob'},\n",
    " 'ko': {'gingko', 'stinko'},\n",
    " 'zm': {'transcendentalizm'},\n",
    " 'dn': {'haydn'},\n",
    " 'oz': {'kolkhoz'},\n",
    " 'eh': {'mikveh', 'yahweh'},\n",
    " 'tu': {'impromptu'},\n",
    " 'za': {'organza'},\n",
    " 'su': {'shiatsu'},\n",
    " 'vt': {'govt'},\n",
    " 'ou': {'thankyou'},\n",
    " 'nz': {'franz'}}"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "bridges['e' + 's']"
+    "endswith = multimap((w[-2:], w) for w in W.unused)\n",
+    "\n",
+    "{suf: endswith[suf] # Rare two-letter suffixes\n",
+    " for suf in endswith if len(endswith[suf]) in (1, 2)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two-letter prefixes definitely include some strange words.\n",
    "\n",
    "The list of two-letter suffixes mostly points out flaws in the word list. For example, lots of words end in `ab`: blab, cab, jab, lab, etc. But most of them are subwords (of blabs, cabs, jabs, labs, etc.); only `skylab` made it into the word list in singular form but not plural."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison to Tom Murphy's Program\n",
    "\n",
    "To compare my [program](portman.py) to [Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): I used a greedy approach that incrementally builds up a single long portmanteau, extending it via a bridge when necessary. Murphy first built a pool of smaller portmanteaux, then joined them all together. I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb), where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments. The two approaches are different, but it is not clear whether one is better than the other. You could try it!\n",
    "\n",
    "(*English trivia:* my program builds a single path of words, and when the path gets stuck and needs something to allow it to continue, it makes sense to call that thing a **bridge**. 
Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. The different metaphors for what our programs are doing lead to different terminology for the same idea.)\n", + "\n", + "In terms of implementation, mine is in Python and is concise (139 lines); Murphy's is in C++ and is verbose (1867 lines), although Murphy's code does a lot of extra work that mine doesn't: generating diagrams and animations, and running multiple threads in parallel to implement the random restart idea. \n", + "\n", + "It appears Murphy didn't quite have the complete concept of **subwords**. He did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'` in two different places. My guess is that when he adds `'spectaculars'` he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. Support for this view is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs\", but I computed that there are 44,320 such subwords; he found about half of them. I think those missing 20,001 words are the main reason why my strings are coming in at around 554,000 letters, about 57,000 letters shorter than Murphy's 611,820 letters.\n", + "\n", + "Also, Murphy's joins are always between one-letter prefixes and suffixes. I do the same thing for two-word bridges, because having a `W.bridges[A][B]` for every letter `A` and `B` is the easiest way to prove that the program will terminate. But for one-word bridges, I allow prefixes and suffixes of any length up to a total of 6 for `len(pre) + len(suf)`. I can get away with this because I limited my candidate pool to the 10,000 `shortwords`. It would have been untenable to build all bridges for all 100,000 words, and probably would not have helped shorten $S$ appreciably.\n", + "\n", + "*Note 2:* I should say that I stole one important trick from Murphy. I started watching his highly-entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI), but then I paused it because I wanted the fun of solving the problem mostly on my own. After I finished the first version of my program, I returned to the video and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that I had a problem in my use of bridges. My program originally looked something like this: \n", + "\n", + " (overlap, word) = unused_step(...) or one_word_bridge(...) or two_word_bridge(...)\n", + " \n", + "That is, I only considered two-word bridges when there was no one-word bridge, on the theory that one word is shorter than two. But Murphy showed that my theory was wrong: I had `bridges['w']['c'] = 'workaholic'`, a one-word bridge, but he had the two-word bridge `'war' + 'arc' = 'warc'`, which saves six letters over my single word. After seeing that, I shamelessly copied his approach, and now I too get a four-letter bridge for `'w' + 'c'` (sometimes `'warc'` and sometimes `'we' + 'etc' = 'wetc'`)." 
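,
    "\n",
    "\n",
    "To make the two-word idea concrete, here is a minimal sketch (not the actual code in [portman.py](portman.py); it assumes a `startswith` multimap over the small candidate pool) of how a bridge like `'warc'` can be found:\n",
    "\n",
    "    def two_word_bridges(A, B, shortwords, startswith):\n",
    "        \"\"\"Yield (excess, (1, w1), (overlap, w2)) bridges from letter A to letter B.\"\"\"\n",
    "        for w1 in shortwords:\n",
    "            if w1.startswith(A):\n",
    "                for i in range(1, len(w1)):\n",
    "                    suf = w1[i:]  # suffix of w1 that must also be a prefix of w2\n",
    "                    for w2 in startswith.get(suf, ()):\n",
    "                        if w2.endswith(B):\n",
    "                            bridge = w1 + w2[len(suf):]\n",
    "                            yield (len(bridge) - 2, (1, w1), (len(suf), w2))\n",
    "\n",
    "    min(two_word_bridges('w', 'c', ['war'], {'ar': ['arc']}))  # (2, (1, 'war'), (2, 'arc'))"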
] }, { @@ -1121,7 +1531,7 @@ { "data": { "text/plain": [ - "('eat', [(1, 'eat')])" + "(2, (1, 'war'), (2, 'arc'))" ] }, "execution_count": 38, @@ -1130,516 +1540,37 @@ } ], "source": [ - "A = 'e'\n", - "min((bridges[A + B] for B in firsts), key=bridgelen)" + "W.bridges['w']['c']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Some more examples of bridges:" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "('airaq', [(1, 'air'), (2, 'iraq')])" - ] - }, - "execution_count": 39, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "bridges['a' + 'q']" - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "('you', [(1, 'you')])" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "bridges['y' + 'u']" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "('xericolloq', [(1, 'xeric'), (1, 'colloq')])" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "max(bridges.values(), key=bridgelen)" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[(3, 261),\n", - " (4, 256),\n", - " (5, 69),\n", - " (6, 34),\n", - " (2, 25),\n", - " (7, 21),\n", - " (8, 7),\n", - " (1, 2),\n", - " (10, 1)]" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Counter(map(bridgelen, bridges.values())).most_common() # How long are bridges?" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'bq': ('bocciraq', [(1, 'bocci'), (1, 'iraq')]),\n", - " 'gq': ('geniiraq', [(1, 'genii'), (1, 'iraq')]),\n", - " 'jq': ('jinniraq', [(1, 'jinni'), (1, 'iraq')]),\n", - " 'oq': ('obeliraq', [(1, 'obeli'), (1, 'iraq')]),\n", - " 'qq': ('quasiraq', [(1, 'quasi'), (1, 'iraq')]),\n", - " 'vj': ('vetchajj', [(1, 'vetch'), (1, 'hajj')]),\n", - " 'xj': ('xericonj', [(1, 'xeric'), (1, 'conj')]),\n", - " 'xq': ('xericolloq', [(1, 'xeric'), (1, 'colloq')])}" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "{AB: bridges[AB] for AB in bridges if bridgelen(bridges[AB]) >= 8} # Longest bridges" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Note 1:* My `compute_bridges` only does one-letter-to-one-letter bridges. It just seemed simpler to only fill in a 26×26 `bridges` table, and only maintain 26 entries in `firsts`. But that means sometimes we get a long bridge when we could have found a shorter one with more work. Here's an example where the previous word is `cogito` and there's only one word left, `'question'`, which starts with `'q'`. 
We end up with the bridge `'obeliraq'`, adding 6 letters between the `'o'` and `'q'`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 'obeli'), (1, 'iraq')]"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bridging_words('cogito', {'q': 1}, bridges)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But if we had arranged for multi-letter-to-multi-letter bridges, we could have noticed that the bridge `'toque'` adds *zero* letters between the `'to'` at the end of `'cogito'` and the `'que'` at the start of `'question'`. How could I find `'toque'` in this situation? One approach would be to precompute a `bridges` table with up to two letters on the left and three on the right. That would mean about $26^5 = 12$ million table entries (although most of them will be empty). Another approach would be to call `unused_word`, but give it a selection of words that have been used before, and then check if it ends up in a place that allows us to make progress on the next move. On the one hand, this seems like a promising idea to explore. On the other hand, I did a quick measurement, and the average number of letters added per bridge under the current algorithm is just about 1, so there's not that much room to improve. Maybe this approach could reduce the number of letters in $S$ by 1%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note 2:* I should say that I started watching Tom Murphy's highly-entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI), but then I paused it because I wanted the fun of solving the problem mostly on my own. After I finished the first version of my program, I returned to the video and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that \n",
    "Murphy had an approach to bridges that was very similar to mine, but in one way clearly superior: in my original program I only considered two-word bridges when there was no one-word bridge, on the theory that one word is shorter than two. My program looked something like:\n",
    "\n",
    "    (overlap, word) = unused_word(...) or one_word_bridge(...) or two_word_bridge(...)\n",
    "    \n",
    "But Murphy showed that my theory was wrong: I had `bridges['w' + 'c'] = 'workaholic'`, but he had `'warc'`, a portmanteau of `'war'` and `'arc'`, which saves six letters over my single word. After seeing this, I shamelessly copied his approach, and now I too get a four-letter bridge for `'w' + 'c'` (sometimes `'warc'` and sometimes `'wet' + 'etc' = 'wetc'`), and my portmantout strings are about 0.5% shorter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('wetc', [(1, 'wet'), (2, 'etc')])"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bridges['w' + 'c']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(*English trivia:* my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a **bridge**. Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. 
The different metaphors for what our programs are doing lead to different terminology for the same idea.)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# A Solution\n", + "I'll stop here, but you should feel free to do more experimentation of your own. \n", "\n", - "We're finally ready to solve problems! First the tiny `W2` word set, for which we must specify the starting word, or it will fail:" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[(0, 'eskimo'),\n", - " (4, 'kimono'),\n", - " (4, 'monogram'),\n", - " (4, 'grammarian'),\n", - " (2, 'anarchy')]" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "natalie(W2, start='eskimo')" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'eskimonogrammarianarchy'" - ] - }, - "execution_count": 47, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "portman(natalie(W2, start='eskimo'), W2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now the big `W` word set, which works fine when it selects a starting word on its own, but, following Tom Murphy and doing him one letter better, I will supply the starting word `portmanteaux`:" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 24.4 s, sys: 89.4 ms, total: 24.5 s\n", - "Wall time: 24.6 s\n" - ] - }, - { - "data": { - "text/plain": [ - "570002" - ] - }, - "execution_count": 48, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "P = natalie(W, start='portmanteaux')\n", - "S = portman(P, W)\n", - "len(S)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A little over half a million letters long. 
Here is the start of the string $S$:" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'portmanteauxiliariespritselfwardressestetsonshipsideswipingrassessorshipkeepergnestlikeneditorializationshorelinesmantillascarsonickeledematouslessoneditorializestfullynchessmanganousefulnessesquicentenniallylstonedifiersatzestygiantismshirtwaistlinesmenadsorptionosphericitywardenshipbuilderstwhiledgaroteddyingsinewsstandstilliestablisheriffdomicilingamsterdamnitrosemariestablishmentsaritzaspersersafaristocraciestimatorsionallyratelysianthologizedswampinessentiallyonnaisecularizershankingwoodwax'" - ] - }, - "execution_count": 49, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "S[:500]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It is easier to grasp by looking at the first 50 items in $P$:" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[(0, 'portmanteaux'),\n", - " (3, 'auxiliaries'),\n", - " (2, 'esprits'),\n", - " (3, 'itself'),\n", - " (4, 'selfward'),\n", - " (4, 'wardresses'),\n", - " (3, 'sestets'),\n", - " (5, 'stetsons'),\n", - " (4, 'sonships'),\n", - " (5, 'shipside'),\n", - " (4, 'sideswiping'),\n", - " (4, 'pingrasses'),\n", - " (5, 'assessorship'),\n", - " (4, 'shipkeeper'),\n", - " (4, 'epergnes'),\n", - " (3, 'nestlike'),\n", - " (4, 'likened'),\n", - " (2, 'editorializations'),\n", - " (3, 'onshore'),\n", - " (5, 'shorelines'),\n", - " (5, 'linesman'),\n", - " (3, 'mantillas'),\n", - " (3, 'lascars'),\n", - " (4, 'carson'),\n", - " (5, 'arsonic'),\n", - " (3, 'nickeled'),\n", - " (2, 'edematous'),\n", - " (4, 'tousles'),\n", - " (3, 'lessoned'),\n", - " (2, 'editorializes'),\n", - " (3, 'zestfully'),\n", - " (2, 'lynches'),\n", - " (4, 'chessman'),\n", - " (3, 'manganous'),\n", - " (2, 'usefulness'),\n", - " (7, 'fulnesses'),\n", - " (3, 'sesquicentennially'),\n", - " (4, 'allyls'),\n", - " (1, 'stoned'),\n", - " (2, 'edifiers'),\n", - " (3, 'ersatzes'),\n", - " (3, 'zesty'),\n", - " (3, 'stygian'),\n", - " (4, 'giantisms'),\n", - " (1, 'shirtwaist'),\n", - " (5, 'waistlines'),\n", - " (5, 'linesmen'),\n", - " (3, 'menads'),\n", - " (3, 'adsorption'),\n", - " (3, 'ionospheric')]" - ] - }, - "execution_count": 50, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "P[:50]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "I'll write the whole string to the file [natalie.txt](natalie.txt):" - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "570002" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "open('natalie.txt', 'w').write(S)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can create a report on how we did:" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [], - "source": [ - "def report(S, W, P):\n", - " sub = subwords(W)\n", - " nonsub = W - sub\n", - " bridge = len(P) - len(nonsub)\n", - " def Len(words): return sum(map(len, words))\n", - " print(f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords; {len(sub):,d} subwords).')\n", - " print(f'P has {len(P):,d} words ({len(nonsub):,d} nonsubwords; {bridge:,d} bridge words).')\n", - " print(f'S has {len(S):,d} letters; W has {Len(W):,d}; 
nonsubs have {Len(nonsub):,d}.')\n", - " print(f'The average overlap in P is {(Len(w for _,w in P)-len(S))/(len(P)-1):.2f} letters.')\n", - " print(f'The compression ratio (W/S) is {Len(W)/len(S):.2f}.')" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "W has 108,709 words (64,389 nonsubwords; 44,320 subwords).\n", - "P has 105,009 words (64,389 nonsubwords; 40,620 bridge words).\n", - "S has 570,002 letters; W has 931,823; nonsubs have 595,805.\n", - "The average overlap in P is 1.34 letters.\n", - "The compression ratio (W/S) is 1.63.\n" - ] - } - ], - "source": [ - "report(S, W, P)" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "W has 22 words (5 nonsubwords; 17 subwords).\n", - "P has 5 words (5 nonsubwords; 0 bridge words).\n", - "S has 23 letters; W has 90; nonsubs have 37.\n", - "The average overlap in P is 3.50 letters.\n", - "The compression ratio (W/S) is 3.91.\n" - ] - } - ], - "source": [ - "report(S1, W2, P1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Next Steps\n", + "Here are some things you could do to make the portmantouts more interesting:\n", "\n", - "Each time I restart this notebook, I get a slightly different result. (*Python trivia*: every time you start a new `python` instance, you get a slightly different [`hash`](https://docs.python.org/3.4/reference/datamodel.html#object.__hash__) function, which means that an iteration over a set may yield elements in a different order. This is to prevent a kind of denial of service attack on web servers, but for my program it means that the iteration over `unused` words is in a different order, so the results are different.) I had originally planned to use the **random restart** approach: add some randomness to the word selections and take the best of multiple trials. However, all the trials I have seen so far with my current program fall within a narrow range of 570,000 to 578,000 letters, so I think it is not worth the effort for what would probably be a 1% improvement at best. \n", - "\n", - "To compare my [program](portman.py) to [Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): I used a greedy approach that incrementally builds up a single long portmanteau, extending it via a bridge when necessary. Murphy first built a pool of smaller portmanteaux, then joined them all together. (I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb) where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments.) The two approaches are different, but it is not clear whether one is better than the other. \n", - "\n", - "In terms of implementation, mine is in Python and is concise (112 lines); Murphy's is in C++ and is verbose (1867 lines). Murphy's code does a lot of extra work that mine doesn't: generating diagrams, and running multiple threads in parallel to implement the random restart idea. 
It appears Murphy didn't quite have the complete concept of **subwords**: he did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'` in two different places. My guess is that when he adds `'spectaculars'` he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. Support for this view is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs\", but I computed that there are 44,320 such words; he found about half of them. I think those missing 20,001 words are the main reason why my shortest string ([570,002 letters](natalie570002.txt)) is about 40,000 letters shorter than his (611,820 letters).\n", - "\n", - "I'll stop here, but you should feel free to do some experimentation of your own. Some ideas:\n", - "\n", - "- With the set of 108,709 words it is always possible to bridge from any letter to any other letter in at most two steps. But for smaller word lists, that might not be the case. You could consider three-word bridges, and consider what to do when there is no bridging sequence at all (perhaps back up and remove a previously-placed word; perhaps use a beam search to keep several alternatives open at once; perhaps allow the addition of words to the start as well as the end of `P`).\n", "- Use linguistic resources (such as [pretrained word embeddings](https://nlp.stanford.edu/projects/glove/)) to teach your program what words are related to each other. Encourage the program to place related words next to each other.\n", - "- Use linguistic resources (such as [NLTK](https://github.com/nltk/)) to teach your program where syllable breaks are in words. Encourage the program to make overlaps match syllables. (That's why \"preferendumdums\" sounds better than \"fortyphonshore\".)\n", - "- Here are some ideas to minimize the length of $S$. Can you implement these or think of your own?\n", - " - Get better overlap of bridging words, perhaps by implementing the multi-letter-to-multi-letter approach.\n", - " - Use fewer bridging words by planning ahead to avoid the need for them. Perhaps `compute_startswith` could sort the words in each key's bucket so that the \"difficult\" words (say, the ones that end in unusual letters) are encountered earlier in the program's execution, when there are more available words for them to connect to.\n", - " - You can't alter the number of nonsubwords, but you might be able to get better overlap between them. Perhaps make more informed choices of which ones to use when. For example, if there is an affix that is the prefix of only one word, and the suffix of only one other word, then probably those two words should overlap.\n", - " - Alter either Murphy's code or mine to combine his idea of joining particles with my idea of completely handling subwords.\n", - " - Use a completely different strategy. What can you come up with?" + "- Use linguistic resources (such as [NLTK](https://github.com/nltk/)) to teach your program where syllable breaks are in words, and what each syllable sounds like. Encourage the program to make overlaps match syllables. (That's why \"preferendumdums\" sounds better than \"fortyphonshore\".)\n", + "\n", + "Here are some things you could do to make $S$ shorter:\n", + "\n", + "- **Lookahead**: Unused words are chosen based on the degree of overlap, but nothing else. 
It might help to prefer unused words which have a suffix that matches the prefix of another unused word. A single-word lookahead or a beam search could be used.\n",
    "\n",
    "- **Reserving words**: It seems like `haydn` and `dnieper` are made to go together; they're the only words with `dn` as an affix. If `haydn` was selected first, `unused_step` would select `dnieper` next, but if `dnieper` was selected first, we'd lose the chance to take advantage of that overlap. Maybe there could be a system that assures `haydn` comes first, or a preprocessing step that joins together words that uniquely go together. This is getting close to what Murphy did in his program.\n",
    "\n",
    "- **Word choice ordering**: Perhaps `compute_startswith` could sort the words in each key's bucket so that the \"difficult\" words (say, the ones that end in unusual letters) are encountered earlier in the program's execution, when there are more available words for them to connect to.\n",
    " \n",
    "Here are some things you could do to make the program more robust:\n",
    "\n",
    "- Write and run unit tests.\n",
    "\n",
    "- Find other word lists and try the program on them.\n",
    "\n",
    "- Consider what to do for a word set that has missing bridges. You could try three-word bridges; you could allow the program to back up and remove a previously-placed word; or you could allow the addition of words to the start as well as the end of `P`."
   ]
  }
 ],