diff --git a/ipynb/Portmantout.ipynb b/ipynb/Portmantout.ipynb index 5a43452..2a8d169 100644 --- a/ipynb/Portmantout.ipynb +++ b/ipynb/Portmantout.ipynb @@ -11,46 +11,47 @@ "A [***portmanteau***](https://en.wikipedia.org/wiki/Portmanteau) is a word that squishes together two words, like *math* + *athlete* = *mathlete*. Inspired by [**Darius Bacon**](http://wry.me/), I covered this as a programming exercise in my 2012 [**Udacity course**](https://www.udacity.com/course/design-of-computer-programs--cs212). In 2018 I was re-inspired by [**Tom Murphy VII**](http://tom7.org/), who added a new twist: [***portmantout words***](http://www.cs.cmu.edu/~tom7/portmantout/) (*tout* from the French for *all*), which are defined as:\n", "\n", "> A **portmantout** of a set of words $W$ is a string $S$ such that:\n", - "* Every word in $W$ is a **substring** of $S$.\n", - "* The words **overlap**: every word (except the first) starts at an index that is equal to or before the end of another word.\n", - "* **Nothing else** is in $S$: every letter in $S$ comes from the overlapping words. (But a word may be repeated any number of times.)\n", + "* Every word in $W$ is a **substring** of $S$. (Although a word may appear more than once in $S$.)\n", + "* The words **overlap**: the start of every word (except the first) is equal to or before the end of another word.\n", + "* **Nothing else** is in $S$: every letter in $S$ comes from the overlapping words. \n", "\n", - "Although not part of the definition, the goal is to get as short an $S$ as possible, and to do it for a set of about 100,000 words. Developing a program to do that is the goal of this notebook. My program (also available as [`portman.py`](portman.py)) helped me discover:\n", + "This notebook attempts to find a portmantout that is as short as possible, while covering a set $W$ of over 100,000 words. My program discovered:\n", + "\n", + "- **preferendumdums**: a political commentary
portmanteau of {prefer, referendum, dumdums}\n", + "- **impromptutankhamenability**: a willingness to see the Egyptian exhibit on the spur of the moment
portmanteau of {impromptu, tutankhamen, amenability}\n",
+    "- **dashikimonogrammarianarchy**: chaos when a linguist inscribes African/Japanese garb
portmanteau of {dashiki, kimono, monogram, grammarian, anarchy}\n", + "- **allegestionstage**: a brutal theatre review
portmanteau of {alleges, egestions, onstage}\n", + "- **skymanipulablearsplittinglers**: a nerve-damaging aviator
portmanteau of {skyman, manipulable, blears, earsplitting, tinglers}\n", + "- **edinburgherselflesslylyricize**: a Scottish music review
portmanteau of {edinburgh, burghers, herself, selflessly, slyly, lyricize}\n", + "- **fortyphonshore**: a dire weather report
portmanteau of {forty, typhons, onshore} \n", "\n", - "- **preferendumdums**: a political commentary portmanteau of {prefer, referendum, dumdums}\n", - "- **fortyphonshore**: a dire weather report portmanteau of {forty, typhons, onshore}; \n", - "- **allegestionstage**: a brutal theatre critic portmanteau of {alleges, egestions, onstage}. \n", - "- **skymanipulablearsplittingler**: a nerve-damaging aviator portmanteau of {skyman, manipulable, blears, earsplitting, tinglers}\n", - "- **edinburgherselflesslylyricize**: a Scottish music review portmanteau of {edinburgh, burghers, herself, selflessly, slyly, lyricize}\n", - "- **impromptutankhamenability**: a portmanteau of {impromptu, tutankhamen, amenability}, indicating a willingness to see the Egyptian exhibit on the spur of the moment.\n", - "- **dashikimonogrammarianarchy**: a portmanteau of {dashiki, kimono, monogram, grammarian, anarchy}, describing the chaos that ensues when a linguist gets involved in choosing how to enscribe African/Japanese garb. \n", "\n", "# Problem-Solving Strategy\n", "\n", - "Although I haven't proved it, my intuition is that finding a shortest $S$ is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find an optimal solution in a reasonable amount of time. A common approach to NP-hard problems is a **greedy algorithm**: make the locally best choice at each step, in the hope that the steps will fit together into a solution that is not too far from the globally best solution. \n", + "Although I haven't proved it, my intuition is that finding a shortest $S$ is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find an optimal solution in a reasonable amount of time. A common approach to NP-hard problems is a **greedy algorithm**: make the locally best choice at each step, in the hope that the steps will fit together into a solution that is not too far from the globally shortest solution. \n", "\n", "Thus, my approach will be to build up a **path**, starting with one word, and then adding **steps** to the end of the evolving path, one at a time. Each step consists of a word from *W* that overlaps the end of the previous word by at least one letter. I will choose the step that seems to be the best choice at the time (the one with the **lowest cost**), and will never undo a step, even if the path seems to get stuck later on. I distinguish two types of steps:\n", "\n", "- **Unused word step**: using a word for the first time. (Once we use them all, we're done.)\n", - "- **Bridging word step**: if no remaining unused word overlaps the previous word, we need to do something to get back on track to consuming unused words. I call that something a **bridge**: a step that **repeats** a previously-used word and has a word ending that matches the start of some unused word. Repeating a word may add letters to the final length of $S$, and it doesn't get us closer to the requirement of using all the words, but it is sometimes necessary. Sometimes two words are required to build a bridge, but never more than two (at least with the 100,000 word set we will be dealing with). \n", + "- **Bridging word step**: if no remaining unused words overlap the previous word, we need to do something to get back on track to consuming unused words. I call that something a **bridge**: a step that **repeats** a previously-used word and has an ending that matches the start of some unused word. 
Repeating a word may add letters to the final length of $S$, and it doesn't get us closer to the requirement of using all the words, but it is sometimes necessary. Sometimes two words are required to build a bridge, but never more than two (at least with the 100,000 word set we will be dealing with). \n",
    "\n",
-    "There's actually a third type of word, but it doesn't need a corresponding type of step: a **subword** is a word that is a substring of another word. If, say, `ajar` is in $W$, then we know we have to place it in some step along the path. But if `jar` is also in $W$, we don't need a step for it—whenever we place `ajar`, we have automatically placed `jar`. The program can save computation time save time by initializing the set of **unused words** to be the **nonsubwords** in $W$. (We are still free to use a subword as a **bridging word** if needed.)\n",
+    "There's actually a third type of word, but it doesn't need a corresponding type of step: a **subword** is a word that is a substring of another word. If, say, `ajar` is in $W$, then we know we have to place it in some step along the path. But if `jar` is also in $W$, we don't need a separate step for it—whenever we place `ajar`, we have automatically placed `jar`. The program can save computation time by initializing the set of **unused words** to be just the **nonsubwords** in $W$. (We are still free to use a subword as a **bridging word** if needed.)\n",
    "\n",
-    "(*English trivia:* I use the clumsy term \"nonsubwords\" rather than \"superwords\", because there are a small number of words, like \"cozy\" and \"pugs\" and \"july,\" that are not subwords of any other words but are also not superwords: they have no subwords.)\n",
+    "(*English trivia:* I use the clumsy term \"nonsubwords\" rather than \"superwords\", because there are a small number of words, like \"cozy\" and \"pugs\" and \"july,\" that are not subwords and also not superwords.)\n",
    "\n",
    "If we're going to be **greedy**, we need to know what we're being greedy about. I will always choose a step that minimizes the **number of excess letters** that the step adds. To be precise: I arbitrarily assume a baseline model in which all the words are concatenated with no repeated words and no overlap between them. (That's not a valid solution, but it is useful as a benchmark.) So if I add an unused word, and overlap it with the previous word by three letters, that is an excess of -3 (a negative excess is a positive thing): I've saved three letters over just concatenating the unused word. For a bridging word, the excess is the number of letters that do not overlap either the previous word or the next word. \n",
    "\n",
-    "**Examples:** In each row of the table below, `'ajar'` is the previous word, but each row makes different assumptions about what unused words remain, and thus we get different choices for the step to take. The table shows the overlapping letters between the previous word and the step, and in the case of bridges, it shows the next unused word that the step is bridging to. In the final two columns we give the excess score and excess letters.\n",
+    "**Examples:** In each row of the table below, `'ajar'` is the previous word, but each row makes different assumptions about what unused words remain, and thus we get different choices for the step to take. The table shows the overlapping letters between the previous word and the step, and in the case of bridges, it shows the next unused word that the step is bridging to. 
In the final column we give the excess letters (if any) and excess score.\n", "\n", - "|Previous|Step(s)|Overlap|Bridge to|Type of Step|Excess|Letters|\n", - "|--------|----|----|---|---|---|---|\n", - "| ajar|jarring|jar||unused word |-3| |\n", - "| ajar|arbitrary|ar||unused word |-2||\n", - "| ajar|argot|ar|goths|one-step bridge |0||\n", - "| ajar|arrow|ar|owlets| one-step bridge|1|r|\n", - "| ajar|rani+iraq|r|quizzed| two-step bridge|5 |anira | \n", + "|Previous|Step(s)|Overlap|Bridge to|Type of Step|Excess|\n", + "|--------|----|----|---|---|---|\n", + "| ajar|jarring|jar||unused word |-3 |\n", + "| ajar|arbitrary|ar||unused word |-2|\n", + "| ajar|argot|ar|goths|one-step bridge |0|\n", + "| ajar|arrow|ar|owlets| one-step bridge|\"r\" 1|\n", + "| ajar|rani+iraq|r|quizzed| two-step bridge|\"anira\" 5 | \n", "\n", - "Let's go over the example steps one at a time:\n", + "Let's go over the example steps one row at a time:\n", "- **jarring**: Here we assume `jarring` is an unused word, and it overlaps by 3 letters, giving it an excess cost of -3, which is the best possible (an overlap of 4 would mean `ajar` is a subword, and we already agreed to eliminate subwords).\n", "- **arbitrary**: an unused word that overlaps by 2 letters, so it would only be chosen if there were no unused words with 3-letter overlap.\n", "- **argot**: From here on down, we assume there are no unused words that overlap, which means that `argot` is a best-possible bridge, because it completely overlaps the `ar` in `ajar` and the `got` in an unused word, `goths`.\n", @@ -66,19 +67,14 @@ "\n", "Here I describe how to implement the main data types in Python:\n", "\n", - "- **Word**: a Python `str` (as are subparts of words, like suffixes or individual letters).\n", "- **Wordset**: a subclass of `set`, denoting a set of words.\n", + "- **Word**: a Python `str` (as are subparts of words, like suffixes or individual letters).\n", + "- **Step**: a namedtuple of an overlap and a word; the step adding `jarring` to `ajar` would be `Step(3, 'jarring')`. The first step in a path should have an overlap of 0; all others should have a positive integer overlap.\n", "- **Path**: a Python `list` of steps.\n", - "- **Step**: a tuple of `(overlap, word)`. The step adding `jarring` to `ajar` would be `(3, 'jarring')`, indicating an overlap of three letters. The first step in a path should have an overlap of 0; all others should have a positive integer overlap.\n", - "- **Bridge**: a tuple of an excess cost followed by one or two steps, e.g. `(1, (2, 'arrow'))`.\n", - "- **Bridges**: a precomputed and cached table mapping a prefixes and suffix to a bridge. For example:\n", + "- **Bridge**: a namedtuple of an excess cost followed by a list of steps, e.g. `Bridge(1, [Step(2, 'arrow')])`.\n", "\n", - " W.bridges['ar']['ow'] == (1, (2, 'arrow'))\n", - " W.bridges['r']['q'] == (5, (1, 'rani'), (1, 'iraq'))\n", "\n", - "(*Python trivia:* I implemented `Wordset` as a subclass of `set` so that I can add attributes like `W.bridges`. You can do that with a user-defined subclass, but not with a builtin class.)\n", - "\n", - "In the following code, data tyes are **Capitalized** and indexes into tuples are **UPPERCASE**." + "(*Python trivia:* I implemented `Wordset` as a subclass of `set` so that I can add attributes (such as `W.subwords`) to it later. 
You can do that with a user-defined subclass, but not with a builtin class.)" ] }, { @@ -87,25 +83,9 @@ "metadata": {}, "outputs": [], "source": [ - "from collections import defaultdict, Counter\n", - "from typing import List, Tuple, Set, Dict, Any\n", - "\n", - "Word = str\n", - "class Wordset(set): \"\"\"A set of words.\"\"\"\n", - "Step = Tuple[int, str] # An (overlap, word) pair.\n", - "OVERLAP, WORD = 0, 1 # Indexes of the two parts of a Step.\n", - "Path = List[Step] # A list of steps.\n", - "Bridge = (int, Step,...) # An excess letter count and step(s), e.g. (1, (2, 'arrow')).\n", - "EXCESS, STEPS = 0, slice(1, None) # Indexes of the two parts of a bridge." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 108,709 Word Set \n", - "\n", - "We can make Tom Murphy's 108,709-word list `\"wordlist.asc\"` into a `Wordset`, $W$:" + "from collections import defaultdict, Counter, namedtuple\n", + "from typing import List, Tuple, Set, Dict, Any\n", + "from statistics import mean" ] }, { @@ -114,9 +94,51 @@ "metadata": {}, "outputs": [], "source": [ - "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc\n", + "class Wordset(set): \"\"\"A set of words.\"\"\"\n", "\n", - "W = Wordset(open('wordlist.asc').read().split()) " + "Word = str\n", + "Step = namedtuple('Step', 'overlap, word')\n", + "Path = List[Step]\n", + "Bridge = namedtuple('Bridge', 'excess, steps')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 108,709 Word Set: W \n", + "\n", + "We can make Tom Murphy's 108,709-word list `\"wordlist.asc\"` into a `Wordset`, $W$:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "108709" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "W = Wordset(open('wordlist.asc').read().split())\n", + "len(W)" ] }, { @@ -125,7 +147,7 @@ "source": [ "# Overall Program Design\n", "\n", - "I thought I would define a major function, `portman`, to generate the portmantout string $S$ from the set of words $W$ according to the strategy outlined above, and a minor function, `is_portman`, to verify the result. But I found verification to be difficult. For example, given $S =$ `'...helloworld...'` I would reject that as non-overlapping if I parsed it as `'hello'` + `'world'`, but I would accept it if parsed as `'hell'` + `'low'` + `'world'`. It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended, but discarded the information. \n", + "I thought I would define a major function, `portman`, to generate the portmantout string $S$ from the set of words $W$ according to the strategy outlined above, and a minor function, `is_portman`, to verify the result. But I found verification to be difficult. For example, given $S =$ `'helloworld'` I would reject that as non-overlapping if I parsed it as `'hello'` + `'world'`, but I would accept it if parsed as `'hello'` + `'low'` + `'world'`. It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended, but discarded the information. 
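\n",
+    "\n",
+    "Here is a toy illustration of that ambiguity (my own sketch, not code from the program; `parse1`, `parse2`, and `overlapping` are made-up names). Each candidate parse is a list of `(start, word)` pairs for $S =$ `'helloworld'`, and only the second one satisfies the overlap rule:\n",
+    "\n",
+    "    S = 'helloworld'\n",
+    "    parse1 = [(0, 'hello'), (5, 'world')]              # 'world' starts exactly at the end of 'hello'\n",
+    "    parse2 = [(0, 'hello'), (3, 'low'), (5, 'world')]  # each later word starts before the previous word ends\n",
+    "    def overlapping(parse):\n",
+    "        return all(start < prev_start + len(prev_word)\n",
+    "                   for (prev_start, prev_word), (start, _) in zip(parse, parse[1:]))\n",
+    "    overlapping(parse1), overlapping(parse2)           # (False, True)\n",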
\n", "\n", "Therefore, I decided to change the interface: I'll have one function that takes $W$ as input and returns a path $P$, and a second function to generate the string $S$ from $P$. I decided on the following calling and [naming](https://en.wikipedia.org/wiki/Natalie_Portman) conventions:\n", "\n", @@ -139,12 +161,12 @@ " \n", "# portman\n", "\n", - "Here is the definition of `portman` and the result of `portman(P1)`, covering the word set `W1` (which has five superwords and 16 subwords):" + "Here is the definition of `portman` and the result of `portman(P1)`, where the path `P1` covers the word set `W1` (which has five nonsubwords and 16 subwords):" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -155,24 +177,23 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ - "W1 = Wordset(('anarchy', 'dashiki', 'grammarian', 'kimono', 'monogram',\n", - " 'an', 'am', 'arc', 'arch', 'aria', 'as', 'ash', 'dash', \n", - " 'gram', 'grammar', 'i', 'mar', 'narc', 'no', 'on', 'ram'))\n", - "P1 = [(0, 'dashiki'),\n", - " (2, 'kimono'),\n", - " (4, 'monogram'),\n", - " (4, 'grammarian'),\n", - " (2 , 'anarchy')]\n", - "S1 = 'dashikimonogrammarianarchy'" + "W1 = Wordset('''anarchy dashiki grammarian kimono monogram an am arc arch i\n", + " aria as ash dash gram grammar mar narc no on ram'''.split())\n", + "P1 = [Step(0, 'dashiki'),\n", + " Step(2, 'kimono'),\n", + " Step(4, 'monogram'),\n", + " Step(4, 'grammarian'),\n", + " Step(2 , 'anarchy')]\n", + "S1 = 'dashikimonogrammarianarchy'" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -181,7 +202,7 @@ "'dashikimonogrammarianarchy'" ] }, - "execution_count": 5, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -204,28 +225,28 @@ " \n", "These structures are somewhat complicated, so don't be discouraged if you have to go over a line of code or a prose description word-by-word several times before you understand exactly how it works.\n", "\n", - "After the precomputation, `natalie` loops until there are no more unused words. On each turn we call `unused_step`, which returns a list of one step if an unused word overlaps, or the empty list if it doesn't, in which case we call `bridging_steps`, which always returns a bridge of either one or two steps. We then append the step(s) to the path `P`, and call `used(W, word)` to mark that `word` has been used (reducing the size of `W.unused` and updating `W.startswith` if `word` was previously unused). \n", + "After the precomputation, `natalie` loops until there are no more unused words. On each turn we call `unused_step`, which returns a list of one step if an unused word overlaps, or the empty list if it doesn't, in which case we call `bridging_steps`, which always returns a bridge of either one or two steps. We then append the step(s) to the path `P`, and call `mark_as_used(W, word)` to mark that `word` has been used (reducing the size of `W.unused` and updating `W.startswith` if `word` was previously unused). \n", "\n", "It is important that every bridge leads to an unused word. That way we know the program will **always terminate**: if $N$ is the number of unused nonsubwords in $W$, then consider the quantity $(2N + (1$ `if` the last step overlaps an unused word `else` $0))$. 
Every iteration of the `while` loop decreases this quantity by at least 1; therefore the quantity will eventually be zero, and when it is zero, it must be that `W.unused` is empty and the loop terminates." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def natalie(W: Wordset, start=None) -> Path:\n", " \"\"\"Return a portmantout path containing all words in W.\"\"\"\n", " precompute(W)\n", - " word = start or first(W.unused)\n", - " used(W, word)\n", - " P = [(0, word)]\n", + " word = start or next(iter(W.unused))\n", + " mark_as_used(W, word)\n", + " P = [Step(0, word)]\n", " while W.unused:\n", - " steps = unused_step(W, word) or bridging_steps(W, word)\n", - " for (overlap, word) in steps:\n", - " P.append((overlap, word))\n", - " used(W, word)\n", + " for step in unused_step(W, word) or bridging_steps(W, word):\n", + " P.append(step)\n", + " word = step.word\n", + " mark_as_used(W, word)\n", " return P" ] }, @@ -233,7 +254,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`unused_steps` considers every suffix of the previous word, longest suffix first. If a suffix starts any unused words, we choose the first such word as the step. Since we're going longest-suffix first, no other word could do better, and since unused steps have negative costs and bridges don't, the unused step will always be better than any bridge.\n", + "`unused_step` considers every suffix of the previous word, longest suffix first. If a suffix starts any unused words, we choose the first such word as the step. Since we're going longest-suffix first, no other word could do better, and since unused steps have negative costs and bridges don't, the unused step will always be better than any bridge.\n", "\n", "`bridging_steps` also tries every suffix of the previous word, and for each one it looks in the `W.bridges[suf]` table to see what prefixes (of unused words) we can bridge to from this suffix. Consider all such `W.bridges[suf][pre]` entries that bridge to the prefix of an unused word (as maintained in `W.startswith[pre]`). 
Out of all such bridges, take one with the minimal excess cost, and return the one- or two-step sequence that makes up the bridge.\n", "\n", @@ -242,33 +263,30 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def unused_step(W: Wordset, prev_word: Word) -> List[Step]:\n", - " \"\"\"Return [(overlap, unused_word)] or [].\"\"\"\n", + " \"\"\"Return [Step(overlap, unused_word)] or [].\"\"\"\n", " for suf in suffixes(prev_word):\n", - " for unused_word in W.startswith.get(suf, ()):\n", - " overlap = len(suf)\n", - " return [(overlap, unused_word)]\n", + " for word in W.startswith.get(suf, ()):\n", + " return [Step(len(suf), word)]\n", " return []\n", "\n", "def bridging_steps(W: Wordset, prev_word: Word) -> List[Step]:\n", - " \"\"\"The steps from the shortest bridge that bridges \n", + " \"\"\"The steps from the bridge with minimum excesss that bridges \n", " from a suffix of prev_word to a prefix of an unused word.\"\"\"\n", " bridge = min(W.bridges[suf][pre] \n", " for suf in suffixes(prev_word) if suf in W.bridges\n", " for pre in W.bridges[suf] if W.startswith[pre])\n", - " return bridge[STEPS]" + " return bridge.steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "(*Python trivia:* in `unused_step` I do `W.startswith.get(suf, ())`, not `W.startswith[suf]` because the dict in question is a `defaultdict(set)`, and if there is no entry there, I don't want to insert an empty set entry.)\n", - "\n", "**Failure is not an option**: what happens if we have a small word set that can't make a portmantout? First `unused_step` will fail to find an unused word, which is fine; then `bridging_steps` will fail to find a bridge, and will raise `ValueError: min() arg is an empty sequence`. You could catch that error and return, say, an empty path if you wanted to, but my intended use is for word sets where this can never happen." ] }, @@ -278,12 +296,12 @@ "source": [ "# precompute etc.\n", "\n", - "Here are a bunch of the subfunctions that make the code above work:" + "Here are the subfunctions that make the code above work:" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -297,9 +315,9 @@ " W.unused = W - W.subwords\n", " W.startswith = compute_startswith(W.unused)\n", " \n", - "def used(W, word):\n", - " \"\"\"Remove word from `W.unused` and, for each prefix, from `W.startswith[pre]`.\"\"\"\n", - " assert word in W, f'used \"{word}\", which is not in the word set'\n", + "def mark_as_used(W, word):\n", + " \"\"\"Record the fact that `word` has been used.\n", + " Remove word from `W.unused` and from `W.startswith[pre]`, for each prefix.\"\"\"\n", " if word in W.unused:\n", " W.unused.remove(word)\n", " for pre in prefixes(word):\n", @@ -307,8 +325,6 @@ " if not W.startswith[pre]:\n", " del W.startswith[pre]\n", " \n", - "def first(iterable, default=None): return next(iter(iterable), default)\n", - "\n", "def multimap(pairs) -> Dict[Any, set]:\n", " \"\"\"Given (key, val) pairs, make a dict of {key: {val,...}}.\"\"\"\n", " result = defaultdict(set)\n", @@ -355,24 +371,25 @@ "source": [ "# Building Bridges\n", "\n", - "The last piece of the program is the construction of the `W.bridges` table. 
Recall that we want `W.bridges[suf][pre]` to be a bridge between a suffix of the previous word and a prefix of an unused word, as in the examples:\n", "\n", - " W.bridges['ar']['ow'] == (1, (2, 'arrow'))\n", - " W.bridges['ar']['c'] == (0, (2, 'arc'))\n", - " W.bridges['r']['q'] == (5, (1, 'rani'), (1, 'iraq'))\n", - " \n", - "We build all the bridges once and for all in `precompute`, and don't update them as words are used. Thus, `W.bridges['r']['q']` says \"if there are any unused words starting with `'q'`, you can use this bridge, but I'm not promising there are any.\" The caller (i.e. `bridging_steps`) is responsible for checking that `W.startswith['q']` contains unused word(s).\n", - " \n", - "Bridges should be short. We don't need to consider `antidisestablishmentarianism` as a possible bridge word. Instead, from our 108,709 word set $W$, we'll select the 10,273 words with length up to 5, plus 20 six-letter words that end in any of 'qujvz', the rarest letters. (For other word sets, you may have to tune these parameters.) I call these `shortwords`. I also compute a `shortstartswith` table for the `shortwords`, where, for example,\n", + "The last piece of the program is the construction of the `W.bridges` table. Recall that we want `W.bridges[suf][pre]` to be a bridge between a suffix of the previous word and a prefix of some new word, as in these examples:\n", "\n", - " shortstartswith['som'] == {'soma', 'somas', 'some'} # but not 'somebodies', 'somersaulting', ...\n", + " W.bridges['ar']['ow'] == Bridge(1, [Step(2, 'arrow')])\n", + " W.bridges['ar']['c'] == Bridge(0, [Step(2, 'arc')])\n", + " W.bridges['r']['q'] == Bridge(5, [Step(1, 'rani'), Step(1, 'iraq')])\n", + " \n", + "We build all the bridges once and for all in `precompute`, and don't update them as words are used. Thus, `W.bridges['ar']['c']` says \"if there are any unused words starting with `'c'`, you can use this bridge, but I'm not promising there are any.\" The caller (i.e. `bridging_steps`) is responsible for checking that `W.startswith['c']` contains unused word(s).\n", + " \n", + "Bridges should be short. We don't need to consider `antidisestablishmentarianism` as a possible bridge word. Instead, from our 108,709 word set $W$, we'll select the words with length up to 5 (10,273 words), plus six-letter words that end in any of the rarest letters, 'qujvz' (20 words). For other word sets, you may have to modify these parameters. I call these `shortwords`. I also compute a `shortstartswith` table for the `shortwords`, where, for example,\n", + "\n", + " shortstartswith['som'] == {'soma', 'somas', 'some'} # but not 'somersaulting', ...\n", " \n", "To build one-word bridges, consider every shortword, and split it up in all possible ways into a prefix that will overlap the previous word, a suffix that will overlap the next word, and a count of zero or more excess letters in the middle that don't overlap anything. 
For example:" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -385,7 +402,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -403,7 +420,7 @@ " (3, 'a', 'w')]" ] }, - "execution_count": 10, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -421,16 +438,16 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def try_bridge(bridges, pre, suf, excess, word, step2=None):\n", " \"\"\"Store a new bridge if it has less excess than the previous bridges[pre][suf].\"\"\"\n", - " if suf not in bridges[pre] or excess < bridges[pre][suf][EXCESS]:\n", - " bridge = (excess, (len(pre), word))\n", - " if step2: bridge += (step2,)\n", - " bridges[pre][suf] = bridge" + " if pre != suf and (suf not in bridges[pre] or excess < bridges[pre][suf].excess):\n", + " steps = [Step(len(pre), word)]\n", + " if step2: steps.append(step2)\n", + " bridges[pre][suf] = Bridge(excess, steps)" ] }, { @@ -444,13 +461,13 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def build_bridges(W: Wordset, maxlen=5, end='qujvz'):\n", - " \"\"\"A table of bridges[pre][suf] == (excess, (overlap, word)), e.g.\n", - " bridges['ar']['c'] == (0, (2, 'arc')).\"\"\"\n", + " \"\"\"A table of bridges[pre][suf] == Bridge(excess, steps), e.g.\n", + " bridges['ar']['c'] == Bridge(0, [Step(2, 'arc')]).\"\"\"\n", " bridges = defaultdict(dict)\n", " shortwords = [w for w in W if len(w) <= maxlen + (w[-1] in end)]\n", " shortstartswith = compute_startswith(shortwords)\n", @@ -464,9 +481,8 @@ " for word2 in shortstartswith[suf]: \n", " excess = len(word1) + len(word2) - len(suf) - 2\n", " A, B = word1[0], word2[-1]\n", - " if A != B:\n", - " step2 = (len(suf), word2)\n", - " try_bridge(bridges, A, B, excess, word1, step2)\n", + " step2 = Step(len(suf), word2)\n", + " try_bridge(bridges, A, B, excess, word1, step2)\n", " return bridges" ] }, @@ -481,7 +497,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -495,7 +511,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 16, "metadata": {}, "outputs": [ { @@ -504,7 +520,7 @@ "set()" ] }, - "execution_count": 14, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -523,7 +539,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -532,7 +548,7 @@ "(630, 650)" ] }, - "execution_count": 15, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } @@ -546,27 +562,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Portmantout Solutions\n", + "# Portmantout for W1\n", "\n", "**Finally!** We're ready to make portmantouts. 
First for the small word list `W1`:" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[(0, 'dashiki'),\n", - " (2, 'kimono'),\n", - " (4, 'monogram'),\n", - " (4, 'grammarian'),\n", - " (2, 'anarchy')]" + "[Step(overlap=0, word='dashiki'),\n", + " Step(overlap=2, word='kimono'),\n", + " Step(overlap=4, word='monogram'),\n", + " Step(overlap=4, word='grammarian'),\n", + " Step(overlap=2, word='anarchy')]" ] }, - "execution_count": 16, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } @@ -577,7 +593,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -586,7 +602,7 @@ "'dashikimonogrammarianarchy'" ] }, - "execution_count": 17, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -599,81 +615,66 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's make the portmantout and see how many steps and how many letters it is, and how long it takes:" + "# Portmantout for W\n", + "\n", + "Now the portmantout for the full word set $W$:" ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 9.1 s, sys: 151 ms, total: 9.25 s\n", - "Wall time: 9.56 s\n" + "CPU times: user 7.81 s, sys: 48.9 ms, total: 7.86 s\n", + "Wall time: 7.89 s\n" ] }, { "data": { "text/plain": [ - "(103470, 553747)" + "103088" ] }, - "execution_count": 18, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time P = natalie(W)\n", - "S = portman(P)\n", - "len(P), len(S)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can specify the starting word, as Tom Murphy did:" + "len(P)" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 21, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 8.79 s, sys: 103 ms, total: 8.89 s\n", - "Wall time: 9.06 s\n" - ] - }, { "data": { "text/plain": [ - "(103466, 553742)" + "553761" ] }, - "execution_count": 19, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "%time P = natalie(W, start='portmanteau')\n", "S = portman(P)\n", - "len(P), len(S)" + "len(S)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "I thought it might take 10 minutes, so under 10 seconds is super. " + "I thought it might take 10 minutes, so under 10 seconds is great. But I'm not really excited to look at half a million letters in $S$; I'll define a better way to explore the portmantout." ] }, { @@ -682,16 +683,17 @@ "source": [ "# Making it Prettier\n", "\n", - "Notice I haven't actually *looked* at the portmantout yet. I didn't want to dump half a million letters into an output cell. Instead, I'll define `report` to print various statistics, summarize the begin and end of the portmantout, and save the full string $S$ into [a file](natalie.txt). " + "I'll define `report` to print various statistics, summarize the start and end of the portmantout, and save $S$ into [a file](natalie.txt). 
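\n",
+    "\n",
+    "(As a quick hand check of the average-overlap statistic that `report` computes, consider the small example from above: the words of `P1` total 38 letters while `S1` has 26, so the path overlaps 38 - 26 = 12 letters across its 4 overlapping steps, an average of 3.0. This check is mine, not part of the program:)\n",
+    "\n",
+    "    sum(len(step.word) for step in P1) - len(S1)   # 38 - 26 == 12 overlapping letters\n",
+    "    12 / (len(P1) - 1)                             # == 3.0 letters of overlap per step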
" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def report(W, P, steps=50, letters=500, save='natalie.txt'):\n", + " '''Print information about the portmantout and optionally `save` to file.'''\n", " S = portman(P)\n", " sub = W.subwords\n", " nonsub = W - sub\n", @@ -699,26 +701,27 @@ " missing = len(missing_bridges(W)) or \"no\"\n", " valid = \"is\" if is_portman(P, W) else \"IS NOT\"\n", " def L(words): return sum(map(len, words)) # Number of letters\n", - " print(f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords; {len(sub):,d} subwords).')\n", - " print(f'P has {len(P):,d} steps ({len(nonsub):,d} nonsubwords; {bridge:,d} bridge words).')\n", - " print(f'S has {len(S):,d} letters; W has {L(W):,d}; nonsubs have {L(nonsub):,d}.')\n", - " print(f'P has an average overlap of {(L(w for _,w in P)-len(S))/(len(P)-1):.2f} letters.')\n", - " print(f'S has a compression ratio (letters(W)/letters(S)) of {L(W)/len(S):.2f}.')\n", - " print(f'P (and thus S) {valid} a valid portmantout of W.')\n", - " print(f'W has {missing} missing one-letter-to-one-letter bridges.')\n", + " print(f'''\n", + " W has {len(W):,d} words ({len(sub):,d} subwords; {len(nonsub):,d} nonsubwords)\n", + " P has {len(P):,d} steps ({bridge:,d} bridge words)\n", + " S has {len(S):,d} letters; W has {L(W):,d}; nonsubs have {L(nonsub):,d}\n", + " P has an average overlap of {(L(w for _,w in P)-len(S))/(len(P)-1):.2f} letters\n", + " S has a compression ratio (letters(W)/letters(S)) of {L(W)/len(S):.2f}\n", + " P (and thus S) {valid} a valid portmantout of W\n", + " W has {missing} missing one-letter-to-one-letter bridges''')\n", " if save:\n", - " print(f'S saved as \"{save}\", {open(save, \"w\").write(S)} bytes.')\n", + " print(f' S saved as \"{save}\", ({open(save, \"w\").write(S)} bytes)')\n", " print(f'\\nThe first and last {steps} steps are:\\n')\n", - " for step in [*P[:steps], '... ...', *P[-steps:]]:\n", - " print(step)\n", + " for step in (*P[:steps], Step('.', '...'), *P[-steps:]):\n", + " print(step.overlap, step.word)\n", " print(f'\\nThe first and last {letters} letters are:\\n\\n{S[:letters]} ... 
{S[-letters:]}')\n", "\n", "def is_portman(P: Path, W: Wordset) -> str:\n", " \"\"\"Verify that P forms a valid portmantout string for W.\"\"\"\n", - " all_words = (W - W.subwords) <= set(w for (_, w) in P) <= W\n", + " all_words = (W - W.subwords) <= set(w.word for w in P) <= W\n", " overlaps = all(overlap > 0 and P[i - 1][1][-overlap:] == word[:overlap]\n", " for i, (overlap, word) in enumerate(P[1:], 1))\n", - " return all_words and overlaps and P[0][OVERLAP] == 0 # first step has 0 overlap" + " return all_words and overlaps and P[0].overlap == 0 " ] }, { @@ -730,7 +733,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 23, "metadata": { "scrolled": false }, @@ -739,122 +742,123 @@ "name": "stdout", "output_type": "stream", "text": [ - "W has 108,709 words (64,389 nonsubwords; 44,320 subwords).\n", - "P has 103,466 steps (64,389 nonsubwords; 39,077 bridge words).\n", - "S has 553,742 letters; W has 931,823; nonsubs have 595,805.\n", - "P has an average overlap of 1.65 letters.\n", - "S has a compression ratio (letters(W)/letters(S)) of 1.68.\n", - "P (and thus S) is a valid portmantout of W.\n", - "W has no missing one-letter-to-one-letter bridges.\n", - "S saved as \"natalie.txt\", 553742 bytes.\n", + "\n", + " W has 108,709 words (44,320 subwords; 64,389 nonsubwords)\n", + " P has 103,088 steps (38,699 bridge words)\n", + " S has 553,761 letters; W has 931,823; nonsubs have 595,805\n", + " P has an average overlap of 1.64 letters\n", + " S has a compression ratio (letters(W)/letters(S)) of 1.68\n", + " P (and thus S) is a valid portmantout of W\n", + " W has no missing one-letter-to-one-letter bridges\n", + " S saved as \"natalie.txt\", (553761 bytes)\n", "\n", "The first and last 50 steps are:\n", "\n", - "(0, 'portmanteau')\n", - "(2, 'autographic')\n", - "(7, 'graphicness')\n", - "(3, 'essayists')\n", - "(2, 'tsarisms')\n", - "(1, 'skydiving')\n", - "(3, 'ingenuously')\n", - "(3, 'slyest')\n", - "(4, 'yesterdays')\n", - "(4, 'daysides')\n", - "(5, 'sideswiping')\n", - "(4, 'pingrasses')\n", - "(5, 'assessee')\n", - "(3, 'seeking')\n", - "(4, 'kinged')\n", - "(2, 'editresses')\n", - "(3, 'sestets')\n", - "(5, 'stetsons')\n", - "(4, 'sonships')\n", - "(5, 'shipshape')\n", - "(5, 'shapelessness')\n", - "(3, 'essayers')\n", - "(3, 'erstwhile')\n", - "(5, 'whiles')\n", - "(3, 'lessening')\n", - "(3, 'ingested')\n", - "(4, 'stedhorses')\n", - "(6, 'horseshoes')\n", - "(5, 'shoestrings')\n", - "(5, 'ringsides')\n", - "(5, 'sideslipping')\n", - "(3, 'ingrafting')\n", - "(4, 'tingles')\n", - "(3, 'lessees')\n", - "(4, 'seesawing')\n", - "(4, 'wingedly')\n", - "(2, 'lyrically')\n", - "(4, 'allyls')\n", - "(1, 'syllabicate')\n", - "(4, 'cateresses')\n", - "(3, 'sessile')\n", - "(4, 'silentness')\n", - "(3, 'essays')\n", - "(1, 'sheening')\n", - "(3, 'ingratiating')\n", - "(5, 'atingle')\n", - "(6, 'tinglers')\n", - "(3, 'ersatzes')\n", - "(3, 'zestfulness')\n", - "(7, 'fulnesses')\n", - "... 
...\n", - "(1, 'quarrelled')\n", - "(1, 'deli')\n", - "(1, 'iraq')\n", - "(1, 'quakily')\n", - "(1, 'yoni')\n", - "(1, 'iraq')\n", - "(1, 'quaffing')\n", - "(1, 'gorki')\n", - "(1, 'iraq')\n", - "(1, 'quarts')\n", - "(1, 'sir')\n", - "(2, 'iraq')\n", - "(1, 'qaids')\n", - "(1, 'sir')\n", - "(2, 'iraq')\n", - "(1, 'quarantining')\n", - "(1, 'gorki')\n", - "(1, 'iraq')\n", - "(1, 'qatar')\n", - "(1, 'rani')\n", - "(1, 'iraq')\n", - "(1, 'quinquina')\n", - "(1, 'aqua')\n", - "(3, 'quakier')\n", - "(1, 'rani')\n", - "(1, 'iraq')\n", - "(1, 'quailing')\n", - "(1, 'gorki')\n", - "(1, 'iraq')\n", - "(1, 'quahogs')\n", - "(1, 'sir')\n", - "(2, 'iraq')\n", - "(1, 'quaaludes')\n", - "(1, 'sir')\n", - "(2, 'iraq')\n", - "(1, 'quakerism')\n", - "(1, 'maqui')\n", - "(3, 'quitclaimed')\n", - "(1, 'deli')\n", - "(1, 'iraq')\n", - "(1, 'quarries')\n", - "(1, 'sir')\n", - "(2, 'iraq')\n", - "(1, 'quinols')\n", - "(1, 'sir')\n", - "(2, 'iraq')\n", - "(1, 'quicklime')\n", - "(1, 'emir')\n", - "(2, 'iraq')\n", - "(1, 'quakingly')\n", + "0 leadworks\n", + "5 workstations\n", + "3 onstage\n", + "5 stagey\n", + "3 geysers\n", + "3 ersatzes\n", + "3 zesting\n", + "5 stingier\n", + "2 eroding\n", + "4 dinginess\n", + "5 inessential\n", + "9 essentially\n", + "4 allyls\n", + "1 straightest\n", + "4 testings\n", + "1 solderers\n", + "3 erstwhile\n", + "5 whiles\n", + "3 lessening\n", + "3 ingressive\n", + "2 vending\n", + "4 dinghy\n", + "2 hygienics\n", + "1 snooted\n", + "3 tediums\n", + "1 sandpile\n", + "4 pileups\n", + "3 upshots\n", + "4 hotspurs\n", + "4 pursed\n", + "3 seducible\n", + "3 blending\n", + "4 dingles\n", + "3 lessees\n", + "4 seesawed\n", + "3 wedgier\n", + "2 erupted\n", + "3 teddy\n", + "4 eddying\n", + "5 dyings\n", + "1 suffocated\n", + "3 tediously\n", + "3 slyest\n", + "4 yesteryears\n", + "4 earshots\n", + "4 hotshots\n", + "2 tsunamis\n", + "4 amiss\n", + "4 misspoke\n", + "5 spokesman\n", + ". ...\n", + "1 yeti\n", + "1 iraq\n", + "1 quotients\n", + "1 sir\n", + "2 iraq\n", + "1 quaverers\n", + "1 sir\n", + "2 iraq\n", + "1 quartzite\n", + "1 emir\n", + "2 iraq\n", + "1 quinins\n", + "1 sir\n", + "2 iraq\n", + "1 quitclaiming\n", + "1 genii\n", + "1 iraq\n", + "1 quicksteps\n", + "1 sir\n", + "2 iraq\n", + "1 quarterlies\n", + "1 sir\n", + "2 iraq\n", + "1 quakers\n", + "1 sir\n", + "2 iraq\n", + "1 quackster\n", + "1 rani\n", + "1 iraq\n", + "1 quicksets\n", + "1 sir\n", + "2 iraq\n", + "1 quasi\n", + "1 iraq\n", + "1 quarrellers\n", + "1 sir\n", + "2 iraq\n", + "1 quantifies\n", + "1 sir\n", + "2 iraq\n", + "1 quitclaimed\n", + "1 deli\n", + "1 iraq\n", + "1 quaggier\n", + "1 rani\n", + "1 iraq\n", + "1 quotationally\n", + "1 yeti\n", + "1 iraq\n", + "1 quarantined\n", "\n", "The first and last 500 letters are:\n", "\n", - "portmanteautographicnessayistsarismskydivingenuouslyesterdaysideswipingrassesseekingeditressestetsonshipshapelessnessayerstwhilesseningestedhorseshoestringsideslippingraftinglesseesawingedlyricallylsyllabicateressessilentnessaysheeningratiatinglersatzestfulnessestercesareanschlussresistentorsionallyonnaiseminarsenidesolatediumsquirtingeditoriallysergichoroustaboutstationstagersiltiercelsiusurpingotsaristsaritzastrodometerselysiantitankersparestatingeingrowingspreadsheetsarsaparillasciviouslylyce ... 
iraquandaryoniraquotidianoiraqianaquaggiestaxiraquodsiraquandobeliraquondamaquiversiraquaffersiraquantumaquicklyoniraquaintlyoniraqurushairaquaverersiraquakiestaxiraquaversiraquizzingorkiraquarrellersiraquicksetsiraquickiesiraquintalsiraquackishnessiraquietismsiraquizzedeliraquailedeliraquarrelledeliraquakilyoniraquaffingorkiraquartsiraqaidsiraquarantiningorkiraqataraniraquinquinaquakieraniraquailingorkiraquahogsiraquaaludesiraquakerismaquitclaimedeliraquarriesiraquinolsiraquicklimemiraquakingly\n" + "leadworkstationstageysersatzestingierodinginessentiallylstraightestingsoldererstwhilesseningressivendinghygienicsnootediumsandpileupshotspurseduciblendinglesseesawedgierupteddyingsuffocatediouslyesteryearshotshotsunamisspokesmangilyricizingypsumsomewhereasestinassimilableariesteemsmelterythrocytesteesophagallowsesquicentenniallyricizedseductivenessayingspongingerysipelasticizingersyphiloidiumbilicalciferoustedhorseshoerstediousnessayeditorialistablemansuetudesalterselysianacondastardlyricistsets ... raquacksalveraniraquartsiraquickeraniraquoinedeliraquoinsiraquainteraniraquarriedeliraquinquinaquahogsiraquietusesiraquitrentsiraquotablyetiraquarrelingeniiraquintilesiraquarrelersiraquintillionthsiraquailingeniiraquietensiraquiveringlyetiraquackedeliraquaveredeliraquakinglyetiraquotientsiraquaverersiraquartzitemiraquininsiraquitclaimingeniiraquickstepsiraquarterliesiraquakersiraquacksteraniraquicksetsiraquasiraquarrellersiraquantifiesiraquitclaimedeliraquaggieraniraquotationallyetiraquarantined\n" ] } ], @@ -870,21 +874,24 @@ "\n", "The program is complete, but there are still many interesting things to explore. \n", "\n", - "**My first question**: is there an imbalance in starting and ending letters of words? That could lead to a need for many two-word bridges. We saw that the last 50 steps of $P$ all involved words that start with `q`, or bridges to them. " + "**Question**: is there an imbalance in starting and ending letters of words? \n", + "\n", + "This is important because it could lead to a need for many two-word bridges. We saw that the last 50 steps of $P$ all involved words that start with `q`, or bridges to them. What letters besides `q` might lead to problems?" ] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ - "precompute(W)" + "precompute(W)\n", + "words = W.unused # The nonsub words in W" ] }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -918,18 +925,19 @@ " ('x', 51)]" ] }, - "execution_count": 23, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "Counter(w[0] for w in W.unused).most_common() # How many nonsubwords start with each letter?" + "# How many nonsubwords start with each letter?\n", + "Counter(w[0] for w in words).most_common() " ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -957,31 +965,72 @@ " ('w', 42),\n", " ('z', 21),\n", " ('u', 11),\n", - " ('v', 6),\n", - " ('b', 6)]" + " ('b', 6),\n", + " ('v', 6)]" ] }, - "execution_count": 24, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "Counter(w[-1] for w in W.unused).most_common() # How many nonsubwords end with each letter?" 
+ "# How many nonsubwords end with each letter?\n", + "Counter(w[-1] for w in words).most_common() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Yes, there is a problem: there are 330 words that start with `q` and no nonsubwords that end in `q` (there are two subwords, `colloq` and `iraq`). There is also a problem with 29,056 nonsubwords ending in `s` and only 7,388 starting with `s`. But many words start with combinations like `as` or `es` or `ps`, whereas there are few chances for a `q` at the start of a word to match up with a `q` near the end of a word.\n", - "\n", - "Here are all the words that have a `q` as one of their last three letters:" + "Yes, there is a problem: there are 330 words that start with `q` and no nonsubwords that end in `q` (there are two subwords, `colloq` and `iraq`). Here's a comparison of how many times a `q` appears in the first three letters and last three letters of nonsubwords:" ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "710" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qstart = {w for w in words if 'q' in w[:3]}\n", + "len(qstart)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "23" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qend = {w for w in words if 'q' in w[-3:]}\n", + "len(qend)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, "metadata": { "scrolled": false }, @@ -989,63 +1038,112 @@ { "data": { "text/plain": [ - "('diplomatique faqir mystique relique mosque arabesque macaque maqui mozambique catafalque plaque brusque bisque unique obloquy manque perique applique claque boutique iraq grotesque cheque picaresque statuesque oblique opaque marque toque basque cinque obsequy aqua iraqis barque cinematheque critique odalisque albuquerque prosequi colloq tuque pulque pratique remarque baroque colloquy iraqi soliloquy technique burlesque placque discotheque hdqrs pique antique bosque cliquy semiopaque picturesque cirque romanesque torque yanqui ventriloquy casque clique physique masque bezique communique risque',\n", - " 72)" + "'albuquerque statuesque diplomatique semiopaque mozambique romanesque odalisque colloquy cliquy risque maqui picaresque perique manque prosequi iraqis bezique soliloquy obloquy ventriloquy placque obsequy hdqrs'" ] }, - "execution_count": 25, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "q3 = {w for w in W if 'q' in w[-3:]}\n", - "' '.join(q3), len(q3)" + "' '.join(qend)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**My second question**: what are the most common steps in $P$? These will be bridge words. What do they have in common?" + "There is also a problem with 29,056 nonsubwords ending in `s` and only 7,388 starting with `s`. 
But many words start with combinations like `as` or `es` or `ps`, so the overall ratio is not as bad:" ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[((1, 'so'), 2561),\n", - " ((1, 'sap'), 2536),\n", - " ((1, 'dab'), 2360),\n", - " ((1, 'sic'), 2223),\n", - " ((1, 'of'), 2039),\n", - " ((2, 'lyre'), 1643),\n", - " ((1, 'sun'), 1519),\n", - " ((1, 'sin'), 1400),\n", - " ((1, 'yam'), 867),\n", - " ((2, 'lye'), 734),\n", - " ((1, 'go'), 679),\n", - " ((1, 'yow'), 612),\n", - " ((1, 'spa'), 610),\n", - " ((1, 'econ'), 609),\n", - " ((1, 'gem'), 562),\n", - " ((1, 'gun'), 487),\n", - " ((1, 'yen'), 465),\n", - " ((3, 'erst'), 454),\n", - " ((2, 'type'), 447),\n", - " ((1, 'she'), 390),\n", - " ((1, 'you'), 371),\n", - " ((1, 'sex'), 324),\n", - " ((1, 'simp'), 317),\n", - " ((1, 'tv'), 312),\n", - " ((1, 'gal'), 297)]" + "12247" ] }, - "execution_count": 26, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sstart = {w for w in words if 's' in w[:3]}\n", + "len(sstart)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "32394" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "send = {w for w in words if 's' in w[-3:]}\n", + "len(send)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Question**: what are the most common steps in $P$? \n", + "\n", + "These will be bridge words, which can be repeated any number of times. What do they have in common?" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(Step(overlap=1, word='sac'), 3170),\n", + " (Step(overlap=1, word='so'), 2079),\n", + " (Step(overlap=2, word='lyre'), 1649),\n", + " (Step(overlap=1, word='dab'), 1628),\n", + " (Step(overlap=1, word='of'), 1591),\n", + " (Step(overlap=1, word='sun'), 1487),\n", + " (Step(overlap=1, word='gab'), 1483),\n", + " (Step(overlap=1, word='sin'), 1407),\n", + " (Step(overlap=1, word='sip'), 1062),\n", + " (Step(overlap=1, word='sam'), 971),\n", + " (Step(overlap=1, word='yam'), 958),\n", + " (Step(overlap=1, word='sew'), 789),\n", + " (Step(overlap=2, word='lye'), 707),\n", + " (Step(overlap=1, word='spa'), 612),\n", + " (Step(overlap=1, word='gun'), 503),\n", + " (Step(overlap=3, word='erst'), 489),\n", + " (Step(overlap=1, word='yen'), 466),\n", + " (Step(overlap=2, word='type'), 450),\n", + " (Step(overlap=1, word='gip'), 437),\n", + " (Step(overlap=1, word='econ'), 423),\n", + " (Step(overlap=1, word='she'), 392),\n", + " (Step(overlap=1, word='go'), 389),\n", + " (Step(overlap=1, word='sex'), 356),\n", + " (Step(overlap=1, word='yep'), 323),\n", + " (Step(overlap=1, word='you'), 321)]" + ] + }, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" } @@ -1063,16 +1161,16 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0.2971507548373379" + "0.2984246469036163" ] }, - "execution_count": 27, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } @@ -1085,12 +1183,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**My third question:** What is the distribution of word lengths? What is the longest word? What is the distribution of letters?" 
+ "**Question:** What is the distribution of word lengths? " ] }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -1120,18 +1218,25 @@ " 28: 1})" ] }, - "execution_count": 28, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "Counter(sorted(map(len, W.unused))) # Counter of word lengths" + "Counter(sorted(map(len, words))) # Counter of word lengths" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Question:** What is the longest nonsubword? The shortest?" ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -1140,18 +1245,45 @@ "'antidisestablishmentarianism'" ] }, - "execution_count": 29, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "max(W, key=len)" + "max(words, key=len)" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'fbi', 'ibm'}" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{w for w in words if len(w) <= 3}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Question**: What is the distribution of letters?" + ] + }, + { + "cell_type": "code", + "execution_count": 37, "metadata": {}, "outputs": [ { @@ -1185,342 +1317,22 @@ " ('q', 1066)]" ] }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Counter(L for w in W.unused for L in w).most_common() # Counter of letters" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**My fourth question**: How many bridges are there? How many excess letters do they have? What words do they use? 
" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "56477" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Make a list of all bridges, B, and see how many there are\n", - "B = [(suf, pre, W.bridges[suf][pre]) for suf in W.bridges for pre in W.bridges[suf]]\n", - "len(B)" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('s', 'laty', (0, (1, 'slaty'))),\n", - " ('p', 'ram', (0, (1, 'pram'))),\n", - " ('a', 'lpha', (0, (1, 'alpha'))),\n", - " ('d', 'ces', (1, (1, 'duces'))),\n", - " ('h', 'uts', (0, (1, 'huts'))),\n", - " ('ke', 'ab', (1, (2, 'kebab'))),\n", - " ('f', 'izz', (0, (1, 'fizz'))),\n", - " ('ho', 'wls', (0, (2, 'howls'))),\n", - " ('c', 'lo', (2, (1, 'cello'))),\n", - " ('g', 'ogo', (0, (1, 'gogo'))),\n", - " ('l', 'th', (1, (1, 'loth'))),\n", - " ('b', 'ola', (0, (1, 'bola'))),\n", - " ('ne', 'ro', (1, (2, 'negro'))),\n", - " ('riv', 'n', (1, (3, 'riven'))),\n", - " ('li', 'szt', (0, (2, 'liszt'))),\n", - " ('on', 'ces', (0, (2, 'onces'))),\n", - " ('na', 'l', (1, (2, 'nail'))),\n", - " ('ov', 'um', (0, (2, 'ovum'))),\n", - " ('br', 'ke', (1, (2, 'broke'))),\n", - " ('sti', 'le', (0, (3, 'stile'))),\n", - " ('ax', 'els', (0, (2, 'axels'))),\n", - " ('yea', 'n', (1, (3, 'yearn'))),\n", - " ('whel', 'p', (0, (4, 'whelp'))),\n", - " ('cabe', 'r', (0, (4, 'caber'))),\n", - " ('fal', 'ls', (0, (3, 'falls'))),\n", - " ('cza', 'r', (0, (3, 'czar'))),\n", - " ('snuc', 'k', (0, (4, 'snuck'))),\n", - " ('scen', 'e', (0, (4, 'scene'))),\n", - " ('apne', 'a', (0, (4, 'apnea')))]" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "B[::2000] # Sample every 2000th bridge" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Counter({0: 37189, 1: 16708, 2: 2425, 3: 95, 4: 32, 5: 21, 6: 6, 8: 1})" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Counter of bridge excess letters\n", - "Counter(x for (_, _, (x, *_)) in B)" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.3916638631655364" - ] - }, - "execution_count": 34, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def average(counter):\n", - " return sum(x * counter[x] for x in counter) / sum(counter.values())\n", - "\n", - "average(_) # Average excess across all bridges" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Counter({1: 56327, 2: 150})" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# How many 1-step and 2-step bridges are there?\n", - "Counter(len(steps) for (_, _, (_, *steps)) in B)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**My fifth question**: What strange letter combinations are there? Let's look at two-letter suffixes or prefixes that only appear in one or two nonsubwords. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": { - "scrolled": false - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{'jn': {'jnanas'},\n", - " 'dv': {'dvorak'},\n", - " 'if': {'iffiness'},\n", - " 'ym': {'ymca'},\n", - " 'kw': {'kwachas', 'kwashiorkor'},\n", - " 'fj': {'fjords'},\n", - " 'ek': {'ekistics'},\n", - " 'aj': {'ajar'},\n", - " 'xi': {'xiphoids', 'xiphosuran'},\n", - " 'sf': {'sforzatos'},\n", - " 'yc': {'ycleped', 'yclept'},\n", - " 'hd': {'hdqrs'},\n", - " 'dn': {'dnieper'},\n", - " 'ip': {'ipecacs'},\n", - " 'ee': {'eelgrasses', 'eelworm'},\n", - " 'qa': {'qaids', 'qatar'},\n", - " 'ie': {'ieee'},\n", - " 'oj': {'ojibwas'},\n", - " 'pf': {'pfennigs'},\n", - " 'wu': {'wurzel'},\n", - " 'uf': {'ufos'},\n", - " 'ik': {'ikebanas', 'ikons'},\n", - " 'tc': {'tchaikovsky'},\n", - " 'bw': {'bwanas'},\n", - " 'zw': {'zwiebacks'},\n", - " 'gj': {'gjetosts'},\n", - " 'iv': {'ivories', 'ivory'},\n", - " 'xm': {'xmases'},\n", - " 'zl': {'zlotys'},\n", - " 'll': {'llamas', 'llanos'},\n", - " 'ct': {'ctrl'},\n", - " 'qo': {'qophs'},\n", - " 'gw': {'gweducks', 'gweducs'},\n", - " 'ez': {'ezekiel'},\n", - " 'mc': {'mcdonald'},\n", - " 'ay': {'ayahs', 'ayatollahs'},\n", - " 'fb': {'fbi'}}" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "{pre: W.startswith[pre] # Rare two-letter prefixes\n", - " for pre in W.startswith if len(pre) == 2 and len(W.startswith[pre]) in (1, 2)}" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'lm': {'stockholm', 'unhelm'},\n", - " 'yx': {'styx'},\n", - " 'ao': {'chiao', 'ciao'},\n", - " 'oe': {'monroe'},\n", - " 'oi': {'hanoi', 'polloi'},\n", - " 'tl': {'peyotl', 'shtetl'},\n", - " 'nx': {'bronx', 'meninx'},\n", - " 'rf': {'waldorf', 'windsurf'},\n", - " 'wa': {'kiowa', 'okinawa'},\n", - " 'lu': {'honolulu'},\n", - " 'ho': {'groucho'},\n", - " 'bm': {'ibm', 'icbm'},\n", - " 'vo': {'concavo'},\n", - " 'zo': {'diazo', 'palazzo'},\n", - " 'ud': {'aloud', 'overproud'},\n", - " 'pa': {'tampa'},\n", - " 'xo': {'convexo'},\n", - " 'hr': {'kieselguhr'},\n", - " 'hm': {'microhm'},\n", - " 'ef': {'unicef'},\n", - " 'rb': {'cowherb'},\n", - " 'ji': {'fiji'},\n", - " 'ep': {'asleep', 'shlep'},\n", - " 'td': {'retd'},\n", - " 'po': {'troppo'},\n", - " 'gm': {'apophthegm'},\n", - " 'ub': {'beelzebub'},\n", - " 'ku': {'haiku'},\n", - " 'hu': {'buchu'},\n", - " 'xe': {'deluxe', 'maxixe'},\n", - " 'gn': {'champaign'},\n", - " 'ug': {'bedrug', 'sparkplug'},\n", - " 'ec': {'filespec', 'quebec'},\n", - " 'nu': {'vishnu'},\n", - " 'ru': {'nehru'},\n", - " 'mb': {'clomb', 'whitecomb'},\n", - " 'ui': {'maqui', 'prosequi'},\n", - " 'sr': {'ussr'},\n", - " 'ln': {'lincoln'},\n", - " 'xs': {'duplexs'},\n", - " 'mp': {'prestamp'},\n", - " 'ab': {'skylab'},\n", - " 'hn': {'mendelssohn'},\n", - " 'cd': {'recd'},\n", - " 'uc': {'caoutchouc'},\n", - " 'dt': {'rembrandt'},\n", - " 'nc': {'dezinc', 'quidnunc'},\n", - " 'sz': {'grosz'},\n", - " 'we': {'zimbabwe'},\n", - " 'ai': {'bonsai'},\n", - " 'mt': {'daydreamt', 'undreamt'},\n", - " 'zt': {'liszt'},\n", - " 'ua': {'joshua'},\n", - " 'aa': {'markkaa'},\n", - " 'fa': {'khalifa'},\n", - " 'ob': {'blowjob'},\n", - " 'ko': {'gingko', 'stinko'},\n", - " 'zm': {'transcendentalizm'},\n", - " 'dn': {'haydn'},\n", - " 'oz': {'kolkhoz'},\n", - " 'eh': {'mikveh', 'yahweh'},\n", - " 'tu': {'impromptu'},\n", - " 'za': {'organza'},\n", - " 'su': {'shiatsu'},\n", - " 'vt': 
{'govt'},\n", - " 'ou': {'thankyou'},\n", - " 'nz': {'franz'}}" - ] - }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "endswith = multimap((w[-2:], w) for w in W.unused)\n", - "\n", - "{suf: endswith[suf] # Rare two-letter suffixes\n", - " for suf in endswith if len(endswith[suf]) in (1, 2)}" + "Counter(L for w in words for L in w).most_common() # Counter of letters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The two-letter prefixes definitely include some strange words.\n", + "**Question**: How many bridges are there? How many excess letters do they have? How many steps?\n", "\n", - "The list of two-letter suffixes is mostly pointing out flaws in the word list. For example, lots of words end in `ab`: blab, cab, jab, lab, etc. But must of them are subwords (of blabs, cabs, jabs, labs, etc.); only `skylab` made it into the word list in singular form but not plural." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Comparison to Tom Murphy's Program\n", - "\n", - "To compare my [program](portman.py) to [Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): I used a greedy approach that incrementally builds up a single long portmanteau, extending it via a bridge when necessary. Murphy first built a pool of smaller portmanteaux, then joined them all together. I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb) where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments. The two approaches are different, but it is not clear whether one is better than the other. You could try it!\n", - "\n", - "(*English trivia:* my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a **bridge**. Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. The different metaphors for what our programs are doing lead to different terminology for the same idea.)\n", - "\n", - "In terms of implementation, mine is in Python and is concise (139 lines); Murphy's is in C++ and is verbose (1867 lines), although Murphy's code does a lot of extra work that mine doesn't: generating diagrams and animations, and running multiple threads in parallel to implement the random restart idea. \n", - "\n", - "It appears Murphy didn't quite have the complete concept of **subwords**. He did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'` in two different places. My guess is that when he adds `'spectaculars'` he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. Support for this view is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs\", but I computed that there are 44,320 such subwords; he found about half of them. 
I think those missing 20,001 words are the main reason why my strings are coming in at around 554,000 letters, about 57,000 letters shorter than Murphy's 611,820 letters.\n", - "\n", - "Also, Murphy's joins are always between one-letter prefixes and suffixes. I do the same thing for two-word bridges, because having a `W.bridges[A][B]` for every letter `A` and `B` is the easiest way to prove that the program will terminate. But for one-word bridges, I allow prefixes and suffixes of any length up to a total of 6 for `len(pre) + len(suf)`. I can get away with this because I limited my candidate pool to the 10,000 `shortwords`. It would have been untenable to build all bridges for all 100,000 words, and probably would not have helped shorten $S$ appreciably.\n", - "\n", - "*Note 2:* I should say that I stole one important trick from Murphy. I started watching his highly-entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI), but then I paused it because I wanted the fun of solving the problem mostly on my own. After I finished the first version of my program, I returned to the video and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that I had a problem in my use of bridges. My program originally looked something like this: \n", - "\n", - " (overlap, word) = unused_step(...) or one_word_bridge(...) or two_word_bridge(...)\n", - " \n", - "That is, I only considered two-word bridges when there was no one-word bridge, on the theory that one word is shorter than two. But Murphy showed that my theory was wrong: I had `bridges['w']['c'] = 'workaholic'`, a one-word bridge, but he had the two-word bridge `'war' + 'arc' = 'warc'`, which saves six letters over my single word. After seeing that, I shamelessly copied his approach, and now I too get a four-letter bridge for `'w' + 'c'` (sometimes `'warc'` and sometimes `'we' + 'etc' = 'wetc'`)." + "Note this is across all the bridges that exist, without regard for whether (or how often) they are used in `P`." 
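+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*An aside on reading the outputs below:* bridges print as `Bridge(excess=..., steps=[Step(overlap=..., word=...)])`. Here is a minimal sketch of those record shapes, assuming `namedtuple`s with the field names shown in the printed reprs (the actual definitions appear earlier in the notebook):\n", + "\n", + "    from collections import namedtuple\n", + "\n", + "    Step = namedtuple('Step', 'overlap, word')      # overlap: letters shared with the preceding word\n", + "    Bridge = namedtuple('Bridge', 'excess, steps')  # excess: extra letters added between the bridged suffix and prefix\n", + "\n", + "For example, the `'w'`-to-`'c'` bridge `'we' + 'etc' = 'wetc'` prints as `Bridge(excess=2, steps=[Step(overlap=1, word='we'), Step(overlap=1, word='etc')])`."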
] }, { @@ -1531,7 +1343,7 @@ { "data": { "text/plain": [ - "(2, (1, 'war'), (2, 'arc'))" + "56430" ] }, "execution_count": 38, @@ -1539,6 +1351,368 @@ "output_type": "execute_result" } ], + "source": [ + "# A list of all bridges\n", + "B = sorted((W.bridges[suf][pre], pre, suf) \n", + " for suf in W.bridges for pre in W.bridges[suf])\n", + "len(B)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Counter({0: 37172, 1: 16689, 2: 2417, 3: 92, 4: 32, 5: 21, 6: 6, 8: 1})" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Counter of bridge excess letters\n", + "Counter(b.excess for (b, *_) in B)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.39121034910508595" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Average excess\n", + "mean(_.elements())" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Counter({1: 56280, 2: 150})" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Number of steps\n", + "Counter(len(b.steps) for (b, *_) in B)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I'm surprised there are only 150 two-step bridges.\n", + "\n", + "**Question**: What do some bridges look like?" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[(Bridge(excess=0, steps=[Step(overlap=1, word='aahed')]), 'ahed', 'a'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='bra')]), 'ra', 'b'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='darn')]), 'arn', 'd'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='fines')]), 'ines', 'f'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='heman')]), 'eman', 'h'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='left')]), 'eft', 'l'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='newel')]), 'ewel', 'n'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='puffy')]), 'uffy', 'p'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='shoes')]), 'hoes', 's'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='thug')]), 'hug', 't'),\n", + " (Bridge(excess=0, steps=[Step(overlap=1, word='winy')]), 'iny', 'w'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='bests')]), 'sts', 'be'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='cored')]), 'red', 'co'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='etc')]), 'c', 'et'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='grimy')]), 'imy', 'gr'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='kilns')]), 'lns', 'ki'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='molto')]), 'lto', 'mo'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='plaid')]), 'aid', 'pl'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='scabs')]), 'abs', 'sc'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='swop')]), 'op', 'sw'),\n", + " (Bridge(excess=0, steps=[Step(overlap=2, word='waits')]), 'its', 'wa'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='baby')]), 'y', 'bab'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='clink')]), 'nk', 'cli'),\n", + " (Bridge(excess=0, 
steps=[Step(overlap=3, word='educt')]), 'ct', 'edu'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='gooky')]), 'ky', 'goo'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='khaki')]), 'ki', 'kha'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='moods')]), 'ds', 'moo'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='polis')]), 'is', 'pol'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='sepoy')]), 'oy', 'sep'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='tempi')]), 'pi', 'tem'),\n", + " (Bridge(excess=0, steps=[Step(overlap=3, word='widow')]), 'ow', 'wid'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='bored')]), 'd', 'bore'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='dregs')]), 's', 'dreg'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='haste')]), 'e', 'hast'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='meaty')]), 'y', 'meat'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='quick')]), 'k', 'quic'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='stave')]), 'e', 'stav'),\n", + " (Bridge(excess=0, steps=[Step(overlap=4, word='woken')]), 'n', 'woke'),\n", + " (Bridge(excess=1, steps=[Step(overlap=1, word='cappy')]), 'ppy', 'c'),\n", + " (Bridge(excess=1, steps=[Step(overlap=1, word='fauve')]), 'uve', 'f'),\n", + " (Bridge(excess=1, steps=[Step(overlap=1, word='inked')]), 'ked', 'i'),\n", + " (Bridge(excess=1, steps=[Step(overlap=1, word='mopes')]), 'pes', 'm'),\n", + " (Bridge(excess=1, steps=[Step(overlap=1, word='reeks')]), 'eks', 'r'),\n", + " (Bridge(excess=1, steps=[Step(overlap=1, word='town')]), 'wn', 't'),\n", + " (Bridge(excess=1, steps=[Step(overlap=2, word='angel')]), 'el', 'an'),\n", + " (Bridge(excess=1, steps=[Step(overlap=2, word='diary')]), 'ry', 'di'),\n", + " (Bridge(excess=1, steps=[Step(overlap=2, word='henry')]), 'ry', 'he'),\n", + " (Bridge(excess=1, steps=[Step(overlap=2, word='nerd')]), 'd', 'ne'),\n", + " (Bridge(excess=1, steps=[Step(overlap=2, word='shags')]), 'gs', 'sh'),\n", + " (Bridge(excess=1, steps=[Step(overlap=2, word='vodka')]), 'ka', 'vo'),\n", + " (Bridge(excess=1, steps=[Step(overlap=3, word='cered')]), 'd', 'cer'),\n", + " (Bridge(excess=1, steps=[Step(overlap=3, word='grace')]), 'e', 'gra'),\n", + " (Bridge(excess=1, steps=[Step(overlap=3, word='nicks')]), 's', 'nic'),\n", + " (Bridge(excess=1, steps=[Step(overlap=3, word='sonar')]), 'r', 'son'),\n", + " (Bridge(excess=2, steps=[Step(overlap=1, word='bulk')]), 'k', 'b'),\n", + " (Bridge(excess=2, steps=[Step(overlap=1, word='tachs')]), 'hs', 't'),\n", + " (Bridge(excess=2, steps=[Step(overlap=2, word='relax')]), 'x', 're')]" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "B[::1000] # Sample every 1000th bridge" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Question**: What strange letter combinations are there? \n", + "\n", + "Let's look at two-letter suffixes or prefixes that only appear in one or two nonsubwords. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'aj': {'ajar'},\n", + " 'ay': {'ayahs', 'ayatollahs'},\n", + " 'bw': {'bwanas'},\n", + " 'ct': {'ctrl'},\n", + " 'dn': {'dnieper'},\n", + " 'dv': {'dvorak'},\n", + " 'ee': {'eelgrasses', 'eelworm'},\n", + " 'ek': {'ekistics'},\n", + " 'ez': {'ezekiel'},\n", + " 'fb': {'fbi'},\n", + " 'fj': {'fjords'},\n", + " 'gj': {'gjetosts'},\n", + " 'gw': {'gweducks', 'gweducs'},\n", + " 'hd': {'hdqrs'},\n", + " 'ie': {'ieee'},\n", + " 'if': {'iffiness'},\n", + " 'ik': {'ikebanas', 'ikons'},\n", + " 'ip': {'ipecacs'},\n", + " 'iv': {'ivories', 'ivory'},\n", + " 'jn': {'jnanas'},\n", + " 'kw': {'kwachas', 'kwashiorkor'},\n", + " 'll': {'llamas', 'llanos'},\n", + " 'mc': {'mcdonald'},\n", + " 'oj': {'ojibwas'},\n", + " 'pf': {'pfennigs'},\n", + " 'qa': {'qaids', 'qatar'},\n", + " 'qo': {'qophs'},\n", + " 'sf': {'sforzatos'},\n", + " 'tc': {'tchaikovsky'},\n", + " 'uf': {'ufos'},\n", + " 'wu': {'wurzel'},\n", + " 'xi': {'xiphoids', 'xiphosuran'},\n", + " 'xm': {'xmases'},\n", + " 'yc': {'ycleped', 'yclept'},\n", + " 'ym': {'ymca'},\n", + " 'zl': {'zlotys'},\n", + " 'zw': {'zwiebacks'}}" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{pre: W.startswith[pre] # Rare two-letter prefixes\n", + " for pre in sorted(W.startswith) \n", + " if len(pre) == 2 and 1 <= len(W.startswith[pre]) <= 2}" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'aa': {'markkaa'},\n", + " 'ab': {'skylab'},\n", + " 'ai': {'bonsai'},\n", + " 'ao': {'chiao', 'ciao'},\n", + " 'bm': {'ibm', 'icbm'},\n", + " 'cd': {'recd'},\n", + " 'dn': {'haydn'},\n", + " 'dt': {'rembrandt'},\n", + " 'ec': {'filespec', 'quebec'},\n", + " 'ef': {'unicef'},\n", + " 'eh': {'mikveh', 'yahweh'},\n", + " 'ep': {'asleep', 'shlep'},\n", + " 'fa': {'khalifa'},\n", + " 'gm': {'apophthegm'},\n", + " 'gn': {'champaign'},\n", + " 'hm': {'microhm'},\n", + " 'hn': {'mendelssohn'},\n", + " 'ho': {'groucho'},\n", + " 'hr': {'kieselguhr'},\n", + " 'hu': {'buchu'},\n", + " 'ji': {'fiji'},\n", + " 'ko': {'gingko', 'stinko'},\n", + " 'ku': {'haiku'},\n", + " 'lm': {'stockholm', 'unhelm'},\n", + " 'ln': {'lincoln'},\n", + " 'lu': {'honolulu'},\n", + " 'mb': {'clomb', 'whitecomb'},\n", + " 'mp': {'prestamp'},\n", + " 'mt': {'daydreamt', 'undreamt'},\n", + " 'nc': {'dezinc', 'quidnunc'},\n", + " 'nu': {'vishnu'},\n", + " 'nx': {'bronx', 'meninx'},\n", + " 'nz': {'franz'},\n", + " 'ob': {'blowjob'},\n", + " 'oe': {'monroe'},\n", + " 'oi': {'hanoi', 'polloi'},\n", + " 'ou': {'thankyou'},\n", + " 'oz': {'kolkhoz'},\n", + " 'pa': {'tampa'},\n", + " 'po': {'troppo'},\n", + " 'rb': {'cowherb'},\n", + " 'rf': {'waldorf', 'windsurf'},\n", + " 'ru': {'nehru'},\n", + " 'sr': {'ussr'},\n", + " 'su': {'shiatsu'},\n", + " 'sz': {'grosz'},\n", + " 'td': {'retd'},\n", + " 'tl': {'peyotl', 'shtetl'},\n", + " 'tu': {'impromptu'},\n", + " 'ua': {'joshua'},\n", + " 'ub': {'beelzebub'},\n", + " 'uc': {'caoutchouc'},\n", + " 'ud': {'aloud', 'overproud'},\n", + " 'ug': {'bedrug', 'sparkplug'},\n", + " 'ui': {'maqui', 'prosequi'},\n", + " 'vo': {'concavo'},\n", + " 'vt': {'govt'},\n", + " 'wa': {'kiowa', 'okinawa'},\n", + " 'we': {'zimbabwe'},\n", + " 'xe': {'deluxe', 'maxixe'},\n", + " 'xo': {'convexo'},\n", + " 'xs': {'duplexs'},\n", + " 'yx': {'styx'},\n", + " 'za': {'organza'},\n", + " 'zm': 
{'transcendentalizm'},\n", + " 'zo': {'diazo', 'palazzo'},\n", + " 'zt': {'liszt'}}" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "endswith = multimap((w[-2:], w) for w in words)\n", + "\n", + "{suf: endswith[suf] # Rare two-letter suffixes\n", + " for suf in sorted(endswith) \n", + " if 1 <= len(endswith[suf]) <= 2}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The two-letter prefixes definitely include some strange words.\n", + "\n", + "The list of two-letter suffixes is mostly pointing out flaws in the word list. For example, lots of words end in `ab`: blab, cab, dab, gab, etc. But most of them are subwords (of blabs, cabs, dabs, gabs, etc.); only `skylab` made it into the word list in singular form but not plural." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Comparison to Tom Murphy's Program\n", + "\n", + "To compare my program to [Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): I used a greedy approach that incrementally builds up a single long portmanteau, extending it via a bridge when necessary. Murphy first built a pool of smaller portmanteaux, then joined them all together. I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb) where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments. The two approaches are different, but it is not clear whether one is definitively better than the other (either for TSP or for portmantout). You could try it!\n", + "\n", + "(*English trivia:* my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a **bridge**. Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. The different metaphors for what our programs are doing lead to different terminology for the same idea.)\n", + "\n", + "In terms of implementation, mine is in Python and written at a high level of abstraction; Murphy's is in C++ and is at a lower level. That's one reason why his code is 10 times longer; another reason is that he does a lot of cool extra work with generating diagrams and animations. \n", + "\n", + "It appears Murphy didn't quite have the complete concept of **subwords**. He did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'` in two different places. My guess is that when he adds `'spectaculars'` he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. Support for this guess is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs\"; there are 44,320 such subwords, and he found about half of them. I think those missing 20,001 words are the main reason why my $S$ strings are coming in at around 554,000 letters, about 57,000 letters shorter than Murphy's best case of 611,820 letters.\n", + "\n", + "Also, Murphy's *joins* are always between one-letter prefixes and suffixes. 
I do the same thing for two-word bridges, because having a `W.bridges[A][B]` for every letter `A` and `B` is the easiest way to prove that the program will terminate. But for one-word bridges, I allow prefixes and suffixes of any length up to a total of 6 for `len(pre) + len(suf)`. I can get away with this because I limited my candidate pool to the 10,000 `shortwords`. It would have been untenable to build all bridges of all lengths for all 100,000 words, and probably would not have helped shorten $S$ appreciably.\n", + "\n", + "I should say that I stole a trick from Murphy. I started watching his highly-entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI), but then I paused it because I wanted the fun of solving the problem mostly on my own. After I finished the first version of my program, I returned to the video and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that I had a problem in my use of bridges. My program originally looked something like this: \n", + "\n", + "    step = unused_step(...) or one_word_bridge(...) or two_word_bridge(...)\n", + " \n", + "That is, I only considered two-word bridges when there was no one-word bridge, on the theory that one word is shorter than two. But Murphy showed that my theory was wrong: I had `'workaholic'`, a one-word bridge, for `bridges['w']['c']`, but Murphy had the two-word bridge `'war' + 'arc' = 'warc'`, which saves six letters. After seeing that, I shamelessly copied his approach, and now I too get a four-letter `bridges['w']['c']`:" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Bridge(excess=2, steps=[Step(overlap=1, word='we'), Step(overlap=1, word='etc')])" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "W.bridges['w']['c']" ] }, @@ -1566,11 +1740,11 @@ " \n", "Here are some things you could do to make the program more robust:\n", "\n", - "- Write and run unit tests.\n", + "- Write and run unit tests. (I have some cells that informally test things, but no formal tests.)\n", "\n", - "- Find other word lists and try the program on them.\n", + "- Find other word lists, perhaps in other languages, and try the program on them.\n", "\n", - "- Consider what to do for a wordset that has missing bridges. You could try three-word bridges, you could allow the program to back up and remove a previously-placed word; you could allow the addition of words to the start as well as the end of `P`)." + "- Consider what to do for a wordset that has missing bridges. You could try three-word bridges; you could allow the program to back up and remove a previously-placed word; or you could allow the addition of words to the start as well as the end of `P`." ] } ], @@ -1590,7 +1764,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.7.7" } }, "nbformat": 4,