{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"right\" style=\"text-align: right\"><i>Peter Norvig<br>Dec 2018<br>Updated Jun 2020</i></div>\n",
    "\n",
    "# Portmantout Words\n",
    "\n",
    "A [***portmanteau***](https://en.wikipedia.org/wiki/Portmanteau) is a word that squishes together two words, like  ***mathlete*** = ***math*** + ***athlete***.  Inspired by [**Darius Bacon**](http://wry.me/), I covered this as a programming exercise in my 2012 [**Udacity course**](https://www.udacity.com/course/design-of-computer-programs--cs212). In 2018 I was re-inspired by [**Tom Murphy VII**](http://tom7.org/), who added a new twist:  [***portmantout words***](http://www.cs.cmu.edu/~tom7/portmantout/) ([***tout***](https://www.duolingo.com/dictionary/French/tout/fd4dc453d9be9f32b7efe838ebc87599) from the French for *all*).\n",
    "\n",
    "\n",
    "Informally, **portmantout** means that you take all the words in a set and mush them together in some order such that each word overlaps the next. More formally, a partmantout of a set of words *W* is a string *S* constructed as follows:\n",
    "- Make an ordered list *L* that contains all the words in *W*.\n",
    "- It is allowable for a word to appear more than once in *L*.\n",
    "- Every word in *L* (except the first) must *overlap*: it must begin with a prefix that is the same as a suffix of the previous word.\n",
    "- The string *S* is formed by concatenating the letters in *L*, but using each overlap only once, not twice.\n",
    "- For example, for the set *W* = {world, hello, lowbrow}, we can use *L* = [hel<u>lo</u>, <u>lo</u>wbro<u>w</u>, <u>w</u>orld], giving us *S* = \"hellowbroworld\".\n",
    "\n",
    "\n",
    "This notebook develops a program that can find a portmantout *S* for a set *W* of over 100,000 words in a few seconds. The program attempts to minimize the length of *S*, but does not guarantee that it is the shortest possible. Along the way it found these interesting portmanteaux:\n",
    "\n",
    "\n",
    "- **preferendumdums** [prefer, referendum, dumdums]: agreeable  uninformed voters.\n",
    "- **fortyphonshore** [forty, typhons, onshore]: a dire weather report. \n",
    "- **allegestionstage** [alleges, egestions, onstage]: a brutal theatre critique.\n",
    "- **skymanipulablearsplittingler** [skyman, manipulable, blears, earsplitting, tinglers]: a nerve-damaging aviator.\n",
    "- **edinburgherselflesslylyricize** [edinburgh, burghers, herself, selflessly, slyly, lyricize]: a Scottish music review.\n",
    "- **impromptutankhamenability** [impromptu, tutankhamen, amenability]: willingness to see the Egyptian exhibit on the spur of the moment.\n",
    "- **dashikimonogrammarianarchy** [dashiki, kimono, monogram, grammarian, anarchy]: the chaos that ensues when a linguist gets involved in choosing how to enscribe African/Japanese garb. "
   ]
  },
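  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The construction of *S* from *L* can be sketched in a few lines. This is only an illustration, not the program developed below: the helper names `longest_overlap` and `squish` are hypothetical, and taking the *longest* overlap between neighbors is an assumption (the definition allows any non-empty overlap).\n",
    "\n",
    "```python\n",
    "def longest_overlap(left: str, right: str) -> int:\n",
    "    # Length of the longest suffix of `left` that is also a prefix of `right` (0 if none)\n",
    "    for n in range(min(len(left), len(right)), 0, -1):\n",
    "        if left.endswith(right[:n]):\n",
    "            return n\n",
    "    return 0\n",
    "\n",
    "def squish(L: list) -> str:\n",
    "    # Concatenate the words in L, writing each overlap only once\n",
    "    S = L[0]\n",
    "    for prev, word in zip(L, L[1:]):\n",
    "        n = longest_overlap(prev, word)\n",
    "        assert n > 0, f'{word!r} does not overlap {prev!r}'\n",
    "        S += word[n:]\n",
    "    return S\n",
    "\n",
    "print(squish(['hello', 'lowbrow', 'world']))  # hellowbroworld\n",
    "```\n",
    "\n",
    "Reordering the list as `['hello', 'world', 'lowbrow']` would trip the assertion, because `world` does not begin with any suffix of `hello`."
   ]
  },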
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Problem-Solving Strategy\n",
    "\n",
    "I originally thought I would define a major function, `portman`, to generate the portmantout string *S* from the set of words *W*, and a minor function, `is_portman`, to verify the result:\n",
    "\n",
    "    portman(W: Wordset) -> str              # Compute the string S from the set of words W\n",
    "    is_portman(S: str, W: Wordset) -> bool  # Verify that S is a valid portmantout covering W\n",
    "\n",
    "But then I realized that verification would be difficult. For example,  *S* =  `'helloworld'` would be rejected as non-overlapping if parsed as `'hello'` + `'world'`, but accepted if parsed as `'hello'` + `'low'` + `'world'`. It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended, as it built up the list of words, *L*. \n",
    "\n",
    "To make everything explicit (and more efficient), I will construct not just a **list** of words *L*, but rather what I'll call a **path**, where each **step** in the path *P* consists of a word from *W* and an integer that specifies the number of characters in the overlap with the previous word. With that in mind, I decided on the following calling and [naming](https://en.wikipedia.org/wiki/Natalie_Portman) conventions:\n",
    "\n",
    "    natalie(W: Wordset) -> Path             # Find a portmantout path P for a set of words W\n",
    "    portman(P: Path) -> str                 # Compute the string S represented by the path P\n",
    "    is_portman(P: Path, W: Wordset) -> bool # Verify that P is a valid path covering W\n",
    "\n",
    "Thus I can generate a portmantout string *S* with:\n",
    "\n",
    "    S = portman(natalie(W))\n",
    "    \n",
    "I distinguish two types of steps:\n",
    "\n",
    "- **Unused word step**: using a word from *W* for the first time. The unused word must have a prefix that matches the suffix of the previous word. \n",
    "- **Bridging word step**: if no unused word overlaps the previous word, we need to do something to get back on track. I call that something a **bridge**: a step that repeats a previously-used word in order to provide a new suffix that will match the prefix of some unused word.  Sometimes there is no such single word, and multiple words are required to build a bridge to an unused word. (On our large word set we never need more than a two-word bridge.)\n",
    "\n",
    "My intuition is that finding a shortest *S* is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find a guaranteed shortest possible string in a reasonable amount of time. A common approach to NP-hard problems is a **greedy algorithm**: make the locally best choice at each step, in the hope that the steps will fit together into a solution that is not too far from the best solution. At each turn I will choose the step that minimizes the number of **excess letters** added to the path, and will never undo a step. Here is the exact definition of the metric we are trying to minimize:\n",
    "\n",
    "- **Excess letters**: the number of letters that a step adds, relative to a baseline model in which all the words are concatenated with no repeated words and no overlap between them. (That's not a valid solution, but it is useful as a benchmark.) So if a step adds an unused word, and it woverlaps with the previous word by three letters, that is an excess of -3: I've saved three letters over just concatenating the unused word. (Note that a negative excess is a positive thing.) For a bridging word step, the excess is the number of letters that do not overlap either the previous word or the next word. So all unused word steps have negative excess and all bridging steps are non-negative. (Therefore if there is an unused word step, we don't need to consider a bridging step.)\n",
    "\n",
    "**Examples:** In each row of the table below, `'ajar'` is the previous word,  but each row makes different assumptions about what unused words remain, and thus we get different choices for the step to take. The table shows the overlapping letters between the previous word and the step, and in the case of bridges, it shows one of the  possible unused words that the step is bridging to. The final column shows the excess letter score (and actual letters).\n",
    "\n",
    "|Previous|Step(s)|Overlap|Bridge to|Type of Step(s)|Excess|\n",
    "|--------|----|----|---|---|---|\n",
    "| ajar|jarring|jar||*unused word* |-3| \n",
    "| ajar|arbitrary|ar||*unused word* |-2|\n",
    "| ajar|rabbits|r||*unused word*|-1|\n",
    "| ajar|argot|ar|goths|*one-step bridge* |0|\n",
    "| ajar|arrow|ar|owlets| *one-step bridge*|1 (r)|\n",
    "| ajar|rani, iraq|r|quizzed| *two-step bridge*|5 (anira) | \n",
    "\n",
    "Let's go over the examples:\n",
    "- **jarring**: Here we assume `jarring` is an unused word. It overlaps with `ajar` by 3 letters, giving it an excess cost of -3.\n",
    "- **arbitrary** and **rabbits**: unused word that overlap by fewer than 3 letters, so would only be chosen if there were no unused words with more overlap.\n",
    "- **argot** and **arrow**: One-step bridges; a bridge with the least excess (non-overlapping letters) would be chosen.\n",
    "- **rani, iraq**: a two-step bridge. Suppose `quizzed` is the only remaining unused word. There is no single word that bridges from any suffix of `ajar` to any prefix of `quizzed`. But `rani` can bridge from `'r'` to `'i'` and `iraq` can bridge from `'i'` to `'q'`. This two-word bridge has an excess score of 5 due to the letters `anira` not overlapping anything.\n",
    "\n",
    "One optimization is to recognize that some words are **subwords** of other words. For example, `jar` is a subword of `ajar`. So anytime we place `ajar` into a step, we've also automatically placed `jar` there. We can save computation time by initializing the set of unused words to be the **nonsubwords** in *W*. A subword will never be added in an unused word step, but it may be used in a bridging word step. \n",
    "\n",
    "# Data Type Implementation\n",
    "\n",
    "Here I describe how to implement the main data types in Python:\n",
    "\n",
    "- **Word**: a Python `str` (as are subparts of words, like suffixes or individual letters).\n",
    "- **Wordset**: a subclass of `set`, denoting a set of words, plus some cached attributes to be explained later.\n",
    "- **Path**: a Python `list` of steps.\n",
    "- **Step**: a named tuple of an overlap and a word. Following `ajar` with `jarring`  is `Step(3, 'jarring')`. \n",
    "- **Bridge**: a named tuple of an excess cost followed by a list of one or two steps, e.g. `Bridge(1, [Step(2, 'arrow')])`.\n",
    "- **Bridges**: a table mapping a prefix and a suffix to a bridge. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict, Counter, namedtuple\n",
    "\n",
    "Word    = str\n",
    "Step    = namedtuple('Step', 'overlap, word')\n",
    "Bridge  = namedtuple('Bridge', 'excess, steps')\n",
    "Path    = list[Step]\n",
    "Bridges = dict[str, dict[str, Bridge]] # bridges[prefix][suffix] = Bridge(...)\n",
    "\n",
    "class Wordset(set): \n",
    "    \"\"\"A set of words, with slots to hold some cached information.\"\"\"\n",
    "    __slots__ = ('subwords', 'short_words', 'bridges', 'unused_words', 'unused_startswith')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# portman and is_portman\n",
    "\n",
    "Here we define the functions `portman` and `is_portman`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def portman(P: Path) -> Word:\n",
    "    \"\"\"Compute the portmantout string S from the path P.\"\"\"\n",
    "    # Concatenate the non-overlapping part of each step\n",
    "    return ''.join(word[overlap:] for (overlap, word) in P)\n",
    "\n",
    "def is_portman(P: Path, W: Wordset) -> bool:\n",
    "    \"\"\"Is the Path P a portmantout of the Wordset W?\"\"\"\n",
    "    S = portman(P)\n",
    "    return (all(word in S for word in W) and         # 1. Every word in W is a substring of S\n",
    "            all(step.overlap > 0 for step in P[1:])) # 2. Every word (except first) overlaps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny example wordset `W1`, path `P1`, and string `S1`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'dashikimonogrammarianarchy'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "W1 = Wordset({'anarchy', 'dashiki', 'grammarian', 'kimono', 'monogram',\n",
    "              'a', 'am', 'an', 'arc', 'arch', 'aria', 'as', 'ash', 'dash', 'gram', \n",
    "              'grammar', 'i', 'mar', 'maria', 'mono', 'narc', 'no', 'on', 'ram'})\n",
    "\n",
    "P1 = [Step(0, 'dashiki'),\n",
    "      Step(2,      'kimono'),\n",
    "      Step(4,        'monogram'),\n",
    "      Step(4,            'grammarian'),\n",
    "      Step(2,                    'anarchy')]\n",
    "\n",
    "S1 = portman(P1)\n",
    "S1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "is_portman(P1, W1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# natalie\n",
    "\n",
    "The function `natalie` does a greedy search for a portmantout path. As stated above, the approach is to start with a path of one word (either given as an optional argument or chosen arbitrarily from the word set *W*), and then repeatedly adds steps, each step being either an `unused_word_step` or a sequence of one or two `bridging_steps`. The call `initialize(W)` does some computation on the word set `W` such as precomputing the set of possible bridges and setting `W.unused_words` to be the set of nonsubwords.\n",
    "\n",
    "The function `unused_word_step` returns zero or one step, and `bridging_steps` returns one or two steps; I decided the simplest interface is to have them both return a list of steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def natalie(W: Wordset, start=None) -> Path:\n",
    "    \"\"\"Return a portmantout path containing all words in W. You can optionally give the start word.\"\"\"\n",
    "    initialize(W)\n",
    "    first_word = start or first(W.unused_words)\n",
    "    P = add_step([], W, Step(0, first_word))\n",
    "    while W.unused_words:\n",
    "        prev = P[-1].word\n",
    "        steps = unused_word_step(W, prev) or bridging_steps(W, prev)\n",
    "        for step in steps:\n",
    "            P = add_step(P, W, step)\n",
    "    return P"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Steps: unused_word_step and bridging_steps\n",
    "\n",
    "`unused_word_step` considers every suffix of the previous word, where the function `suffixes` is guaranteed to order the longest suffixes first. If a suffix starts an unused words, we choose it. Since we're going longest-suffix first, no other word choice could do better on the excess letters metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def unused_word_step(W: Wordset, prev_word: Word) -> list[Step]:\n",
    "    \"\"\"Return Step(overlap, unused_word) or None.\"\"\"\n",
    "    for suffix in suffixes(prev_word):\n",
    "        unused_word = first(W.unused_startswith.get(suffix, ()))\n",
    "        if unused_word:\n",
    "            return [Step(len(suffix), unused_word)]\n",
    "    return []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(*Python trivia:* in `unused_word_step` I do `W.unused_startswith.get(suf, ())`, not `W.unused_startswith[suf]`  because the dict in question is a `defaultdict`, and if there is no entry there, I don't want to insert a default entry.)\n",
    "\n",
    "`bridging_steps` also considers every suffix of the previous word, and for each one it looks in the `W.bridges[suf]` table (see below) to see what prefixes (of unused words) we can bridge to from this suffix. Out of all the bridges in `W.bridges[suf][pre]` entries, take one with the minimal excess cost, and return the steps that make up that bridge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bridging_steps(W: Wordset, prev_word: Word) -> list[Step]:\n",
    "    \"\"\"The steps from the minimal-excess bridge that bridges \n",
    "    from a suffix of prev_word to a prefix of any unused word.\"\"\"\n",
    "    bridges = [W.bridges[suf][pre] \n",
    "               for suf in suffixes(prev_word) if suf in W.bridges\n",
    "               for pre in W.bridges[suf] if W.unused_startswith[pre]]\n",
    "    return min(bridges).steps # Choose the bridge with minimal excess"
   ]
  },
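  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `defaultdict` point from the Python trivia note can be seen in isolation. This is a minimal sketch using a throwaway `table` (a hypothetical name, not part of the program) rather than the real `W.unused_startswith`:\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "table = defaultdict(set)\n",
    "table['som'].add('some')        # indexing creates the entry on demand\n",
    "\n",
    "found = table.get('xyz', ())    # .get never inserts a default entry\n",
    "assert found == () and 'xyz' not in table\n",
    "\n",
    "found = table['xyz']            # indexing a missing key inserts an empty set\n",
    "assert found == set() and 'xyz' in table\n",
    "```\n",
    "\n",
    "Lookups that should leave the table unchanged therefore go through `.get`, while code that is building the table can index freely."
   ]
  },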
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# initialize and add_step\n",
    "\n",
    "To make it efficient to find steps, the function  `initialize(W)` caches the following information on `W`:\n",
    "  - `W.unused_words`: initially the set of nonsubwords in `W`; when a word is used it is removed from the set.\n",
    "  - `W.unused_startswith`: a dict that maps from a prefix to all the unused words that start with the prefix. Updated when a word is used.\n",
    "      - Example: `W.unused_startswith['somet'] == {'something', 'sometimes'}`.\n",
    "  - `W.subwords`: a set of all the words that are contained within another word in `W`.\n",
    "  - `W.bridges`: a dict where `W.bridges[suf][pre]` gives the best bridge between the affixes.\n",
    "  - `W.short_words`: a set of short words used to build bridges. (See `build_bridges`.)\n",
    "\n",
    "  \n",
    "These structures are complicated, so don't be discouraged if you have to go over the code several times. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def initialize(W: Wordset) -> Wordset:\n",
    "    \"\"\"Precompute and cache data structures on attributes of W.\"\"\"\n",
    "    # Initialize .bridges and .subwords only once; they're unchanging\n",
    "    if not hasattr(W, 'bridges'): \n",
    "        W.bridges       = build_bridges(W)\n",
    "        W.subwords      = {subword for w in W for subword in subparts(w) & W}     \n",
    "    # Re-initialize .unused_words and .unused_startswith every time we want to generate a portmantout\n",
    "    # They are updated every time we add a step to a path\n",
    "    W.unused_words      = W - W.subwords   \n",
    "    W.unused_startswith = startswith_table(W.unused_words)\n",
    "\n",
    "    return W"
   ]
  },
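  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feel for the subword/nonsubword split that `initialize` computes, here is a small self-contained sketch. The helper `proper_substrings` and the five-word set are assumptions chosen for illustration; the set comprehension mirrors the one used for `W.subwords`.\n",
    "\n",
    "```python\n",
    "def proper_substrings(word: str) -> set:\n",
    "    # Every non-empty substring of word except word itself\n",
    "    return {word[i:j]\n",
    "            for i in range(len(word))\n",
    "            for j in range(i + 1, len(word) + 1)} - {word}\n",
    "\n",
    "words = {'ajar', 'jar', 'a', 'jarring', 'ring'}\n",
    "\n",
    "# A subword is any word that appears inside some other word of the set\n",
    "subwords    = {sub for w in words for sub in proper_substrings(w) & words}\n",
    "nonsubwords = words - subwords\n",
    "\n",
    "print(sorted(subwords))     # ['a', 'jar', 'ring']\n",
    "print(sorted(nonsubwords))  # ['ajar', 'jarring']\n",
    "```\n",
    "\n",
    "Only the two nonsubwords need their own unused word steps; placing them covers the other three automatically."
   ]
  },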
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can add a step: append the step to the path we are building up, remove the word from the unused words, and remove the word from all the places where it is stored as an unused word with a given prefix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_step(P: Path, W: Wordset, step: Step) -> Path:\n",
    "    \"\"\"Add step to P; remove step's word from `W.unused_words` and `W.unused_startswith[pre] for each pre`.\"\"\"\n",
    "    P.append(step)\n",
    "    word = step.word\n",
    "    if word in W.unused_words: # Maintain W.unused_words and W.unused_startswith\n",
    "        W.unused_words.remove(word)\n",
    "        for pre in prefixes(word):\n",
    "            W.unused_startswith[pre].remove(word)\n",
    "            if not W.unused_startswith[pre]: # clean up\n",
    "                del W.unused_startswith[pre]\n",
    "    return P"
   ]
  },
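  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bookkeeping in `add_step` can be traced on a two-word table. This is a sketch under stated assumptions: `proper_prefixes` and the local `unused_startswith` are stand-ins that imitate `prefixes` and `W.unused_startswith`, so the block runs on its own.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "def proper_prefixes(word: str) -> list:\n",
    "    # All non-empty prefixes of word, excluding word itself\n",
    "    return [word[:i] for i in range(1, len(word))]\n",
    "\n",
    "# Build the prefix table for two unused words\n",
    "unused_startswith = defaultdict(set)\n",
    "for w in ['something', 'sometimes']:\n",
    "    for pre in proper_prefixes(w):\n",
    "        unused_startswith[pre].add(w)\n",
    "\n",
    "assert unused_startswith['somet'] == {'something', 'sometimes'}\n",
    "\n",
    "# Use up 'something': remove it under every prefix, deleting entries that become empty\n",
    "for pre in proper_prefixes('something'):\n",
    "    unused_startswith[pre].remove('something')\n",
    "    if not unused_startswith[pre]:\n",
    "        del unused_startswith[pre]\n",
    "\n",
    "assert unused_startswith['somet'] == {'sometimes'}\n",
    "assert 'someth' not in unused_startswith\n",
    "```\n",
    "\n",
    "Deleting the empty entries keeps the test *is there an unused word starting with this prefix?* equivalent to a simple key lookup."
   ]
  },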
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building Bridges\n",
    "\n",
    "The last major piece of the program is the construction of the `W.bridges` table. Recall that we want `W.bridges[suf][pre]` to be a bridge between a suffix of the previous word and a prefix of a nonsubword, as in the examples:\n",
    "\n",
    "\n",
    "      W.bridges['ar']['ow'] == Bridge(1, [Step(2, 'arrow')])\n",
    "      W.bridges['ar']['c']  == Bridge(0, [Step(2, 'arc')])\n",
    "      W.bridges['r']['q']   == Bridge(5, [Step(1, 'rani'), Step(1, 'iraq')])\n",
    "      \n",
    "We build all the bridges in `initialize`, and don't update them. Thus, `W.bridges['r']['q']` says \"if the previous word ends in `'r'` and there is an unused words starting with `'q'`, you can use this bridge.\" The caller is responsible for checking that `W.unused_startswith['q']` contains unused word(s).\n",
    "      \n",
    "Bridges should be short (they should have a small excess letter count). We don't need to consider `antidisestablishmentarianism` as a possible bridge word. Instead, from our 108,709  word set *W*, we'll define `W.short_words` to be a set of 11,330 words that are no more than a limit of 5 letters long, except that we add one letter to the limit if the first letter is a rare one, and add one more if the last letter is a rare one.  I also compute a `short_startswith` table for the `short_words`, where, for example,\n",
    "\n",
    "     short_startswith['som'] == {'soma', 'somas', 'some'} # but not 'somebodies', 'something', ...\n",
    "\n",
    "To build one-word bridges, consider every short word, and split it up in all possible ways into a prefix that will overlap the previous word, a suffix that will overlap the next word, and a count of zero or more excess letters in the middle that don't overlap anything. The function `splits` helps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def splits(word: Word) -> list[tuple[str, int, str]]: \n",
    "    \"\"\"Ways to split up word, as a list of (prefix, excess, suffix) tuples.\"\"\"\n",
    "    return [(word[:i], excess, word[i+excess:])\n",
    "            for excess in range(len(word) - 1)\n",
    "            for i in range(1, len(word) - excess)]\n",
    "\n",
    "def is_short_word(word: Word, maxlen=5, rare_start='jkqxyz', rare_end='bfikopquvwxz') -> bool: \n",
    "    \"\"\"Is this a short word, suitable for use in bridges?\"\"\"\n",
    "    return len(word) <= maxlen + int(word[0] in rare_start) + int(word[-1] in rare_end)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('a', 0, 'rrow'),\n",
       " ('ar', 0, 'row'),\n",
       " ('arr', 0, 'ow'),\n",
       " ('arro', 0, 'w'),\n",
       " ('a', 1, 'row'),\n",
       " ('ar', 1, 'ow'),\n",
       " ('arr', 1, 'w'),\n",
       " ('a', 2, 'ow'),\n",
       " ('ar', 2, 'w'),\n",
       " ('a', 3, 'w')]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "splits('arrow')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first element of the list says that `'arrow'` can bridge from `'a'` to `'rrow'` with 0 excess letters; the last says it can bridge from `'a'` to `'w'` with 3 excess letters (which happen to be `'rro'`). \n",
    "\n",
    "Each possible split is passed on to `build_bridge`, which records a bridge in the table under `bridges[pre][suf]` unless there is already ridge stored there that has a smaller ecess letter count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_bridge(bridges: Bridges, word: Word, pre: str, excess: int, suf: str, step2=None):\n",
    "    \"\"\"Store a new bridge (of 1 or 2 steps) if it has less excess than the previous bridges[pre][suf].\"\"\"\n",
    "    if suf not in bridges[pre] or excess < bridges[pre][suf].excess:\n",
    "        steps = [Step(len(pre), word)]\n",
    "        if step2: steps.append(step2)\n",
    "        bridges[pre][suf] = Bridge(excess, steps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is an unfortunate fact that if we only allow one-word bridges, the algorithm can get stuck in a dead end. But if we allow all possible two-word bridges the algorithm would be slow, bogged down with all those bridges to check. Thus, I decided to only use two-word bridges that bridge from a single last letter in the previous word to a single first letter in a following word. If we can construct a bridge from every single letter to every other single letter, then we know the algorithm can never get stuck.\n",
    "\n",
    "Here's the complete `build_bridges` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_bridges(W: Wordset):\n",
    "    \"\"\"A table of bridges[pre][suf] == Bridge(excess, [Step(overlap, word)]), e.g.\n",
    "    bridges['ar']['c'] == Bridge(0, [Step(2, 'arc')]).\n",
    "    bridges['d']['z']  == Bridge(3, [Step(1, 'do'), Step(1, 'oyez')])\"\"\"\n",
    "    W.short_words    = set(filter(is_short_word, W))\n",
    "    short_startswith = startswith_table(W.short_words)\n",
    "    bridges          = defaultdict(dict)\n",
    "    # One-word bridges: every way to split up every short word into suffix/excess/prefix\n",
    "    for word in W.short_words: \n",
    "        for split in splits(word):\n",
    "            build_bridge(bridges, word, *split)\n",
    "    # Two-word bridges: only bridge from a single letter to a single letter\n",
    "    for word1 in W.short_words:\n",
    "        for suf in suffixes(word1): \n",
    "            for word2 in short_startswith[suf]: \n",
    "                excess = len(word1) + len(word2) - len(suf) - 2\n",
    "                A, B = word1[0], word2[-1] # First and last letters\n",
    "                if A != B: # No sense bridging from A to A\n",
    "                    step2 = Step(len(suf), word2)\n",
    "                    build_bridge(bridges, word1, A, excess, B, step2)\n",
    "    return bridges"
   ]
  },
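  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The excess counts quoted for the example bridges can be checked with a little arithmetic. The two helpers below are hypothetical (they are not part of the program); they simply restate the counting rule used by `splits` for one-word bridges and the formula used in `build_bridges` for two-word bridges.\n",
    "\n",
    "```python\n",
    "def one_word_excess(word: str, pre: str, suf: str) -> int:\n",
    "    # Letters of word covered by neither the overlap with the previous word (pre)\n",
    "    # nor the overlap with the next word (suf)\n",
    "    assert word.startswith(pre) and word.endswith(suf)\n",
    "    return len(word) - len(pre) - len(suf)\n",
    "\n",
    "def two_word_excess(word1: str, word2: str, shared: str) -> int:\n",
    "    # word1 ends with `shared` and word2 starts with it; the bridge overlaps its\n",
    "    # neighbors by exactly one letter at each end, hence the final - 2\n",
    "    assert word1.endswith(shared) and word2.startswith(shared)\n",
    "    return len(word1) + len(word2) - len(shared) - 2\n",
    "\n",
    "print(one_word_excess('arrow', 'ar', 'ow'))   # 1\n",
    "print(one_word_excess('arc', 'ar', 'c'))      # 0\n",
    "print(two_word_excess('rani', 'iraq', 'i'))   # 5\n",
    "print(two_word_excess('do', 'oyez', 'o'))     # 3\n",
    "```\n",
    "\n",
    "For `rani` + `iraq` the merged bridge is `raniraq`; dropping the first and last letters, which overlap the neighboring words, leaves the five excess letters `anira`."
   ]
  },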
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Utility Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def multimap(pairs) -> dict[object, set]:\n",
    "    \"\"\"Given (key, val) pairs, make a dict of {key: {val,...}}.\"\"\"\n",
    "    result = defaultdict(set)\n",
    "    for key, val in pairs:\n",
    "        result[key].add(val)\n",
    "    return result\n",
    "\n",
    "def startswith_table(words) -> dict[str, set[Word]]: \n",
    "    \"\"\"A dict mapping a prefix to all the words it starts:\n",
    "    {'somet': {'something', 'sometimes'},...}.\"\"\"\n",
    "    return multimap((pre, w) for w in words for pre in prefixes(w))   \n",
    "    \n",
    "def suffixes(word: Word) -> list[str]:\n",
    "    \"\"\"All non-empty proper suffixes of word, longest first.\"\"\"\n",
    "    return [word[i:] for i in range(1, len(word))]\n",
    "\n",
    "def prefixes(word: Word) -> list[str]:\n",
    "    \"\"\"All non-empty proper prefixes of word.\"\"\"\n",
    "    return [word[:i] for i in range(1, len(word))]\n",
    "\n",
    "def subparts(word: Word) -> set[str]:\n",
    "    \"\"\"All non-empty proper substrings of word\"\"\"\n",
    "    return {word[i:j] \n",
    "            for i in range(len(word)) \n",
    "            for j in range(i + 1, len(word) + 1)} - {word}\n",
    "\n",
    "def first(iterable) -> object | None:\n",
    "    \"\"\"The first element in an iterable, or None\"\"\"\n",
    "    return next(iter(iterable), None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(*Math trivia:* In this context, \"proper\" means \"not whole\". A proper subset is a subset that is not the whole set itself; a proper substring of a word is a substring that is not the whole word itself.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# W: Tom Murphy's Wordset \n",
    "\n",
    "We can make Tom Murphy's 108,709 word file `\"wordlist.asc\"` into a `Wordset` called `W`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc\n",
    "\n",
    "W = Wordset(open('wordlist.asc').read().split()) \n",
    "assert len(W) == 108709"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a\n",
      "aahed\n",
      "aahing\n",
      "aardvark\n",
      "aardvarks\n",
      "aardwolf\n",
      "abaci\n",
      "aback\n",
      "abacus\n",
      "abacuses\n"
     ]
    }
   ],
   "source": [
    "!head wordlist.asc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Portmantout Solutions\n",
    "\n",
    "**Finally!** We're ready to make portmantouts. First for the tiny word set `W1`, for which we must carefully choose the starting word:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Step(overlap=0, word='dashiki'),\n",
       " Step(overlap=2, word='kimono'),\n",
       " Step(overlap=4, word='monogram'),\n",
       " Step(overlap=4, word='grammarian'),\n",
       " Step(overlap=2, word='anarchy')]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P1 = natalie(W1, start='dashiki')\n",
    "P1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'dashikimonogrammarianarchy'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "portman(P1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now for the big word set `W`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 5.45 s, sys: 34 ms, total: 5.49 s\n",
      "Wall time: 5.5 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "103069"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%time P = natalie(W)\n",
    "len(P)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I thought it might take several minutes, so 5 seconds to find a 100,000+ step path is great! Now to generate (but not print) the portmantout string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "553434"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "S = portman(P)\n",
    "len(S)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " The string is about a half-million letters long."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Failure is Not an Option\n",
    "\n",
    "Is `natalie` guaranteed to terminate with a solution in finite time? Every iteration either uses up an unused word, or builds a bridge to an unused word that will be used on the next iteration. So, eventually all the unused words will be used and `natalie` will return a solution. The only way this can fail is if we get stuck in a situation where there is no bridge to an unused word. I can prove that this can't happen if I can verify that there is a bridge from every one-letter suffix to every one-letter prefix. The function `missing_bridges` checks for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "alphabet     = 'abcdefghijklmnopqrstuvwxyz'\n",
    "letter_pairs = [A + B for A in alphabet for B in alphabet if A != B]\n",
    "\n",
    "def missing_bridges(W: Wordset) -> list[str]:\n",
    "    \"\"\"What 1-letter-suffix to 1-letter-prefix bridges are missing from W.bridges?\"\"\"\n",
    "    return [A + B for (A, B) in letter_pairs if B not in W.bridges[A]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! *W* has no missing bridges. But the tiny *W1* is missing 623 out of 26 × 25 = 650 1-to-1-letter bridges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "621"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(missing_bridges(initialize(W1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pretty Output\n",
    "\n",
    "Notice I haven't actually *looked* at the portmantout yet. I didn't want to dump half a million letters into an output cell. Instead, I'll define `report` to print various statistics, summarize the begin and end of the portmantout, and save the full string *S* into a file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def report(W: Wordset, P: Path, steps=100, letters=1000, save='natalie.txt'):\n",
    "    S       = portman(P)\n",
    "    sub     = W.subwords \n",
    "    nonsub  = W - sub\n",
    "    uniq    = {step.word for step in P} # unique step words in P\n",
    "    bridge  = len(P) - len(nonsub) # number of bridge steps in P\n",
    "    bridges = sum(len(W.bridges[pre]) for pre in W.bridges) # number of bridges in W\n",
    "    def L(words) -> int: return sum(map(len, words)) # Number of letters\n",
    "    print(f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords; {len(sub):,d} subwords).')\n",
    "    print(f'W has {bridges:,d} bridges from {len(W.short_words):,d} short_words, '\n",
    "          f'and {len(missing_bridges(W))} missing 1-to-1-letter bridges.')\n",
    "    print(f'P has {len(P):,d} steps ({len(uniq):,d} unique words; {bridge:,d} bridge words).')\n",
    "    print(f'P has an average overlap of {(L(s.word for s in P)-len(S))/(len(P)-1):.2f} letters.')\n",
    "    print(f'P (and thus S) is {\"\" if is_portman(P, W) else \"NOT \"}a valid portmantout of W.')\n",
    "    print(f'S has a compression ratio (letters(W)/letters(S)) of {L(W)/len(S):.2f}.')\n",
    "    print(f'S has {len(S):,d} letters; W has {L(W):,d}; nonsubs have {L(nonsub):,d}.')\n",
    "    if save: open(save, \"w\").write(S)\n",
    "    print(f'S saved as the file \"{save}\".')\n",
    "    print(f'\\nThe first and last {letters} letters:\\n\\n{S[:letters]}\\n...\\n{S[-letters:]}')\n",
    "    steps1 = ', '.join(w[:i] + '⋅' + w[i:] for i, w in P[:steps])\n",
    "    steps2 = ', '.join(w[:i] + '⋅' + w[i:] for i, w in P[-steps:])\n",
    "    print(f'\\nThe first and last {steps} steps:\\n\\n{steps1}\\n...\\n{steps2}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A step such as `Step(1, 'sir')` is printed as `s⋅ir` to indicate that `s` is the 1-letter overlap with the previous word.\n",
    "\n",
    "I will redefine `is_portman` to be faster. *Python trivia:* if `X, Y` and `Z` are sets, `X <= Y <= Z` means \"is `X` a subset of `Y` and `Y` a subset of `Z`?\" We use the notation here to say that the set of words in *P* must contain all the nonsubwords and can only contain words from *W*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_portman(P: Path, W: Wordset) -> str:\n",
    "    \"\"\"Verify that P forms a valid portmantout string for W.\"\"\"\n",
    "    uses_right_words = (W - W.subwords) <= set(step.word for step in P) <= W\n",
    "    overlaps_match   = all((overlap > 0 and P[i - 1].word[-overlap:] == word[:overlap])\n",
    "                           for i, (overlap, word) in enumerate(P[1:], 1))\n",
    "    return uses_right_words and overlaps_match and P[0].overlap == 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "W has 108,709 words (64,389 nonsubwords; 44,320 subwords).\n",
      "W has 68,178 bridges from 11,330 short_words, and 0 missing 1-to-1-letter bridges.\n",
      "P has 103,069 steps (65,059 unique words; 38,680 bridge words).\n",
      "P has an average overlap of 1.65 letters.\n",
      "P (and thus S) is a valid portmantout of W.\n",
      "S has a compression ratio (letters(W)/letters(S)) of 1.68.\n",
      "S has 553,434 letters; W has 931,823; nonsubs have 595,805.\n",
      "S saved as the file \"natalie.txt\".\n",
      "\n",
      "The first and last 1000 letters:\n",
      "\n",
      "demultiplexeskimosquitoeshoestringsidespinieruptionshorelessenedematadorsalsifyarestyledwardshipsidestrokestrelsewherefromagesturersatzestfulnessesamescalismithyselfhoodsociologicallylsympathizingynecologiestonianswerabilityphusesquicentenniallymphocyteslasherselfingerboardspeediestockshrivelledentatestimoniescalopediatricianswererstwhileditoriallynchingsubscriptediousnessayersagiestoppingstomachicallymphaticallymphoidiumlauteddiesestinessentiallyricallyratediumsupplementarythmiasmatavistsarismsalutarythmicroprogrammingyratoryxescaladeducedarwoodgrainingratiatingleditorializingeddieducativelocityfieducatorsoscillatestsunamicabilitiesteemediusefulnessayingsquawkershallowinglessonshipshapeablendedicationalongshoremendsinnedemasculinizedsourwoodeduciblessingsongstressestercesareanschlussreinforcesuraeriedisongbooksellersprawlierasmuscatspawstupefieductorsoesophagusheddeductediouslyerbaseborneonstageyserswartypecastingraysubliminglerskimpinessaysolicitorshipbuildingierotizingiestrayingraft\n",
      "...\n",
      "quotablyogiraquantifiesiraquarrelersiraquahogsiraquartilesiraquakieraniraquaisiraquoinedeairaquarriesiraquotidianoiraquoinsiraquartermastersiraqataraniraquantizedeairaqophsiraqintarsiraquantizesiraqueasyogiraquakedeairaquackieraniraquotersiraquarticsiraquagmiryogiraquakilyogiraquackyogiraquailsiraquarriersiraquarterlyogiraquarterbacksiraqueenedeairaquothaquarterliesiraquantaliraquaintlyogiraquackeryogiraquarantinablemiraquondamaxiraqurushairaquantedeairaquaintnessiraquoitsiraquotientsiraquarksiraquayagesiraquarantinesiraquarrymenoiraquacksalveraniraquarreledeairaquartsiraquarriedeairaquakiestuqueersiraquarterdecksiraqueriersiraquernsiraquartesiraquainteraniraquaverersiraqueasilyogiraquantityogiraquantitiesiraquarrellersiraquandaryogiraquaaludesiraquantitativelyogiraquaysidesiraquanticolloquaffedeairaquackismsiraquaggiestuquenchersiraqianaquagsiraquackishnessiraquackingeniiraquorumsiraquailedeairaquarrelledeairaquebecolloquantifiedeairaquartzitemiraquailingeniiraquantimeteraniraqueening\n",
      "\n",
      "The first and last 100 steps:\n",
      "\n",
      "⋅demultiplexes, es⋅kimos, mos⋅quitoes, toes⋅hoes, shoes⋅trings, rings⋅ides, sides⋅pin, spin⋅ier, er⋅uptions, ons⋅hore, shore⋅less, less⋅ened, ed⋅emata, mata⋅dors, dors⋅als, sals⋅ify, y⋅arest, rest⋅yled, ed⋅wards, wards⋅hips, ships⋅ide, side⋅strokes, kes⋅trels, els⋅ewhere, where⋅from, from⋅ages, ges⋅turers, ers⋅atzes, zes⋅tfulness, fulness⋅es, ses⋅ames, mes⋅calism, sm⋅ithy, thy⋅self, self⋅hoods, s⋅ociologically, ally⋅ls, s⋅ympathizing, zing⋅y, gy⋅necologies, logies⋅t, est⋅onians, ans⋅werability, ty⋅phuses, ses⋅quicentennially, ly⋅mphocytes, tes⋅las, slas⋅hers, hers⋅elf, self⋅ing, fing⋅erboards, s⋅peediest, diest⋅ocks, s⋅hrivelled, ed⋅entates, tes⋅timonies, es⋅caloped, ped⋅iatricians, ans⋅werers, ers⋅twhile, while⋅d, ed⋅itorially, ly⋅nchings, s⋅ubscripted, ted⋅iousness, ess⋅ayers, s⋅agiest, est⋅opping, topping⋅s, s⋅tomachically, ly⋅mphatically, ly⋅mphoid, oid⋅ium, um⋅lauted, ted⋅dies, dies⋅es, ses⋅tines, ines⋅sential, essential⋅ly, ly⋅rically, ly⋅rated, ted⋅iums, s⋅upplementary, ary⋅thmia, mia⋅smata, ata⋅vists, ts⋅arisms, s⋅alutary, ary⋅thmic, mic⋅roprogramming, ming⋅y, gy⋅ratory, ory⋅xes, es⋅caladed, ded⋅uced, ced⋅arwood, wood⋅graining, ing⋅ratiating, ating⋅le, tingle⋅d\n",
      "...\n",
      "s⋅ir, ir⋅aq, q⋅uarried, d⋅eair, ir⋅aq, q⋅uakiest, t⋅uque, que⋅ers, s⋅ir, ir⋅aq, q⋅uarterdecks, s⋅ir, ir⋅aq, q⋅ueriers, s⋅ir, ir⋅aq, q⋅uerns, s⋅ir, ir⋅aq, q⋅uartes, s⋅ir, ir⋅aq, q⋅uainter, r⋅ani, i⋅raq, q⋅uaverers, s⋅ir, ir⋅aq, q⋅ueasily, y⋅ogi, i⋅raq, q⋅uantity, y⋅ogi, i⋅raq, q⋅uantities, s⋅ir, ir⋅aq, q⋅uarrellers, s⋅ir, ir⋅aq, q⋅uandary, y⋅ogi, i⋅raq, q⋅uaaludes, s⋅ir, ir⋅aq, q⋅uantitatively, y⋅ogi, i⋅raq, q⋅uaysides, s⋅ir, ir⋅aq, q⋅uantic, c⋅olloq, q⋅uaffed, d⋅eair, ir⋅aq, q⋅uackisms, s⋅ir, ir⋅aq, q⋅uaggiest, t⋅uque, que⋅nchers, s⋅ir, ir⋅aq, q⋅iana, a⋅qua, qua⋅gs, s⋅ir, ir⋅aq, q⋅uackishness, s⋅ir, ir⋅aq, q⋅uacking, g⋅enii, i⋅raq, q⋅uorums, s⋅ir, ir⋅aq, q⋅uailed, d⋅eair, ir⋅aq, q⋅uarrelled, d⋅eair, ir⋅aq, q⋅uebec, c⋅olloq, q⋅uantified, d⋅eair, ir⋅aq, q⋅uartzite, e⋅mir, ir⋅aq, q⋅uailing, g⋅enii, i⋅raq, q⋅uantimeter, r⋅ani, i⋅raq, q⋅ueening\n"
     ]
    }
   ],
   "source": [
    "report(W, P)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Questions\n",
    "\n",
    "The program is complete, but there are still many interesting things to explore, and questions to answer.\n",
    "\n",
    "**Question: is there an imbalance in starting and ending letters of words?** That could lead to a need for many bridges. We saw in the last 100 steps of *P* multiple repetitions of the two-word bridge \"s⋅ir, ir⋅aq\". That suggests there are too many words that end in \"s\" and too many that start with \"q\". Let's investigate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Letter Starts   Ends Ratio\n",
      "------ ------ ------ -----\n",
      "    a   3,528    384   9:1\n",
      "    b   3,776      6 629:1\n",
      "    c   5,849    908   6:1\n",
      "    d   4,093  7,520   1:2\n",
      "    e   2,470  3,215   1:1\n",
      "    f   2,794     51  55:1\n",
      "    g   2,177  6,343   1:3\n",
      "    h   2,169    351   6:1\n",
      "    i   2,771    128  22:1\n",
      "    j     638      0   1:0\n",
      "    k     566    157   4:1\n",
      "    l   1,634  1,182   1:1\n",
      "    m   3,405    657   5:1\n",
      "    n   1,542  1,860   1:1\n",
      "    o   1,797    113  16:1\n",
      "    p   4,977    123  40:1\n",
      "    q     330      0   1:0\n",
      "    r   3,811  1,994   2:1\n",
      "    s   7,388 29,056   1:4\n",
      "    t   3,097  2,107   1:1\n",
      "    u   2,557     11 232:1\n",
      "    v   1,032      6 172:1\n",
      "    w   1,561     42  37:1\n",
      "    x      51     68   1:1\n",
      "    y     207  8,086  1:39\n",
      "    z     169     21   8:1\n"
     ]
    }
   ],
   "source": [
    "initialize(W)\n",
    "\n",
    "starts = Counter(w[0]  for w in W.unused_words)\n",
    "ends   = Counter(w[-1] for w in W.unused_words)\n",
    "\n",
    "def ratio(L: str) -> str:\n",
    "    \"\"\"Approximate ratio of words that start with L to words that end with L.\"\"\"\n",
    "    s, e = starts[L], ends[L]\n",
    "    return f'{round(s/e)}:1' if (s > e and e != 0) else f'1:{round(e/s)}'\n",
    "\n",
    "print('Letter Starts   Ends Ratio')\n",
    "print('------ ------ ------ -----')\n",
    "for L in sorted(starts):\n",
    "    print(f'{L:>5}  {starts[L]:6,d} {ends[L]:6,d} {ratio(L):>5}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yes, there is a problem: there are many more words that start with `b`, `f`, `p`, `u`, `u` and `v` than that end with those letters. In the other direction 45% of all words end in `s`, but only a quarter of that number start with `s`. The start:end ratio for `y` is 1:39."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.451257202317166"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ends['s'] / len(W.unused_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: what are the most common words in path *P*?** \n",
    "\n",
    "These will be bridge words. What do they have in common?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('sap', 2545),\n",
       " ('so', 2319),\n",
       " ('sic', 1793),\n",
       " ('of', 1792),\n",
       " ('lyre', 1685),\n",
       " ('dab', 1596),\n",
       " ('gab', 1515),\n",
       " ('sun', 1495),\n",
       " ('sin', 1429),\n",
       " ('yam', 1294),\n",
       " ('saw', 1000),\n",
       " ('lye', 713)]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(step.word for step in P).most_common(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed,  bridging away from `s` is a big concern (half of the top dozen bridges). Also, `lyre` and `lye` bridge away from an adverb ending, `ly` (as can `yep`).\n",
    "\n",
    "I'm surprised that `of` shows up so frequently. Let's see what it is bridging from:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('so', 1327),\n",
       " ('go', 270),\n",
       " ('do', 180),\n",
       " ('to', 8),\n",
       " ('cairo', 1),\n",
       " ('furioso', 1),\n",
       " ('francisco', 1),\n",
       " ('franco', 1),\n",
       " ('fortissimo', 1),\n",
       " ('vulgo', 1),\n",
       " ('fresno', 1)]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(P[i-1].word for i, step in enumerate(P) if step.word == 'of').most_common()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that `of` is used in two-word bridges to go from `s`, `g`, and  `d` to `f`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: What is the distribution of word lengths?** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({8: 11964,\n",
       "         9: 11950,\n",
       "         7: 8672,\n",
       "         10: 8443,\n",
       "         11: 6093,\n",
       "         12: 4423,\n",
       "         6: 4364,\n",
       "         13: 2885,\n",
       "         5: 1796,\n",
       "         14: 1765,\n",
       "         15: 1017,\n",
       "         16: 469,\n",
       "         17: 198,\n",
       "         4: 186,\n",
       "         18: 91,\n",
       "         19: 33,\n",
       "         20: 22,\n",
       "         21: 9,\n",
       "         22: 4,\n",
       "         23: 2,\n",
       "         3: 2,\n",
       "         28: 1})"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(map(len, W.unused_words)) # Counter of word lengths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: What is the longest word?** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'antidisestablishmentarianism'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "max(W, key=len)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: What is the distribution of letters in the Wordset?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('e', 68038),\n",
       " ('s', 60080),\n",
       " ('i', 53340),\n",
       " ('a', 43177),\n",
       " ('n', 42145),\n",
       " ('r', 41794),\n",
       " ('t', 38093),\n",
       " ('o', 35027),\n",
       " ('l', 32356),\n",
       " ('c', 23100),\n",
       " ('d', 22448),\n",
       " ('u', 19898),\n",
       " ('g', 17815),\n",
       " ('p', 16128),\n",
       " ('m', 16062),\n",
       " ('h', 12673),\n",
       " ('y', 11889),\n",
       " ('b', 11581),\n",
       " ('f', 7885),\n",
       " ('v', 5982),\n",
       " ('k', 4892),\n",
       " ('w', 4880),\n",
       " ('z', 2703),\n",
       " ('x', 1677),\n",
       " ('j', 1076),\n",
       " ('q', 1066)]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(L for w in W.unused_words for L in w).most_common() # Counter of letters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: How many bridges are there?** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "68178"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Make a list of all bridges, B\n",
    "B = [W.bridges[suf][pre] for suf in W.bridges for pre in W.bridges[suf]]\n",
    "len(B)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Bridge(excess=0, steps=[Step(overlap=1, word='tauts')]),\n",
       " Bridge(excess=1, steps=[Step(overlap=1, word='seers')]),\n",
       " Bridge(excess=1, steps=[Step(overlap=1, word='wasps')]),\n",
       " Bridge(excess=2, steps=[Step(overlap=1, word='hiccup')]),\n",
       " Bridge(excess=1, steps=[Step(overlap=1, word='jell')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='mopy')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='buxom')]),\n",
       " Bridge(excess=1, steps=[Step(overlap=2, word='doffs')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='cumin')]),\n",
       " Bridge(excess=1, steps=[Step(overlap=1, word='fade')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='gazebo')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='nebs')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=2, word='tees')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='imams')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=3, word='gene')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=3, word='workup')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=1, word='view')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=2, word='send')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=2, word='synod')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=2, word='into')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=2, word='iraqi')]),\n",
       " Bridge(excess=1, steps=[Step(overlap=3, word='tiled')]),\n",
       " Bridge(excess=2, steps=[Step(overlap=2, word='ochre')]),\n",
       " Bridge(excess=2, steps=[Step(overlap=2, word='usurp')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='jihad')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=2, word='vats')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=3, word='retie')]),\n",
       " Bridge(excess=2, steps=[Step(overlap=2, word='lying')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=3, word='fader')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='lethe')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='firth')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='zarfs')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='seels')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='kobold')]),\n",
       " Bridge(excess=0, steps=[Step(overlap=4, word='cappy')])]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B[::2000] # Sample every 2000th bridge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: How many excess letters do the bridges have?** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({0: 42395, 1: 20588, 2: 4539, 3: 578, 4: 50, 5: 21, 6: 6, 8: 1})"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Counter of bridge excess letters\n",
    "BC = Counter(b.excess for b in B)\n",
    "BC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most of the bridges have 0 or 1 excess letter, so we're doing pretty well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.46567807797236643"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import mean\n",
    "\n",
    "mean(BC.elements())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: How many 1-step and 2-step bridges are there?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({1: 68033, 2: 145})"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(len(b.steps) for b in B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are only 148 2-step bridges; we might as well see them all:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'af': Bridge(excess=2, steps=[Step(overlap=1, word='ago'), Step(overlap=1, word='of')]),\n",
       " 'ag': Bridge(excess=2, steps=[Step(overlap=1, word='an'), Step(overlap=1, word='nag')]),\n",
       " 'aj': Bridge(excess=4, steps=[Step(overlap=1, word='ash'), Step(overlap=1, word='hadj')]),\n",
       " 'aq': Bridge(excess=3, steps=[Step(overlap=1, word='air'), Step(overlap=2, word='iraq')]),\n",
       " 'av': Bridge(excess=1, steps=[Step(overlap=1, word='at'), Step(overlap=1, word='tv')]),\n",
       " 'bc': Bridge(excess=2, steps=[Step(overlap=1, word='bar'), Step(overlap=2, word='arc')]),\n",
       " 'bj': Bridge(excess=4, steps=[Step(overlap=1, word='bacon'), Step(overlap=3, word='conj')]),\n",
       " 'bq': Bridge(excess=6, steps=[Step(overlap=1, word='belli'), Step(overlap=1, word='iraq')]),\n",
       " 'bv': Bridge(excess=2, steps=[Step(overlap=1, word='but'), Step(overlap=1, word='tv')]),\n",
       " 'cv': Bridge(excess=2, steps=[Step(overlap=1, word='cot'), Step(overlap=1, word='tv')]),\n",
       " 'df': Bridge(excess=1, steps=[Step(overlap=1, word='do'), Step(overlap=1, word='of')]),\n",
       " 'dj': Bridge(excess=4, steps=[Step(overlap=1, word='doc'), Step(overlap=1, word='conj')]),\n",
       " 'dq': Bridge(excess=5, steps=[Step(overlap=1, word='deair'), Step(overlap=2, word='iraq')]),\n",
       " 'dr': Bridge(excess=1, steps=[Step(overlap=1, word='do'), Step(overlap=1, word='or')]),\n",
       " 'du': Bridge(excess=3, steps=[Step(overlap=1, word='doe'), Step(overlap=1, word='emu')]),\n",
       " 'dv': Bridge(excess=2, steps=[Step(overlap=1, word='dot'), Step(overlap=1, word='tv')]),\n",
       " 'dz': Bridge(excess=3, steps=[Step(overlap=1, word='do'), Step(overlap=1, word='oyez')]),\n",
       " 'ej': Bridge(excess=3, steps=[Step(overlap=1, word='econ'), Step(overlap=3, word='conj')]),\n",
       " 'ep': Bridge(excess=2, steps=[Step(overlap=1, word='emu'), Step(overlap=1, word='up')]),\n",
       " 'eq': Bridge(excess=4, steps=[Step(overlap=1, word='emir'), Step(overlap=2, word='iraq')]),\n",
       " 'ev': Bridge(excess=2, steps=[Step(overlap=1, word='eat'), Step(overlap=1, word='tv')]),\n",
       " 'ew': Bridge(excess=2, steps=[Step(overlap=1, word='era'), Step(overlap=2, word='raw')]),\n",
       " 'ez': Bridge(excess=3, steps=[Step(overlap=1, word='elf'), Step(overlap=1, word='fez')]),\n",
       " 'fc': Bridge(excess=2, steps=[Step(overlap=1, word='for'), Step(overlap=2, word='orc')]),\n",
       " 'fj': Bridge(excess=5, steps=[Step(overlap=1, word='fish'), Step(overlap=1, word='hadj')]),\n",
       " 'fq': Bridge(excess=3, steps=[Step(overlap=1, word='fir'), Step(overlap=2, word='iraq')]),\n",
       " 'fv': Bridge(excess=2, steps=[Step(overlap=1, word='fat'), Step(overlap=1, word='tv')]),\n",
       " 'gc': Bridge(excess=2, steps=[Step(overlap=1, word='go'), Step(overlap=1, word='orc')]),\n",
       " 'gf': Bridge(excess=1, steps=[Step(overlap=1, word='go'), Step(overlap=1, word='of')]),\n",
       " 'gj': Bridge(excess=5, steps=[Step(overlap=1, word='goth'), Step(overlap=1, word='hadj')]),\n",
       " 'gq': Bridge(excess=6, steps=[Step(overlap=1, word='genii'), Step(overlap=1, word='iraq')]),\n",
       " 'gr': Bridge(excess=1, steps=[Step(overlap=1, word='go'), Step(overlap=1, word='or')]),\n",
       " 'gv': Bridge(excess=2, steps=[Step(overlap=1, word='gut'), Step(overlap=1, word='tv')]),\n",
       " 'hc': Bridge(excess=2, steps=[Step(overlap=1, word='he'), Step(overlap=1, word='etc')]),\n",
       " 'hq': Bridge(excess=4, steps=[Step(overlap=1, word='hair'), Step(overlap=2, word='iraq')]),\n",
       " 'hu': Bridge(excess=2, steps=[Step(overlap=1, word='hem'), Step(overlap=2, word='emu')]),\n",
       " 'hv': Bridge(excess=2, steps=[Step(overlap=1, word='hit'), Step(overlap=1, word='tv')]),\n",
       " 'ic': Bridge(excess=2, steps=[Step(overlap=1, word='is'), Step(overlap=1, word='sic')]),\n",
       " 'ig': Bridge(excess=2, steps=[Step(overlap=1, word='is'), Step(overlap=1, word='sag')]),\n",
       " 'ij': Bridge(excess=3, steps=[Step(overlap=1, word='icon'), Step(overlap=3, word='conj')]),\n",
       " 'io': Bridge(excess=1, steps=[Step(overlap=1, word='is'), Step(overlap=1, word='so')]),\n",
       " 'iu': Bridge(excess=2, steps=[Step(overlap=1, word='if'), Step(overlap=1, word='flu')]),\n",
       " 'iv': Bridge(excess=1, steps=[Step(overlap=1, word='it'), Step(overlap=1, word='tv')]),\n",
       " 'iw': Bridge(excess=2, steps=[Step(overlap=1, word='is'), Step(overlap=1, word='sew')]),\n",
       " 'iz': Bridge(excess=2, steps=[Step(overlap=1, word='if'), Step(overlap=1, word='fez')]),\n",
       " 'jc': Bridge(excess=2, steps=[Step(overlap=1, word='jet'), Step(overlap=2, word='etc')]),\n",
       " 'jq': Bridge(excess=6, steps=[Step(overlap=1, word='jinni'), Step(overlap=1, word='iraq')]),\n",
       " 'jv': Bridge(excess=2, steps=[Step(overlap=1, word='jet'), Step(overlap=1, word='tv')]),\n",
       " 'kc': Bridge(excess=3, steps=[Step(overlap=1, word='kepi'), Step(overlap=3, word='epic')]),\n",
       " 'kj': Bridge(excess=5, steps=[Step(overlap=1, word='kith'), Step(overlap=1, word='hadj')]),\n",
       " 'km': Bridge(excess=3, steps=[Step(overlap=1, word='keel'), Step(overlap=2, word='elm')]),\n",
       " 'kq': Bridge(excess=5, steps=[Step(overlap=1, word='kepi'), Step(overlap=1, word='iraq')]),\n",
       " 'lc': Bridge(excess=2, steps=[Step(overlap=1, word='let'), Step(overlap=2, word='etc')]),\n",
       " 'lj': Bridge(excess=4, steps=[Step(overlap=1, word='loco'), Step(overlap=2, word='conj')]),\n",
       " 'lq': Bridge(excess=3, steps=[Step(overlap=1, word='lira'), Step(overlap=3, word='iraq')]),\n",
       " 'lv': Bridge(excess=2, steps=[Step(overlap=1, word='lit'), Step(overlap=1, word='tv')]),\n",
       " 'lz': Bridge(excess=3, steps=[Step(overlap=1, word='levi'), Step(overlap=2, word='viz')]),\n",
       " 'mj': Bridge(excess=4, steps=[Step(overlap=1, word='mac'), Step(overlap=1, word='conj')]),\n",
       " 'mq': Bridge(excess=5, steps=[Step(overlap=1, word='maxi'), Step(overlap=1, word='iraq')]),\n",
       " 'mz': Bridge(excess=4, steps=[Step(overlap=1, word='merit'), Step(overlap=3, word='ritz')]),\n",
       " 'nf': Bridge(excess=1, steps=[Step(overlap=1, word='no'), Step(overlap=1, word='of')]),\n",
       " 'nj': Bridge(excess=5, steps=[Step(overlap=1, word='narco'), Step(overlap=2, word='conj')]),\n",
       " 'nq': Bridge(excess=4, steps=[Step(overlap=1, word='noir'), Step(overlap=2, word='iraq')]),\n",
       " 'nv': Bridge(excess=2, steps=[Step(overlap=1, word='not'), Step(overlap=1, word='tv')]),\n",
       " 'oi': Bridge(excess=2, steps=[Step(overlap=1, word='of'), Step(overlap=1, word='fbi')]),\n",
       " 'oj': Bridge(excess=4, steps=[Step(overlap=1, word='ooh'), Step(overlap=1, word='hadj')]),\n",
       " 'op': Bridge(excess=2, steps=[Step(overlap=1, word='or'), Step(overlap=1, word='rap')]),\n",
       " 'oq': Bridge(excess=6, steps=[Step(overlap=1, word='oculi'), Step(overlap=1, word='iraq')]),\n",
       " 'ou': Bridge(excess=2, steps=[Step(overlap=1, word='of'), Step(overlap=1, word='flu')]),\n",
       " 'ov': Bridge(excess=2, steps=[Step(overlap=1, word='oft'), Step(overlap=1, word='tv')]),\n",
       " 'ow': Bridge(excess=2, steps=[Step(overlap=1, word='or'), Step(overlap=1, word='row')]),\n",
       " 'pj': Bridge(excess=4, steps=[Step(overlap=1, word='poco'), Step(overlap=2, word='conj')]),\n",
       " 'pq': Bridge(excess=4, steps=[Step(overlap=1, word='pair'), Step(overlap=2, word='iraq')]),\n",
       " 'pv': Bridge(excess=2, steps=[Step(overlap=1, word='pat'), Step(overlap=1, word='tv')]),\n",
       " 'qb': Bridge(excess=4, steps=[Step(overlap=1, word='quip'), Step(overlap=1, word='pub')]),\n",
       " 'qj': Bridge(excess=5, steps=[Step(overlap=1, word='qoph'), Step(overlap=1, word='hadj')]),\n",
       " 'qv': Bridge(excess=3, steps=[Step(overlap=1, word='quit'), Step(overlap=1, word='tv')]),\n",
       " 'qw': Bridge(excess=4, steps=[Step(overlap=1, word='quip'), Step(overlap=1, word='paw')]),\n",
       " 'qx': Bridge(excess=4, steps=[Step(overlap=1, word='quip'), Step(overlap=1, word='pox')]),\n",
       " 'rj': Bridge(excess=4, steps=[Step(overlap=1, word='recon'), Step(overlap=3, word='conj')]),\n",
       " 'rq': Bridge(excess=5, steps=[Step(overlap=1, word='rani'), Step(overlap=1, word='iraq')]),\n",
       " 'ru': Bridge(excess=3, steps=[Step(overlap=1, word='rug'), Step(overlap=1, word='gnu')]),\n",
       " 'rv': Bridge(excess=2, steps=[Step(overlap=1, word='rut'), Step(overlap=1, word='tv')]),\n",
       " 'sf': Bridge(excess=1, steps=[Step(overlap=1, word='so'), Step(overlap=1, word='of')]),\n",
       " 'sj': Bridge(excess=3, steps=[Step(overlap=1, word='shad'), Step(overlap=3, word='hadj')]),\n",
       " 'sq': Bridge(excess=3, steps=[Step(overlap=1, word='sir'), Step(overlap=2, word='iraq')]),\n",
       " 'tf': Bridge(excess=1, steps=[Step(overlap=1, word='to'), Step(overlap=1, word='of')]),\n",
       " 'tj': Bridge(excess=4, steps=[Step(overlap=1, word='taco'), Step(overlap=2, word='conj')]),\n",
       " 'tq': Bridge(excess=5, steps=[Step(overlap=1, word='taxi'), Step(overlap=1, word='iraq')]),\n",
       " 'tz': Bridge(excess=2, steps=[Step(overlap=1, word='tv'), Step(overlap=1, word='viz')]),\n",
       " 'ub': Bridge(excess=2, steps=[Step(overlap=1, word='us'), Step(overlap=1, word='sob')]),\n",
       " 'uf': Bridge(excess=2, steps=[Step(overlap=1, word='ufo'), Step(overlap=1, word='of')]),\n",
       " 'ug': Bridge(excess=2, steps=[Step(overlap=1, word='ufo'), Step(overlap=2, word='fog')]),\n",
       " 'uj': Bridge(excess=4, steps=[Step(overlap=1, word='ugh'), Step(overlap=1, word='hadj')]),\n",
       " 'uq': Bridge(excess=5, steps=[Step(overlap=1, word='ugli'), Step(overlap=1, word='iraq')]),\n",
       " 'uw': Bridge(excess=2, steps=[Step(overlap=1, word='us'), Step(overlap=1, word='sew')]),\n",
       " 'uz': Bridge(excess=3, steps=[Step(overlap=1, word='us'), Step(overlap=1, word='suez')]),\n",
       " 'vc': Bridge(excess=2, steps=[Step(overlap=1, word='vet'), Step(overlap=2, word='etc')]),\n",
       " 'vf': Bridge(excess=3, steps=[Step(overlap=1, word='vino'), Step(overlap=1, word='of')]),\n",
       " 'vj': Bridge(excess=6, steps=[Step(overlap=1, word='vedic'), Step(overlap=1, word='conj')]),\n",
       " 'vk': Bridge(excess=3, steps=[Step(overlap=1, word='vow'), Step(overlap=1, word='wok')]),\n",
       " 'vm': Bridge(excess=2, steps=[Step(overlap=1, word='via'), Step(overlap=1, word='am')]),\n",
       " 'vq': Bridge(excess=5, steps=[Step(overlap=1, word='vizir'), Step(overlap=2, word='iraq')]),\n",
       " 'wc': Bridge(excess=2, steps=[Step(overlap=1, word='war'), Step(overlap=2, word='arc')]),\n",
       " 'wj': Bridge(excess=5, steps=[Step(overlap=1, word='wash'), Step(overlap=1, word='hadj')]),\n",
       " 'wq': Bridge(excess=4, steps=[Step(overlap=1, word='whir'), Step(overlap=2, word='iraq')]),\n",
       " 'wu': Bridge(excess=2, steps=[Step(overlap=1, word='we'), Step(overlap=1, word='emu')]),\n",
       " 'wv': Bridge(excess=2, steps=[Step(overlap=1, word='wit'), Step(overlap=1, word='tv')]),\n",
       " 'xb': Bridge(excess=4, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='sob')]),\n",
       " 'xf': Bridge(excess=5, steps=[Step(overlap=1, word='xmas'), Step(overlap=3, word='massif')]),\n",
       " 'xg': Bridge(excess=4, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='sag')]),\n",
       " 'xh': Bridge(excess=3, steps=[Step(overlap=1, word='xmas'), Step(overlap=3, word='mash')]),\n",
       " 'xi': Bridge(excess=4, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='ski')]),\n",
       " 'xj': Bridge(excess=6, steps=[Step(overlap=1, word='xenic'), Step(overlap=1, word='conj')]),\n",
       " 'xk': Bridge(excess=3, steps=[Step(overlap=1, word='xmas'), Step(overlap=3, word='mask')]),\n",
       " 'xl': Bridge(excess=5, steps=[Step(overlap=1, word='xebec'), Step(overlap=2, word='eccl')]),\n",
       " 'xo': Bridge(excess=3, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='so')]),\n",
       " 'xp': Bridge(excess=3, steps=[Step(overlap=1, word='xmas'), Step(overlap=2, word='asp')]),\n",
       " 'xq': Bridge(excess=8, steps=[Step(overlap=1, word='xenic'), Step(overlap=1, word='colloq')]),\n",
       " 'xt': Bridge(excess=3, steps=[Step(overlap=1, word='xmas'), Step(overlap=3, word='mast')]),\n",
       " 'xu': Bridge(excess=4, steps=[Step(overlap=1, word='xylem'), Step(overlap=2, word='emu')]),\n",
       " 'xv': Bridge(excess=5, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='shiv')]),\n",
       " 'xw': Bridge(excess=4, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='sew')]),\n",
       " 'xy': Bridge(excess=4, steps=[Step(overlap=1, word='xenic'), Step(overlap=2, word='icy')]),\n",
       " 'xz': Bridge(excess=5, steps=[Step(overlap=1, word='xmas'), Step(overlap=1, word='suez')]),\n",
       " 'yb': Bridge(excess=3, steps=[Step(overlap=1, word='yon'), Step(overlap=1, word='nab')]),\n",
       " 'yc': Bridge(excess=2, steps=[Step(overlap=1, word='yet'), Step(overlap=2, word='etc')]),\n",
       " 'yf': Bridge(excess=3, steps=[Step(overlap=1, word='yogi'), Step(overlap=1, word='if')]),\n",
       " 'yj': Bridge(excess=5, steps=[Step(overlap=1, word='yeah'), Step(overlap=1, word='hadj')]),\n",
       " 'yo': Bridge(excess=2, steps=[Step(overlap=1, word='yon'), Step(overlap=1, word='no')]),\n",
       " 'yq': Bridge(excess=5, steps=[Step(overlap=1, word='yogi'), Step(overlap=1, word='iraq')]),\n",
       " 'yv': Bridge(excess=2, steps=[Step(overlap=1, word='yet'), Step(overlap=1, word='tv')]),\n",
       " 'yx': Bridge(excess=3, steps=[Step(overlap=1, word='yale'), Step(overlap=2, word='lex')]),\n",
       " 'yz': Bridge(excess=4, steps=[Step(overlap=1, word='yep'), Step(overlap=1, word='phiz')]),\n",
       " 'zb': Bridge(excess=3, steps=[Step(overlap=1, word='zap'), Step(overlap=1, word='pub')]),\n",
       " 'zd': Bridge(excess=2, steps=[Step(overlap=1, word='zen'), Step(overlap=2, word='end')]),\n",
       " 'zf': Bridge(excess=2, steps=[Step(overlap=1, word='zoo'), Step(overlap=1, word='of')]),\n",
       " 'zh': Bridge(excess=2, steps=[Step(overlap=1, word='zoo'), Step(overlap=2, word='ooh')]),\n",
       " 'zj': Bridge(excess=5, steps=[Step(overlap=1, word='zinc'), Step(overlap=1, word='conj')]),\n",
       " 'zk': Bridge(excess=3, steps=[Step(overlap=1, word='zoo'), Step(overlap=1, word='oak')]),\n",
       " 'zq': Bridge(excess=5, steps=[Step(overlap=1, word='zuni'), Step(overlap=1, word='iraq')]),\n",
       " 'zr': Bridge(excess=2, steps=[Step(overlap=1, word='zoo'), Step(overlap=1, word='or')]),\n",
       " 'zv': Bridge(excess=3, steps=[Step(overlap=1, word='zest'), Step(overlap=1, word='tv')]),\n",
       " 'zw': Bridge(excess=3, steps=[Step(overlap=1, word='zap'), Step(overlap=1, word='paw')]),\n",
       " 'zx': Bridge(excess=3, steps=[Step(overlap=1, word='zap'), Step(overlap=2, word='apex')])}"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{A + B: W.bridges[A][B] for A, B in letter_pairs\n",
    " if len(W.bridges[A][B].steps) == 2}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question: What strange letter combinations are there?** Let's look at two-letter suffixes or prefixes that only appear in one or two nonsubwords. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'aj': {'ajar'},\n",
       " 'ay': {'ayahs', 'ayatollahs'},\n",
       " 'bw': {'bwanas'},\n",
       " 'ct': {'ctrl'},\n",
       " 'dn': {'dnieper'},\n",
       " 'dv': {'dvorak'},\n",
       " 'ek': {'ekistics'},\n",
       " 'ez': {'ezekiel'},\n",
       " 'fb': {'fbi'},\n",
       " 'fj': {'fjords'},\n",
       " 'gj': {'gjetosts'},\n",
       " 'gw': {'gweducks', 'gweducs'},\n",
       " 'hd': {'hdqrs'},\n",
       " 'ie': {'ieee'},\n",
       " 'if': {'iffiness'},\n",
       " 'ik': {'ikebanas', 'ikons'},\n",
       " 'ip': {'ipecacs'},\n",
       " 'iv': {'ivories', 'ivory'},\n",
       " 'jn': {'jnanas'},\n",
       " 'kw': {'kwachas', 'kwashiorkor'},\n",
       " 'mc': {'mcdonald'},\n",
       " 'oj': {'ojibwas'},\n",
       " 'pf': {'pfennigs'},\n",
       " 'qa': {'qaids', 'qatar'},\n",
       " 'qo': {'qophs'},\n",
       " 'sf': {'sforzatos'},\n",
       " 'tc': {'tchaikovsky'},\n",
       " 'uf': {'ufos'},\n",
       " 'wu': {'wurzel'},\n",
       " 'xi': {'xiphoids', 'xiphosuran'},\n",
       " 'xm': {'xmases'},\n",
       " 'yc': {'ycleped', 'yclept'},\n",
       " 'ym': {'ymca'},\n",
       " 'zl': {'zlotys'},\n",
       " 'zw': {'zwiebacks'}}"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{pre: W.unused_startswith[pre] # Rare two-letter prefixes\n",
    " for pre in letter_pairs if len(W.unused_startswith[pre]) in (1, 2)}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'mb': {'clomb', 'whitecomb'},\n",
       " 'dn': {'haydn'},\n",
       " 'rf': {'waldorf', 'windsurf'},\n",
       " 'ao': {'chiao', 'ciao'},\n",
       " 'gn': {'champaign'},\n",
       " 'zm': {'transcendentalizm'},\n",
       " 'gm': {'apophthegm'},\n",
       " 'sr': {'ussr'},\n",
       " 'nu': {'vishnu'},\n",
       " 'ku': {'haiku'},\n",
       " 'ec': {'filespec', 'quebec'},\n",
       " 'lu': {'honolulu'},\n",
       " 'we': {'zimbabwe'},\n",
       " 'ua': {'joshua'},\n",
       " 'nx': {'bronx', 'meninx'},\n",
       " 'tl': {'peyotl', 'shtetl'},\n",
       " 'mt': {'daydreamt', 'undreamt'},\n",
       " 'cd': {'recd'},\n",
       " 'hr': {'kieselguhr'},\n",
       " 'ui': {'maqui', 'prosequi'},\n",
       " 'zo': {'diazo', 'palazzo'},\n",
       " 'bm': {'ibm', 'icbm'},\n",
       " 'su': {'shiatsu'},\n",
       " 'ud': {'aloud', 'overproud'},\n",
       " 'aa': {'markkaa'},\n",
       " 'ji': {'fiji'},\n",
       " 'nc': {'dezinc', 'quidnunc'},\n",
       " 'mp': {'prestamp'},\n",
       " 'wa': {'kiowa', 'okinawa'},\n",
       " 'ou': {'thankyou'},\n",
       " 'xe': {'deluxe', 'maxixe'},\n",
       " 'oz': {'kolkhoz'},\n",
       " 'ko': {'gingko', 'stinko'},\n",
       " 'xo': {'convexo'},\n",
       " 'xs': {'duplexs'},\n",
       " 'ob': {'blowjob'},\n",
       " 'za': {'organza'},\n",
       " 'pa': {'tampa'},\n",
       " 'ho': {'groucho'},\n",
       " 'nz': {'franz'},\n",
       " 'sz': {'grosz'},\n",
       " 'td': {'retd'},\n",
       " 'ab': {'skylab'},\n",
       " 'ug': {'bedrug', 'sparkplug'},\n",
       " 'dt': {'rembrandt'},\n",
       " 'oi': {'hanoi', 'polloi'},\n",
       " 'ub': {'beelzebub'},\n",
       " 'uc': {'caoutchouc'},\n",
       " 'lm': {'stockholm', 'unhelm'},\n",
       " 'ep': {'asleep', 'shlep'},\n",
       " 'po': {'troppo'},\n",
       " 'tu': {'impromptu'},\n",
       " 'yx': {'styx'},\n",
       " 'ef': {'unicef'},\n",
       " 'zt': {'liszt'},\n",
       " 'hu': {'buchu'},\n",
       " 'ai': {'bonsai'},\n",
       " 'oe': {'monroe'},\n",
       " 'vt': {'govt'},\n",
       " 'eh': {'mikveh', 'yahweh'},\n",
       " 'vo': {'concavo'},\n",
       " 'ln': {'lincoln'},\n",
       " 'rb': {'cowherb'},\n",
       " 'hm': {'microhm'},\n",
       " 'fa': {'khalifa'},\n",
       " 'ru': {'nehru'},\n",
       " 'hn': {'mendelssohn'}}"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "endswith = multimap((w[-2:], w) for w in W.unused_words)\n",
    "\n",
    "{suf: endswith[suf] # Rare two-letter suffixes\n",
    " for suf in endswith if len(endswith[suf]) <= 2}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two-letter prefixes definitely include some strange words.\n",
    "\n",
    "The list of two-letter suffixes is mostly picking out proper names and pointing out flaws in the word list. For example, lots of words end in `ab`: blab, cab, crab, dab, gab, jab, lab, etc. But must of them are subwords of plural forms; only `skylab` made it into the word list in singular form but not plural."
   ]
  },
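  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of why `skylab` survives while `lab` does not, here is a minimal sketch of subword removal. The function name `nonsubwords` and the tiny word set are assumptions made for this example; the quadratic scan is only meant for toy inputs and is not necessarily how `W` is built.\n",
    "\n",
    "```python\n",
    "def nonsubwords(words):\n",
    "    'Words that are not a substring of any other word in the set.'\n",
    "    words = set(words)\n",
    "    return {w for w in words\n",
    "            if not any(w != other and w in other for other in words)}\n",
    "\n",
    "words = {'lab', 'labs', 'cab', 'cabs', 'skylab'}\n",
    "kept = nonsubwords(words)\n",
    "print(sorted(kept))                                # singular 'lab' and 'cab' are dropped\n",
    "print(sorted(w for w in kept if w.endswith('ab'))) # only 'skylab' still ends in 'ab'\n",
    "```\n",
    "\n",
    "In this toy set `lab` and `cab` disappear because `labs` and `cabs` contain them, which leaves `skylab` as the only remaining word with the suffix `ab`."
   ]
  },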
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison to Tom Murphy's Program\n",
    "\n",
    "To compare my [program](portman.py) to [Tom Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): \n",
    "- My string over *W* is about 554,000 letters; Murphy's is 611,000.\n",
    "- I used a greedy approach that builds up a single long portmanteau, one step at a time. \n",
    "- Murphy first built a pool of smaller portmanteaux, then greedily joined them all together.\n",
    "- I used Python (about 125 lines for the program without the exploratory questions and the pretty output).\n",
    "- Murphy used C++ (1867 lines), with a lot of extra functionality I didn't do: generating diagrams and animations, and running multiple threads in parallel.\n",
    "\n",
    "I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb) where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments. The two approaches are different, but they are both suboptimal greedy methods, andit is not clear whether one is better than the other. You could try it!\n",
    "\n",
    "(*English trivia:*  my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a **bridge**.  Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. The different metaphors for what our programs are doing lead to different terminology for the same idea.)\n",
    "\n",
    "\n",
    " \n",
    "\n",
    "It appears Murphy  perhaps didn't quite have the complete concept of **subwords**. He did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow  [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'`. My guess is that if he adds `'spectaculars'` first he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. Support for this view is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs.\" but I computed that there are 44,320 such subwords; he found about half of them. I think those missing 20,001 words are the main reason why my strings are shorter.\n",
    "\n",
    "Also, Murphy's joins are always between one-letter prefixes and suffixes. I allow prefixes and suffixes of any length up to a total of 6 for `len(pre) + len(suf)`, for one-word bridges. I can get away with this because I limited my candidate pool to the 10,000 `W.short_words`. It would have been time-consuming to build all bridges for all 100,000 words, and probably would not have helped shorten *S* appreciably.\n",
    "\n",
    "I should say that I stole one important trick from Murphy. After I finished the first version of my program, I looked at his highly-entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI) and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that I had a problem in my use of bridges. My `natalie` function originally contained something like this: \n",
    "\n",
    "    unused_word_step(...) or one_word_bridge(...) or two_word_bridge(...)\n",
    "    \n",
    "That is, I only considered two-word bridges when there was no one-word bridge, on the assumption that one word is shorter than two. But Murphy showed that my assumption was wrong: for `bridges['w']['c']` I had `'workaholic'` as the best one-word bridge, but he had the two-word bridge `'war' + 'arc' = 'warc'`, which saves six excess letters over my single word. After seeing that, I shamelessly copied his approach, and now I too get a two-letter excess for `bridges['w']['c']`. (Sometimes  `'war' + 'arc'` and sometimes `'wet' + 'etc'` or `'we' + 'etc'`, depending on the seed for the hash function that hashes strings into a word set.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Bridge(excess=2, steps=[Step(overlap=1, word='war'), Step(overlap=2, word='arc')])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "W.bridges['w']['c']"
   ]
  },
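  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between the two strategies can be sketched on a toy word list. The helper names `overlap`, `join`, and `best_bridge` are hypothetical, written only for this example, and are not the notebook's own functions. Excess is assumed to be the number of bridge letters beyond the given prefix and suffix, which agrees with the `excess` values in the output above.\n",
    "\n",
    "```python\n",
    "from itertools import product\n",
    "\n",
    "def overlap(a, b):\n",
    "    'Longest overlap between the end of a and the start of b (neither word fully consumed).'\n",
    "    for n in range(min(len(a), len(b)) - 1, 0, -1):\n",
    "        if a.endswith(b[:n]):\n",
    "            return n\n",
    "    return 0\n",
    "\n",
    "def join(a, b):\n",
    "    'Concatenate a and b, writing their overlapping letters only once.'\n",
    "    return a + b[overlap(a, b):]\n",
    "\n",
    "def best_bridge(pre, suf, words, max_words=2):\n",
    "    'Return (excess, letters) for the shortest bridge from pre to suf, or None.'\n",
    "    candidates = [w for w in words\n",
    "                  if w.startswith(pre) and w.endswith(suf)\n",
    "                  and len(w) >= len(pre) + len(suf)]\n",
    "    if max_words >= 2:  # also consider every overlapping pair of words\n",
    "        candidates += [join(a, b) for a, b in product(words, repeat=2)\n",
    "                       if a.startswith(pre) and b.endswith(suf) and overlap(a, b)]\n",
    "    best = min(candidates, key=len, default=None)\n",
    "    return best and (len(best) - len(pre) - len(suf), best)\n",
    "\n",
    "words = ['workaholic', 'war', 'arc', 'we', 'etc']\n",
    "print(best_bridge('w', 'c', words, max_words=1))  # one-word bridges only\n",
    "print(best_bridge('w', 'c', words))               # one- and two-word bridges together\n",
    "```\n",
    "\n",
    "On this toy list the one-word search settles for `'workaholic'` with an excess of 8, while comparing all candidates by length finds `'warc'` with an excess of 2. Taking the minimum over both kinds of bridge, rather than chaining them with `or`, is the design choice that matters here."
   ]
  },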
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "I'll stop here, but you should feel free to do more experimentation of your own. \n",
    "\n",
    "Here are some things you could do to make the portmantouts more interesting:\n",
    "\n",
    "- Use linguistic resources (such as [pretrained word embeddings](https://nlp.stanford.edu/projects/glove/)) to teach your program what words are related to each other. Encourage the program to place  related words next to each other. Maybe even make grammatical sentences.\n",
    "- Use linguistic resources (such as [NLTK](https://github.com/nltk/)) to teach your program where syllable breaks are in words, and what each syllable sounds like. Encourage the program to make overlaps match syllables. (That's why \"preferendumdums\" sounds better than \"fortyphonshore\".)\n",
    "\n",
    "Here are some things you could do to make *S* shorter:\n",
    "\n",
    "- **Lookahead**: Unused words are chosen based on the degree of overlap, but nothing else. It might help to prefer unused words which have a suffix that matches the prefix of another unused word. A single-word lookahead or a beam search could be used.\n",
    "- **Reserving words**: It seems like `haydn` and `dnieper` are made to go together in that order; they're the only two words with `dn` as an affix. Similarly, `womenfolk` should be followed by `menfolks`. But if we happened to place `dnieper` or `menfolks` first, we would loose the chance of these nice overlaps.  Maybe there could be a system that assures the proper ordering, or a preprocessing step that joins together words that go together uniquely well. \n",
    "- **Word choice ordering**: Perhaps `startswith_table` could sort the words in each key's bucket so that the \"difficult\" words (say, the ones that end in unusual letters) are encountered earlier in the program's execution, when there are more available words for them to connect to.\n",
    "- **Learning**: The greedy approach minimizes the number of excess letters for each step. But some words are harder to place than others. Instead of just minimizing the excess, consider also the *expected* excess of each word, which could be learned. \n",
    "  \n",
    "Here are some things you could do to make the program more robust:\n",
    "\n",
    "- Write and run unit tests.\n",
    "- Find other word lists, perhaps in other languages, and try the program on them.\n",
    "- Try word lists such as a list of names, or cities or countries, augmented with common short words.\n",
    "- Consider what to do for a wordset that has missing bridges. You could try three-word bridges, you could allow the program to back up and remove a previously-placed word; you could allow the addition of words to the start as well as the end of `P`."
   ]
  },
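  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **Reserving words** idea could start from a preprocessing pass like the following sketch. The name `unique_affix_pairs`, the fixed affix length `n`, and the five-word list are assumptions for illustration; nothing like this exists in the program above, and deciding what to do with the pairs it finds is left open.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "def unique_affix_pairs(words, n=2):\n",
    "    'Pairs (a, b): a is the only word ending in some n letters, b the only word starting with them.'\n",
    "    starts, ends = defaultdict(list), defaultdict(list)\n",
    "    for w in words:\n",
    "        if len(w) > n:  # the affix must be shorter than the word itself\n",
    "            starts[w[:n]].append(w)\n",
    "            ends[w[-n:]].append(w)\n",
    "    return [(ends[affix][0], starts[affix][0])\n",
    "            for affix in sorted(ends)\n",
    "            if len(ends[affix]) == 1 and len(starts.get(affix, [])) == 1\n",
    "            and ends[affix][0] != starts[affix][0]]\n",
    "\n",
    "words = ['haydn', 'dnieper', 'hello', 'lowbrow', 'low']\n",
    "print(unique_affix_pairs(words))\n",
    "```\n",
    "\n",
    "Here `haydn` and `dnieper` are paired because `dn` has exactly one word on each side, whereas `hello` is not paired because two words start with `lo`."
   ]
  },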
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:base] *",
   "language": "python",
   "name": "conda-base-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}