{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
Peter Norvig
Dec 2018
Updated Jun 2020
\n", "\n", "# Portmantout Words\n", "\n", "A [***portmanteau***](https://en.wikipedia.org/wiki/Portmanteau) is a word that squishes together two words, like ***mathlete*** = ***math*** + ***athlete***. Inspired by [**Darius Bacon**](http://wry.me/), I covered this as a programming exercise in my 2012 [**Udacity course**](https://www.udacity.com/course/design-of-computer-programs--cs212). In 2018 I was re-inspired by [**Tom Murphy VII**](http://tom7.org/), who added a new twist: [***portmantout words***](http://www.cs.cmu.edu/~tom7/portmantout/) ([***tout***](https://www.duolingo.com/dictionary/French/tout/fd4dc453d9be9f32b7efe838ebc87599) from the French for *all*).\n", "\n", "\n", "Informally, **portmantout** means that you take all the words in a set and mush them together in some order such that each word overlaps the next. Or more formally: \n", "\n", "A **portmantout** of a set of words *W* is a string *S* constructed as follows:\n", "- Make a path *P* that is an ordered list containing all the words in *W* at least once each.\n", "- Every word in *P* (except the first) must *overlap*: it must begin with a prefix that is the same as a suffix of the previous word.\n", "- The string *S* is formed by concatenating the letters in *P*, but using each overlap only once, not twice.\n", "- For example:\n", " - *W* = {world, hello, lowbrow}\n", " - *P* = [hello, lowbrow, world]\n", " - *S* = \"hellowbroworld\".\n", "\n", "\n", "This notebook develops a program that can find a portmantout *S* for a set *W* of over 100,000 words in a few seconds. The program attempts to minimize the length of *S*, but does not guarantee that it is the shortest possible. 
Along the way it generated these interesting portmanteaux:\n", "\n", "\n", "- **impromptutankhamenability** [impromptu, tutankhamen, amenability]: willingness to see the Egyptian exhibit on the spur of the moment.\n", "- **preferendumdums** [prefer, referendum, dumdums]: agreeable uninformed voters.\n", "- **dashikimonogrammarianarchy** [dashiki, kimono, monogram, grammarian, anarchy]: the chaos that ensues when a linguist gets involved in choosing how to inscribe African/Japanese garb.\n", "- **fortyphonshore** [forty, typhons, onshore]: a dire weather report. \n", "- **allegestionstage** [alleges, egestions, onstage]: a brutal theatre critique.\n", "- **skymanipulablearsplittingler** [skyman, manipulable, blears, earsplitting, tinglers]: a nerve-damaging aviator.\n", "- **edinburgherselflesslylyricize** [edinburgh, burghers, herself, selflessly, slyly, lyricize]: a Scottish music review.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overall Design\n", "\n", "I originally thought I would define a major function, `portman`, to generate the portmantout string *S* from the set of words *W*, and a minor function, `is_portman`, to verify the result:\n", "\n", " portman(W: Wordset) -> str # Compute the string S from the set of words W\n", " is_portman(S: str, W: Wordset) -> bool # Verify that S is a valid portmantout covering W\n", "\n", "But then I realized that verification would be difficult. For example, *S* = `'helloworld'` would be rejected as non-overlapping if parsed as `'hello'` + `'world'`, but accepted if parsed as `'hello'` + `'low'` + `'world'`. It was hard for `is_portman` to decide which parse was intended, which is a shame because `portman` *knew* which was intended. 
\n", "\n", "To make everything explicit (and more efficient), I will construct not just a **list** of words *L*, but rather what I'll call a **path**, where each **step** in the path *P* consists of a word from *W* and an integer that specifies the number of characters in the overlap with the previous word. With that in mind, I decided on the following calling and [naming](https://en.wikipedia.org/wiki/Natalie_Portman) conventions:\n", "\n", " natalie(W: Wordset) -> Path # Find a portmantout path P for a set of words W\n", " portman(P: Path) -> str # Compute the string S represented by the path P\n", " is_portman(P: Path, W: Wordset) -> bool # Verify that P is a valid path covering W\n", "\n", "Thus I can generate a portmantout string *S* with:\n", "\n", " S = portman(natalie(W))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Data Types and Functions\n", "\n", "Here I describe how to implement the main data types in Python:\n", "\n", "- **Word**: a `str`.\n", "- **Wordset**: a `set` of words.\n", "- **Step**: a named tuple of the overlap (the length of the prefix that matches the previous word's suffix) and a word. 
\n", "- **Path**: a `list` of steps.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict, Counter, namedtuple\n", "from typing import Iterable\n", "\n", "Word = str\n", "Step = namedtuple('Step', ('overlap', 'word'))\n", "Path = list[Step]\n", "Wordset = set[Word]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the functions `portman` and `is_portman`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def portman(P: Path) -> Word:\n", " \"\"\"Compute the portmantout string S from the path P.\"\"\"\n", " # Concatenate the non-overlapping part of each step\n", " return ''.join(word[overlap:] for (overlap, word) in P)\n", "\n", "def is_portman(P: Path, W: Wordset) -> bool:\n", " \"\"\"Is the Path P a portmantout of the Wordset W?\"\"\"\n", " S = portman(P)\n", " return (all(word in S for word in W) and # 1. Every word in W is a substring of S\n", " all(step.overlap > 0 for step in P[1:])) # 2. 
Every word (except first) overlaps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *W0*: A Tiny Wordset\n", "\n", "A tiny example wordset `W0`, path `P0`, and string `S0`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'dashikimonogrammarianarchy'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "W0 = {'anarchy', 'dashiki', 'grammarian', 'kimono', 'monogram',\n", " 'a', 'am', 'an', 'arc', 'arch', 'aria', 'as', 'ash', 'dash', 'gram', \n", " 'grammar', 'i', 'mar', 'maria', 'mono', 'narc', 'no', 'on', 'ram'}\n", "\n", "P0 = [Step(0, 'dashiki'),\n", " Step(2, 'kimono'),\n", " Step(4, 'monogram'),\n", " Step(4, 'grammarian'),\n", " Step(2, 'anarchy')]\n", "\n", "assert is_portman(P0, W0)\n", "\n", "S0 = portman(P0)\n", "S0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Steps in a Path\n", "\n", "My intuition is that finding a shortest *S* is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find a guaranteed shortest possible string in a reasonable amount of time. A common approach to NP-hard problems is a **greedy algorithm**: make the locally best choice at each step, without ever undoing a step, in the hope that the steps will fit together into a solution that is not too far from the best solution. I distinguish two types of steps:\n", "\n", "- **Unused word step**: uses a word that is in *W*, has not been used in the path yet, and has a prefix that matches a suffix of the previous word.\n", "- **Bridging word step**: if no unused word step is possible, we need to do something to get back on track. I call that something a **bridge**: a step that repeats a previously-used word in order to provide a new suffix that will allow us to place an unused word on the next step. \n", "\n", "At each turn I will choose the step that minimizes the number of **excess letters** added to the path. 
That is, minimize the number of letters that a step adds, relative to a baseline model in which all the words are concatenated with no repeated words and no overlap between them. (That's not a valid solution, but it is useful as a benchmark.) So if a step adds an unused word, and it overlaps with the previous word by three letters, that is an excess of -3: I've saved three letters over just concatenating the unused word. (Note that a negative excess is a positive thing!) For a bridging word step, the excess is the number of letters that do not overlap either the previous word or the next word. So all unused word steps have negative excess and all bridging steps are non-negative. Therefore if there is an unused word step, we don't need to consider a bridging step.\n", "\n", "**Examples:** In each row of the table below, `'ajar'` is the previous word, but each row makes different assumptions about what unused words remain, and thus we get different choices for the step to take. The table shows the overlapping letters between the previous word and the step, and in the case of bridges, it shows the unused word that the step is bridging to. (Sometimes we need a bridge that is two words long, but with our word set *W* we will never need more than that.) The final column shows the excess letter score (and the actual excess letters).\n", "\n", "|Previous|Step(s)|Overlap|Bridge to|Type of Step(s)|Excess|\n", "|--------|----|----|---|---|---|\n", "| ajar|**jarring**|jar|–|*unused word* |-3| \n", "| ajar|**arbitrary**|ar|–|*unused word* |-2|\n", "| ajar|**rabbits**|r|–|*unused word*|-1|\n", "| ajar|**argot**|ar|goths|*one-step bridge* |0|\n", "| ajar|**arrow**|ar|owlets| *one-step bridge*|1 (r)|\n", "| ajar|**rani, iraq**|r|quizzed| *two-step bridge*|5 (anira) | \n", "\n", "Let's go over the examples:\n", "- **jarring**: Here we assume `jarring` is an unused word. 
It overlaps with `ajar` by 3 letters, giving it an excess cost of -3.\n", "- **arbitrary** and **rabbits**: unused words that overlap by fewer than 3 letters, so they would only be chosen if there were no unused words with more overlap.\n", "- **argot** and **arrow**: one-step bridges; the bridge with the least excess would be chosen.\n", "- **rani, iraq**: a two-step bridge. Suppose `quizzed` is the only remaining unused word. There is no single word that bridges from any suffix of `ajar` to any prefix of `quizzed`. But `rani` can bridge from `'r'` to `'i'` and `iraq` can bridge from `'i'` to `'q'`. This two-word bridge has an excess score of 5 due to the letters `anira` not overlapping anything.\n", "\n", "# Redefining Wordset for Efficiency\n", "\n", "With a 108,000-word wordset, the program will be slow if we have to consider every possible word on every step. Here are some questions about this:\n", "- How should we keep track of which words in *W* have been used in the path so far and which are unused?\n", "- How can we efficiently find words that match a suffix of the previous word?\n", "- How can we efficiently find a good bridge?\n", "- Can we eliminate some of the 108,000 words from consideration?\n", "- Can we prove that we will never get stuck in a dead end?\n", "\n", "To answer these questions I will redefine `Wordset` and `Path` so that they do some precomputation when they are constructed, caching information that will help make the choice of each step more efficient. We'll start with the `Wordset`:\n", "\n", "- We can eliminate from consideration all words that are **subwords** of other words. For example, given that the word 'scampi' is in our word list, we have to put it in a step at some point. But when we do, we've also automatically added the words 'a', 'am', 'amp', 'camp', 'campi', 'i', 'scam', and 'scamp'. So we can divide our words into two sets and store them as the `.subwords` and `.nonsubwords` attributes of the Wordset. 
When we start constructing a path we will know that we only have to cover the `.nonsubwords`.\n", "- Bridges should use short words. We don't need to consider `antidisestablishmentarianism` as a possible bridge word. We'll precompute a list of short words that are good candidates for bridges and store them on the `.short_words` attribute. I arbitrarily set the limit at five letters, except that a rare first or last letter (for that position) doesn't count toward the limit, so a short word can be up to 7 letters long.\n", "- A bridge may be used multiple times in the construction of a path. It makes sense to precompute all the reasonable bridges and store them under the `.bridges` attribute. We'll describe how to build bridges later." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class Wordset(set[Word]):\n", " \"\"\"A set of words, with cached data structures for efficiency.\"\"\"\n", " def __init__(self, words):\n", " W = self\n", " W.update(words)\n", " W.subwords = {part for w in W for part in subparts(w) if part in W}\n", " W.nonsubwords = W - W.subwords\n", " W.short_words = {w for w in W if is_short_word(w)}\n", " W.bridges = build_bridges(W)\n", "\n", "def subparts(word: Word) -> set[str]:\n", " \"\"\"All non-empty proper substrings of word.\"\"\"\n", " return {word[i:j] \n", " for i in range(len(word)) \n", " for j in range(i + 1, len(word) + int(i != 0))} # don't use word[0:len(word) + 1]\n", "\n", "def is_short_word(word: Word, maxlen=5, rare_start='jkqxyz', rare_end='qxfwzubvopikh') -> bool: \n", " \"\"\"Is this a short word, suitable for use in bridges?\"\"\"\n", " # Short words are a maximum of 5 letters, except we don't count rare first or last letters\n", " return len(word) - int(word[0] in rare_start) - int(word[-1] in rare_end) <= maxlen \n", " \n", "# TODO: build_bridges" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Redefining Path for Efficiency\n", "\n", "While building a path, we 
want to keep track of what words need to be added to the path, and we want to be able to quickly find words that have a prefix matching the previous step's suffix. So we will pre-compute two attributes on a path:\n", "\n", "- The `.unused_words` attribute holds the nonsubwords in *W* that have not yet been added to the path.\n", "- The `.unused_startswith` attribute is a dict that maps from each possible word prefix to a set of the remaining unused words that start with that prefix.\n", " - Example entry: `P.unused_startswith['somet'] == {'something', 'sometimes'}`.\n", "- These data structures are updated on each step. When `'dirt'` is added to the path, we remove it from `.unused_words` and also remove it from the `'d'`, `'di'`, and `'dir'` entries in `.unused_startswith`.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class Path(list[Step]):\n", " \"\"\"A list of steps, with cached data structures for efficiency.\"\"\"\n", " def __init__(self, steps, W: Wordset):\n", " self.unused_words = W.nonsubwords.copy()\n", " self.unused_startswith = startswith_table(self.unused_words)\n", " for step in steps:\n", " self.add_step(step)\n", "\n", " def add_step(self, step: Step) -> None:\n", " \"\"\"Add step to path; remove step's word from `.unused_words` \n", " and from `.unused_startswith[pre]` for each prefix `pre`.\"\"\"\n", " self.append(step)\n", " word = step.word\n", " if word in self.unused_words: # Maintain .unused_words and .unused_startswith\n", " self.unused_words.remove(word)\n", " for pre in prefixes(word):\n", " self.unused_startswith[pre].remove(word)\n", " if not self.unused_startswith[pre]: # clean up unused key\n", " del self.unused_startswith[pre]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def prefixes(word: Word) -> list[str]:\n", " \"\"\"All non-empty proper prefixes of word.\"\"\"\n", " return [word[:i] for i in range(1, len(word))]\n", "\n", "def 
startswith_table(words) -> dict[str, set[Word]]: \n", " \"\"\"A dict mapping a prefix to all the words it starts:\n", " {'somet': {'something', 'sometimes'}, ...}.\"\"\"\n", " return multimap((pre, w) for w in words for pre in prefixes(w)) \n", "\n", "def multimap(pairs) -> dict[object, set]:\n", " \"\"\"Given (key, val) pairs, make a dict of {key: {val,...}}.\"\"\"\n", " result = defaultdict(set)\n", " for key, val in pairs:\n", " result[key].add(val)\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building Bridges\n", "\n", "The last major piece of the program is the construction of the `W.bridges` table. Recall that we want `W.bridges[suf][pre]` to be the minimal-excess bridge between a suffix of the previous word and a prefix of an unused nonsubword in *W*. The words in a bridge may themselves be subwords; the purpose of a bridge is to reach an unused nonsubword. Example bridges:\n", "\n", " W.bridges['ar']['c'] == Bridge(excess=0, steps=[Step(2, 'arc')])\n", " W.bridges['ar']['ow'] == Bridge(excess=1, steps=[Step(2, 'arrow')])\n", " W.bridges['r']['q'] == Bridge(excess=5, steps=[Step(1, 'rani'), Step(1, 'iraq')])\n", " \n", "To build **one-word bridges**, we consider every short word and split it up in all possible ways into a prefix that will overlap the previous word, a suffix that will overlap the next word, and a count of zero or more excess letters in the middle that don't overlap anything. We suggest all these bridges, but for each suffix/prefix pair we only keep the bridge that has the minimal excess.\n", "\n", "We will also need **two-word bridges**. With only one-word bridges, the algorithm can get stuck in a dead end where there is neither an unused word nor a bridge that matches a suffix of the previous word. But if we added too many two-word bridges the algorithm would be slow, bogged down with all those bridges to check. 
Thus, I decided to only use two-word bridges that bridge from a single last letter in the previous word to a single first letter in a following word. If we can construct a bridge from every single letter to every other single letter, then we know the algorithm can never get stuck. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "Bridge = namedtuple('Bridge', 'excess, steps')\n", "Bridges = dict[str, dict[str, Bridge]] # bridges[pre][suf]: pre matches the previous word's suffix; suf matches the next word's prefix\n", "\n", "def build_bridges(W: Wordset):\n", " \"\"\"A table of bridges[pre][suf] == Bridge(excess, [Step(overlap, word)])\"\"\"\n", " short_startswith = startswith_table(W.short_words)\n", " bridges = defaultdict(dict)\n", " # One-word bridges: consider every way to split up every short word into prefix/excess/suffix\n", " for word in W.short_words:\n", " for (pre, excess, suf) in splits(word):\n", " suggest_bridge(bridges, word, pre, excess, suf)\n", " # Two-word bridges: only bridge from a single letter to a single letter\n", " for word1 in W.short_words:\n", " for suf in suffixes(word1): \n", " for word2 in short_startswith[suf]: \n", " excess = len(word1) + len(word2) - len(suf) - 2\n", " A, B = word1[0], word2[-1] # First and last letters\n", " step2 = Step(len(suf), word2)\n", " suggest_bridge(bridges, word1, A, excess, B, step2)\n", " return bridges\n", "\n", "def suggest_bridge(bridges: Bridges, word: Word, pre: str, excess: int, suf: str, step2=None):\n", " \"\"\"Store a new bridge if it has less excess than the previous bridges[pre][suf].\"\"\"\n", " if suf != pre and (suf not in bridges[pre] or excess < bridges[pre][suf].excess):\n", " steps = [Step(len(pre), word)]\n", " if step2: steps.append(step2)\n", " bridges[pre][suf] = Bridge(excess, steps)\n", "\n", "def splits(word: Word) -> list[tuple[str, int, str]]: \n", " \"\"\"Ways to split up word, as a list of (prefix, excess, suffix) tuples.\"\"\"\n", " return [(word[:i], excess, word[i+excess:])\n", " 
for excess in range(len(word) - 1)\n", " for i in range(1, len(word) - excess)]\n", "\n", "def suffixes(word: Word) -> list[str]:\n", " \"\"\"All non-empty proper suffixes of word, in longest first order.\"\"\"\n", " return [word[i:] for i in range(1, len(word))]\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an example of splitting the word 'arrow' into pieces that might overlap the previous and next word; the middle number is the excess of the bridge:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('a', 0, 'rrow'),\n", " ('ar', 0, 'row'),\n", " ('arr', 0, 'ow'),\n", " ('arro', 0, 'w'),\n", " ('a', 1, 'row'),\n", " ('ar', 1, 'ow'),\n", " ('arr', 1, 'w'),\n", " ('a', 2, 'ow'),\n", " ('ar', 2, 'w'),\n", " ('a', 3, 'w')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits('arrow')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Portmantout Path: natalie\n", "\n", "The function `natalie` does a greedy search for a portmantout path. As stated above, the approach is to start with a path of one word (either given as an optional argument or chosen arbitrarily from the nonsubwords in *W*), and then repeatedly add steps, each step coming from either `unused_word_step` or `bridging_steps`. Since the function `unused_word_step` will return zero or one step, and `bridging_steps` will return one or two steps, I decided the simplest interface is to have them both return a list of steps.\n", "\n", "`unused_word_step` considers every suffix of the previous word, where the function `suffixes` is guaranteed to order the longest suffixes first. If a suffix starts an unused word, we choose it. 
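As an illustration, here is a toy sketch of the longest-suffix-first lookup, with a made-up `unused_startswith` table rather than the real cached one:

```python
def suffixes(word):
    """All non-empty proper suffixes, longest first (same as the notebook's version)."""
    return [word[i:] for i in range(1, len(word))]

# Made-up table: which unused words start with which prefix
unused_startswith = {'jar': {'jarring'}, 'ar': {'arbitrary'}, 'r': {'rabbits'}}

# Try suffixes of the previous word, longest first; take the first that matches
match = next(suf for suf in suffixes('ajar') if suf in unused_startswith)
print(match)  # jar -- so 'jarring' is chosen, with the maximal 3-letter overlap
```

Because the suffixes come longest first, the first match necessarily has the largest possible overlap.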
Since we're going longest-suffix first, no other word choice could do better on the excess letters metric.\n", "\n", "If there is no `unused_word_step` that works for the previous word, we try the function `bridging_steps`, which considers every suffix of the previous word, and for each one it looks in the `W.bridges[suf]` table to see what prefixes `pre` we can bridge to from this suffix. It collects all the bridges `W.bridges[suf][pre]` where `pre` starts some currently unused word. Finally, out of all the possible bridges, it chooses the one with minimal excess cost." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def natalie(W: Wordset, start=None) -> Path:\n", " \"\"\"Return a portmantout path containing all words in W. You can optionally give the start word.\"\"\"\n", " first_step = Step(0, start or first(W.nonsubwords))\n", " P = Path([first_step], W)\n", " while P.unused_words:\n", " steps = unused_word_step(P) or bridging_steps(W, P)\n", " for step in steps:\n", " P.add_step(step)\n", " return P\n", "\n", "def unused_word_step(P: Path) -> list[Step]:\n", " \"\"\"Return [Step(overlap, unused_word)] or [].\"\"\"\n", " prev_word = P[-1].word\n", " for suffix in suffixes(prev_word):\n", " if suffix in P.unused_startswith and (unused_word := first(P.unused_startswith[suffix])):\n", " return [Step(len(suffix), unused_word)]\n", " return []\n", "\n", "def bridging_steps(W: Wordset, P: Path) -> list[Step]:\n", " \"\"\"The steps from the minimal-excess bridge that bridges \n", " from a suffix of the previous word to a prefix of any unused word.\"\"\"\n", " prev_word = P[-1].word\n", " bridges = [W.bridges[suf][pre] \n", " for suf in suffixes(prev_word) if suf in W.bridges\n", " for pre in W.bridges[suf] if P.unused_startswith[pre]]\n", " return min(bridges).steps # Choose the bridge with minimal excess\n", "\n", "def first(iterable) -> object | None:\n", " \"\"\"The first element in an iterable, or None\"\"\"\n", " 
return next(iter(iterable), None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *W*: Tom Murphy's Wordset \n", "\n", "Tom Murphy has a 108,709-word file `\"wordlist.asc\"`, which we can make into a `Wordset` called `W`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\n", "aahed\n", "aahing\n", "aardvark\n", "aardvarks\n", "aardwolf\n", "abaci\n", "aback\n", "abacus\n", "abacuses\n" ] } ], "source": [ "! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc\n", "! head wordlist.asc" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "W = Wordset(open('wordlist.asc').read().split()) \n", "assert len(W) == 108709" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Portmantout Solutions\n", "\n", "**Finally!** We're ready to make a long portmantout. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.59 s, sys: 18.4 ms, total: 3.61 s\n", "Wall time: 3.61 s\n" ] }, { "data": { "text/plain": [ "(554039, 103445)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time P = natalie(W)\n", "S = portman(P)\n", "len(S), len(P)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just a few seconds of run time; great! The portmantout has over half a million letters and more than 100,000 steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pretty Output Reports\n", "\n", "Notice I haven't actually *looked* at the portmantout yet. Even with a tiny font it would be over 100 pages. 
Instead, I'll define `report` to print various statistics, summarize the beginning and end of the portmantout, and optionally save the full string *S* into a file. \n", "\n", "To verify the path is valid, I will redefine `is_portman` to be faster. *Python trivia:* if `X`, `Y`, and `Z` are sets, `X <= Y <= Z` means \"is `X` a subset of `Y` and `Y` a subset of `Z`?\" We use the notation here to say that the set of words in *P* must contain all the nonsubwords and can only contain words from *W*.\n", "\n", "A step such as `Step(2, 'hello')` is printed as `he⋅llo` to indicate that `he` is the 2-letter overlap with the previous word." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def report(W: Wordset, P: Path = None, steps=100, letters=500, save='natalie.txt'):\n", " P = P or natalie(W)\n", " S = portman(P)\n", " sub = W.subwords \n", " nonsub = W.nonsubwords\n", " uniq = len({step.word for step in P}) # unique step words in P\n", " bridge = len(P) - len(nonsub) # number of bridge steps in P\n", " bridges = sum(len(W.bridges[pre]) for pre in W.bridges) # number of bridges in W\n", " counts = Counter(len(W.bridges[pre][suf].steps) for pre in W.bridges for suf in W.bridges[pre])\n", " def L(words) -> int: return sum(map(len, words)) # Number of letters\n", " print(f'S has {len(S):,d} letters and a compression ratio of {L(W)/len(S):.3f}\n'\n", " f'P has {len(P):,d} steps ({bridge:,d} bridge steps), '\n", " f'average overlap {(L(s.word for s in P)-len(S))/(len(P)-1):.2f} letters\n'\n", " f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords, {len(sub):,d} subwords, '\n", " f'{len(W.short_words):,d} short words)\n'\n", " f'There are {bridges:,d} bridges ({counts[1]:,d} one-step, {counts[2]} two-step, '\n", " f'{len(missing_bridges(W)) or \"none\"} missing)'\n", " )\n", " if save: \n", " open(save, \"w\").write(S)\n", " print(f'S saved as the file \"{save}\".')\n", " print(f'\nThe first and last {letters} letters of 
S:\\n\\n{S[:letters]}\\n...\\n{S[-letters:]}')\n", " stepS0 = ', '.join(w[:i] + '⋅' + w[i:] for i, w in P[:steps])\n", " steps2 = ', '.join(w[:i] + '⋅' + w[i:] for i, w in P[-steps:])\n", " print(f'\\nThe first and last {steps} steps:\\n\\n{stepS0}\\n...\\n{steps2}')\n", " assert is_portman(P, W)\n", "\n", "def is_portman(P: Path, W: Wordset) -> str:\n", " \"\"\"Verify that P forms a valid portmantout string for W:\n", " That it uses every word and the overlap of each word matches the previous word.\"\"\"\n", " uses_words = (W.nonsubwords) <= set(step.word for step in P) <= W\n", " overlaps_match = all((overlap == 0) if (i == 0) \n", " else (overlap > 0 and P[i - 1].word[-overlap:] == word[:overlap])\n", " for i, (overlap, word) in enumerate(P[1:], 1))\n", " return uses_words and overlaps_match\n", "\n", "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n", "letter_pairs = [A + B for A in alphabet for B in alphabet if A != B]\n", "\n", "def missing_bridges(W: Wordset) -> list[str]:\n", " \"\"\"What 1-letter-suffix to 1-letter-prefix bridges are missing from W.bridges?\"\"\"\n", " return [A + B for (A, B) in letter_pairs if B not in W.bridges[A]]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "S has 554,039 letters and a compression ratio of 1.682\n", "P has 103,445 steps (39,056 bridge steps), average overlap 1.67 letters\n", "W has 108,709 words (64,389 nonsubwords, 44,320 subwords, 11,536 short words)\n", "There are 70,300 bridges (70,155 one-step, 145 two-step, none missing)\n", "S saved as the file \"natalie.txt\".\n", "\n", "The first and last 500 letters of S:\n", "\n", 
"pickupsilonshorebirdseyestrainmasterworkshopsackingshipshapeupstrokestrelsewherebypasseduciblencheddarsenalsoftenedithersiticalkedgeditorializingingeringsidesteppedologiestrangingkoalaskashmirscarecrowstepsistersestinestimablyricistsimmeshinglesseningrafteddyingsquealingeringlyceroselysiantonymousingsongstersergessoesophagustatoriallylsaprophyticallymphoidiumlautediumsubdisciplinesmanicuredressingstutteredoundeducingeniouslyerbasslybootstrapshootingsurrogacylindersatzestfulnessestetsonshipsideca\n", "...\n", "meteraniraquaggieraniraquayagesiraquotablyetiraquoitedeliraquoitsiraquartzesiraquartosiraquackiestoqueeredeliraqaidsiraquarriedeliraquantitativelyetiraquackieraniraquanticolloquarrymenoiraquaverersiraqindarsiraquasarsiraquarriersiraquarrelsomemiraquantaliraquantumagiraquotersiraquackingorkiraquartilesiraquarrellingorkiraquaggyetiraquantifiesiraquaveryetiraquailedeliraquarreledeliraquartesiraquarantiningorkiraquainteraniraquonsetapiraquackishnessiraqophsiraquotationallyetiraquantityetiraquackster\n", "\n", "The first and last 100 steps:\n", "\n", "⋅pickups, ups⋅ilons, ons⋅hore, shore⋅birds, birds⋅eyes, eyes⋅train, train⋅master, master⋅works, works⋅hops, hops⋅acking, sacking⋅s, kings⋅hips, ships⋅hape, shape⋅ups, ups⋅trokes, kes⋅trels, els⋅ewhere, where⋅by, by⋅passed, sed⋅ucible, ble⋅nched, ched⋅dars, ars⋅enals, als⋅o, so⋅ftened, ed⋅ith, dith⋅ers, thers⋅itical, cal⋅ked, ked⋅ged, ed⋅itorializing, zing⋅ing, ging⋅ering, ring⋅sides, sides⋅tepped, ped⋅ologies, logies⋅t, est⋅ranging, ging⋅ko, ko⋅alas, alas⋅kas, kas⋅hmirs, s⋅carecrows, crows⋅teps, steps⋅isters, ters⋅est, sest⋅ines, ines⋅timably, ly⋅ricists, ts⋅immes, immes⋅hing, shing⋅les, les⋅sening, ing⋅rafted, ted⋅dy, eddy⋅ing, dying⋅s, s⋅quealing, ling⋅eringly, gly⋅cerose, erose⋅ly, ely⋅sian, an⋅tonymous, mous⋅ings, sings⋅ongs, songs⋅ters, ters⋅er, ser⋅ges, ges⋅soes, oes⋅ophagus, gus⋅tatorially, ally⋅ls, s⋅aprophytically, ly⋅mphoid, oid⋅ium, um⋅lauted, ted⋅iums, s⋅ubdisciplines, lines⋅man, man⋅icured, red⋅ressing, 
dressing⋅s, s⋅tuttered, red⋅ounded, ded⋅ucing, ing⋅eniously, sly⋅er, yer⋅bas, bas⋅sly, sly⋅boots, boots⋅traps, traps⋅hooting, shooting⋅s, s⋅urrogacy, cy⋅linders, ers⋅atzes, zes⋅tfulness, fulness⋅es, ses⋅tets, stets⋅ons\n", "...\n", "s⋅ir, ir⋅aq, q⋅uartos, s⋅ir, ir⋅aq, q⋅uackiest, t⋅oque, que⋅ered, d⋅eli, i⋅raq, q⋅aids, s⋅ir, ir⋅aq, q⋅uarried, d⋅eli, i⋅raq, q⋅uantitatively, y⋅eti, i⋅raq, q⋅uackier, r⋅ani, i⋅raq, q⋅uantic, c⋅olloq, q⋅uarrymen, n⋅oir, ir⋅aq, q⋅uaverers, s⋅ir, ir⋅aq, q⋅indars, s⋅ir, ir⋅aq, q⋅uasars, s⋅ir, ir⋅aq, q⋅uarriers, s⋅ir, ir⋅aq, q⋅uarrelsome, e⋅mir, ir⋅aq, q⋅uantal, l⋅ira, ira⋅q, q⋅uantum, m⋅agi, i⋅raq, q⋅uoters, s⋅ir, ir⋅aq, q⋅uacking, g⋅orki, i⋅raq, q⋅uartiles, s⋅ir, ir⋅aq, q⋅uarrelling, g⋅orki, i⋅raq, q⋅uaggy, y⋅eti, i⋅raq, q⋅uantifies, s⋅ir, ir⋅aq, q⋅uavery, y⋅eti, i⋅raq, q⋅uailed, d⋅eli, i⋅raq, q⋅uarreled, d⋅eli, i⋅raq, q⋅uartes, s⋅ir, ir⋅aq, q⋅uarantining, g⋅orki, i⋅raq, q⋅uainter, r⋅ani, i⋅raq, q⋅uonset, t⋅apir, ir⋅aq, q⋅uackishness, s⋅ir, ir⋅aq, q⋅ophs, s⋅ir, ir⋅aq, q⋅uotationally, y⋅eti, i⋅raq, q⋅uantity, y⋅eti, i⋅raq, q⋅uackster\n" ] } ], "source": [ "report(W, P)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# *W1*: Near Subwords\n", "\n", "In looking over the output reports, I notice that there are some pairs of words that are **near-subwords** of each other, for example:\n", "\n", " preorganization uncharacteristic microclimatological\n", " reorganizations characteristics climatologically\n", " \n", "In each of these pairs, neither word is a subword of the other, but they share a large overlap. Now if our greedy search happened to pick one of the words from the top row above, all is well: it could pick the corresponding word in the bottom row next and have a nice overlap. 
But if it happened to pick a word from the bottom row first, we would have lost the chance for that overlap.\n", "\n", "I could modify my program to make sure it picks 'preorganization' before 'reorganizations', but it seemed easier to leave the program as is, and create a new word list that contains none of the above words, but instead adds these pseudo-words:\n", "\n", "    preorganizations   uncharacteristics   microclimatologically\n", "\n", "To explore this, I'll start by making a list of (length-of-overlap, first-word, second-word) triples. At first I limited this to long overlaps, but I found that even two-letter overlaps can help: \"haydn\" and \"dnieper\" naturally go together because they are the only two words that have the affix \"dn\"." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def near_subword_triples(W: Wordset, min_overlap=2) -> list[tuple[int, str, str]]:\n", "    P = Path([], W)\n", "    startswith = P.unused_startswith\n", "    endswith = multimap((suf, w) for w in P.unused_words for suf in suffixes(w))\n", "    triples = []\n", "    for x in endswith:\n", "        while endswith[x] and startswith[x] and len(x) >= min_overlap:\n", "            triples.append((len(x), endswith[x].pop(), startswith[x].pop()))\n", "    return sorted(triples, reverse=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21933" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "triples = near_subword_triples(W)\n", "len(triples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine some of the triples. 
First the ones with the most overlap:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(15, 'overdiversification', 'diversifications'),\n", " (15, 'nondifferentiation', 'differentiations'),\n", " (14, 'unrepresentative', 'representatively'),\n", " (14, 'uncharacteristic', 'characteristics'),\n", " (14, 'preorganization', 'reorganizations'),\n", " (14, 'preconstruction', 'reconstructions'),\n", " (14, 'overspecialization', 'specializations'),\n", " (14, 'overgeneralization', 'generalizations'),\n", " (14, 'nonrepresentative', 'representatives'),\n", " (14, 'nondiscrimination', 'discriminational'),\n", " (14, 'misadministration', 'administrations'),\n", " (14, 'microclimatological', 'climatologically'),\n", " (14, 'maladministration', 'administrational'),\n", " (14, 'interdenominational', 'denominationally'),\n", " (14, 'indiscrimination', 'discriminations'),\n", " (14, 'consubstantiation', 'substantiations'),\n", " (13, 'unprepossessing', 'prepossessingness'),\n", " (13, 'unjustification', 'justifications'),\n", " (13, 'uncompassionate', 'compassionately'),\n", " (13, 'uncommunicative', 'communicativeness')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "triples[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every 2000th one:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(15, 'overdiversification', 'diversifications'),\n", " (6, 'lidless', 'idlesses'),\n", " (5, 'barnyards', 'yardsticks'),\n", " (4, 'leningrad', 'gradually'),\n", " (3, 'washbasins', 'insoul'),\n", " (3, 'pikestaff', 'affirmatives'),\n", " (3, 'foeman', 'mangier'),\n", " (3, 'aftermarket', 'ketchups'),\n", " (2, 'schwas', 'asphalting'),\n", " (2, 'lymphoid', 'idiosyncratic'),\n", " (2, 'dispraise', 'serviceableness')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": 
[ "triples[::2000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I'll create a function to take a wordlist as input and convert it to a new wordlist where, for example, 'preorganization' and 'reorganizations' will be removed from the wordlist, replaced with the new pseudo-word 'preorganizations'." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def pseudo_wordlist(W: Wordset) -> Wordset:\n", " \"\"\"Make a new wordlist that merges near-subwords together.\"\"\"\n", " removed = set()\n", " added = set()\n", " for (x, word1, word2) in near_subword_triples(W):\n", " if word1 not in removed and word2 not in removed:\n", " removed |= {word1, word2}\n", " added |= {word1 + word2[x:]}\n", " return Wordset(added | W - removed)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "S has 543,832 letters and a compression ratio of 1.625\n", "P has 91,133 steps (38,344 bridge steps), average overlap 1.45 letters\n", "W has 97,110 words (52,789 nonsubwords, 44,321 subwords, 10,571 short words)\n", "There are 63,694 bridges (63,543 one-step, 151 two-step, none missing)\n", "S saved as the file \"natalie-nearsubwords.txt\".\n", "\n", "The first and last 500 letters of S:\n", "\n", "impermanencephalitictocsinspirationallyricistsarinassignerstwhilestrangingerbreadinghiestrayingrafteddyingatemendationshorebirdshucksteredemonstrationalizationstageyriesquiredacteditherythematiteslashedgehoppingrainediblesophagalenastinessentiallycanthropyrotechnicalitiescrowsudsiestanchedgepigstickedemasculinizingressiveracityfieditorializedshepherdswomanhoodsubdisciplinescapablerraticallyristsarismsatyriasisteredresservalsidlinglycogenictatedematousledgehammersouthboundismayedgewaysideswiperse\n", "...\n", 
"siraquakieraniraquaveryetiraqianaemicolloquailedeliraquailingeniiraquarrelsometimesiraquartesiraquarreledeliraquantumbleweedsiraquandoozeraniraquanticolloquantitiesiraquarantiningeniiraquavereductionistapiraquarterlyetiraquaggiestapiraquoinsusceptibilityetiraquaffingerprintsiraquartermastersiraquonsetlinesiraquotientsiraquaveringlyetiraquagsiraquackedeliraquaversiraquarterbacksawsiraquaverersiraquartanoiraquackishnessiraqindarsiraqophsiraqintarsiasiraquotationallyetiraquailsiraquasarsiraquantity\n", "\n", "The first and last 100 steps:\n", "\n", "⋅impermanencephalitic, tic⋅tocsins, ins⋅pirationally, ly⋅ricists, ts⋅arinassigners, ers⋅twhiles, es⋅tranging, ging⋅erbreading, ding⋅hies, es⋅traying, ing⋅rafted, ted⋅dying, ing⋅atemen, emen⋅dations, ons⋅horebirds, s⋅huckstered, red⋅emonstrational, rational⋅izations, ons⋅tagey, ey⋅ries, es⋅quiredacted, ed⋅ithery, ery⋅thematites, tes⋅lashed, hed⋅gehopping, ing⋅rained, ined⋅ibles, es⋅ophagalenas, nas⋅tiness, iness⋅entially, ly⋅canthropy, py⋅rotechnicalities, es⋅crows, s⋅udsiestanched, hed⋅gepigsticked, ed⋅emasculinizing, ing⋅ressive, ve⋅racityfied, ed⋅itorialized, zed⋅s, s⋅hepherdswoman, woman⋅hoods, s⋅ubdisciplines, ines⋅capabler, er⋅ratically, ly⋅rists, ts⋅arisms, s⋅atyriasistered, red⋅resservals, s⋅idlinglycogenic, nic⋅tated, ed⋅ematous, tous⋅ledgehammers, s⋅outhboundismayed, ed⋅geways, ways⋅ideswipers, pers⋅ecutesy, sy⋅mbolizes, zes⋅tiestriates, tes⋅typsis, sis⋅tering, ring⋅sidestepped, ped⋅ologiest, est⋅ablisherpas, pas⋅calkaloids, s⋅idebands, s⋅ubregions, s⋅hooks, s⋅wishers, hers⋅elfing, fing⋅ertipsters, s⋅trabismuskegs, s⋅umosquitoes, oes⋅ophagus, gus⋅hymnals, s⋅wastikas, kas⋅hasidic, dic⋅tations, s⋅alviassenters, s⋅lenderestructuring, ring⋅necks, s⋅carring, ring⋅leadership, ship⋅building, ing⋅rates, tes⋅tierces, ces⋅urae, ae⋅rated, ed⋅ifies, es⋅trumming, ing⋅eniously, sly⋅er, yer⋅baselessly, sly⋅est, est⋅radiologists, ts⋅aritzas, as⋅surors, s⋅caldicotyledons, s⋅calypsos, os⋅tensible\n", "...\n", "y⋅eti, i⋅raq, 
q⋅ianaemic, c⋅olloq, q⋅uailed, d⋅eli, i⋅raq, q⋅uailing, g⋅enii, i⋅raq, q⋅uarrelsometimes, s⋅ir, ir⋅aq, q⋅uartes, s⋅ir, ir⋅aq, q⋅uarreled, d⋅eli, i⋅raq, q⋅uantumbleweeds, s⋅ir, ir⋅aq, q⋅uandoozer, r⋅ani, i⋅raq, q⋅uantic, c⋅olloq, q⋅uantities, s⋅ir, ir⋅aq, q⋅uarantining, g⋅enii, i⋅raq, q⋅uavereductionist, t⋅apir, ir⋅aq, q⋅uarterly, y⋅eti, i⋅raq, q⋅uaggiest, t⋅apir, ir⋅aq, q⋅uoinsusceptibility, y⋅eti, i⋅raq, q⋅uaffingerprints, s⋅ir, ir⋅aq, q⋅uartermasters, s⋅ir, ir⋅aq, q⋅uonsetlines, s⋅ir, ir⋅aq, q⋅uotients, s⋅ir, ir⋅aq, q⋅uaveringly, y⋅eti, i⋅raq, q⋅uags, s⋅ir, ir⋅aq, q⋅uacked, d⋅eli, i⋅raq, q⋅uavers, s⋅ir, ir⋅aq, q⋅uarterbacksaws, s⋅ir, ir⋅aq, q⋅uaverers, s⋅ir, ir⋅aq, q⋅uartan, n⋅oir, ir⋅aq, q⋅uackishness, s⋅ir, ir⋅aq, q⋅indars, s⋅ir, ir⋅aq, q⋅ophs, s⋅ir, ir⋅aq, q⋅intarsias, s⋅ir, ir⋅aq, q⋅uotationally, y⋅eti, i⋅raq, q⋅uails, s⋅ir, ir⋅aq, q⋅uasars, s⋅ir, ir⋅aq, q⋅uantity\n" ] } ], "source": [ "W1 = pseudo_wordlist(W)\n", "report(W1, save='natalie-nearsubwords.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison of Results\n", "\n", "The number of letters for various versions:\n", "- 630,408 for Tom Murphy's\n", "- 554,039 for my *W*\n", "- 543,832 for my *W1*.\n", "\n", "\n", "To compare my program to [Tom Murphy's](https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/portmantout/): \n", "- I used a greedy approach that builds up a single long portmanteau, one step at a time. 
\n", "- Murphy first built a pool of smaller portmanteaux, then greedily joined them all together.\n", " - (I did a little joining-together with my `pseudo_wordlist` function.)\n", "- I used Python (about 200 lines for the program without the exploratory questions).\n", "- Murphy used C++ (1867 lines), with a lot of extra functionality I didn't do: generating diagrams and animations, and running multiple threads in parallel.\n", "\n", "I'm reminded of the [Traveling Salesperson Problem](TSP.ipynb) where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments. The two approaches are different, but they are both suboptimal greedy methods, and it is not clear whether one is better than the other. You could try it! \n", "\n", "(*English trivia:* my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a **bridge**. Murphy's program starts by building a large pool of small portmanteaux that he calls **particles**, and when he can build no more particles, his next step is to put two particles together, so he calls it a **join**. The different metaphors for what our programs are doing lead to different terminology for the same idea.)\n", "\n", "It appears Murphy perhaps didn't quite have the complete concept of **subwords**. He did mention that when he adds `'bulleting'`, he crosses `'bullet'` and `'bulletin'` off the list, but somehow [his string](http://tom7.org/portmantout/murphy2015portmantout.pdf) contains both `'spectacular'` and `'spectaculars'`. My guess is that if he adds `'spectaculars'` first he crosses off `'spectacular'`, but if he happens to add `'spectacular'` first, he will later add `'spectaculars'`. 
Support for this view is that his output in `bench.txt` says \"I skipped 24319 words that were already substrs.\" but I computed that there are 44,320 such subwords; he found about half of them. I think those missing 20,001 words are the main reason why my strings are shorter.\n", "\n", "Also, Murphy's joins are always between one-letter prefixes and suffixes. I allow longer prefixes and suffixes for one-word bridges.\n", "\n", "I should say that I stole one important trick from Murphy. After I finished the first version of my program, I looked at his highly entertaining [video](https://www.youtube.com/watch?time_continue=1&v=QVn2PZGZxaI) and [paper](http://tom7.org/portmantout/murphy2015portmantout.pdf) and I noticed that I had a problem in my use of bridges. My `natalie` function originally contained something like this:\n", "\n", "    step = unused_word_step(...) or one_word_bridge(...) or two_word_bridge(...)\n", "\n", "That is, I only considered two-word bridges when there was no one-word bridge, on the assumption that one word is shorter than two. But Murphy showed that my assumption was wrong: for `bridges['w']['c']` I had `'workaholic'` as the best one-word bridge, but he had the two-word bridge `'war' + 'arc' = 'warc'`, which saves six excess letters over my single word. After seeing that, I shamelessly copied his approach, and now I too get a two-letter excess for `bridges['w']['c']`. 
(Sometimes `'war' + 'arc'` and sometimes `'wet' + 'etc'` or `'we' + 'etc'`, depending on the seed for the hash function that hashes strings into a word set.)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Bridge(excess=2, steps=[Step(overlap=1, word='wet'), Step(overlap=2, word='etc')])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "W.bridges['w']['c']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further Work\n", "\n", "Here are some things you could do to make the portmantouts more interesting:\n", "\n", "- Use linguistic resources (such as [pretrained word embeddings](https://nlp.stanford.edu/projects/glove/)) to teach your program what words are related to each other. Encourage the program to place related words next to each other. Maybe even make grammatical sentences.\n", "- Use linguistic resources (such as [NLTK](https://github.com/nltk/)) to teach your program where syllable breaks are in words, and what each syllable sounds like. Encourage the program to make overlaps match syllables. (That's why \"preferendumdums\" sounds better than \"fortyphonshore\".)\n", "\n", "Here are some things you could do to make *S* shorter:\n", "\n", "- **Lookahead**: Unused words are chosen based on the degree of overlap, but nothing else. It might help to prefer unused words which have a suffix that matches the prefix of another unused word. A single-word lookahead or a beam search could be used.\n", "- **Word choice ordering**: Perhaps `startswith_table` could sort the words in each key's bucket so that the \"difficult\" words (say, the ones that end in unusual letters) are encountered earlier in the program's execution, when there are more available words for them to connect to.\n", "- **Expected excess**: The greedy approach minimizes the number of excess letters for each step. But some words are harder to place than others. 
Instead of just minimizing the excess, consider also the *expected* excess of each word, which could be learned.\n", "\n", "Here are some things you could do to make the program more robust:\n", "\n", "- Write and run unit tests.\n", "- Find other word lists, perhaps in other languages, and try the program on them.\n", "- Try word lists such as a list of names, or cities or countries, augmented with common short words.\n", "- Consider what to do for a wordset that has missing bridges. You could try three-word bridges; you could allow the program to back up and remove a previously-placed word; or you could allow the addition of words to the start as well as the end of `P`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Questions\n", "\n", "The program is complete, but there are still many interesting things to explore, and questions to answer.\n", "\n", "**Question: is there an imbalance in starting and ending letters of words?** That could lead to a need for many bridges. We saw in the last 100 steps of *P* multiple repetitions of the two-word bridge \"s⋅ir, ir⋅aq\". That suggests there are too many words that end in \"s\" and too many that start with \"q\". Let's investigate. I'll make a table that shows, for each letter, how many words start with that letter and how many end with that letter." 
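The tally itself needs nothing beyond `collections.Counter`. As a minimal self-contained sketch of the idea (using a made-up toy word list, not the notebook's real `W.nonsubwords`):

```python
from collections import Counter

# Toy stand-in for the real word list (hypothetical example).
toy = ['hello', 'lowbrow', 'world', 'sap', 'sob', 'pickups']

starts = Counter(w[0] for w in toy)   # how many words start with each letter
ends = Counter(w[-1] for w in toy)    # how many words end with each letter

for L in sorted(set(starts) | set(ends)):
    print(L, starts[L], ends[L])
```

The cell below does the same thing over all the nonsubwords, with aligned columns and an approximate start:end ratio.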
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Letter Starts Ends Ratio\n", "------ ------ ------ -----\n", "    a  3,528    384   9:1\n", "    b  3,776      6 629:1\n", "    c  5,849    908   6:1\n", "    d  4,093  7,520   1:2\n", "    e  2,470  3,215   1:1\n", "    f  2,794     51  55:1\n", "    g  2,177  6,343   1:3\n", "    h  2,169    351   6:1\n", "    i  2,771    128  22:1\n", "    j    638      0   1:0\n", "    k    566    157   4:1\n", "    l  1,634  1,182   1:1\n", "    m  3,405    657   5:1\n", "    n  1,542  1,860   1:1\n", "    o  1,797    113  16:1\n", "    p  4,977    123  40:1\n", "    q    330      0   1:0\n", "    r  3,811  1,994   2:1\n", "    s  7,388 29,056   1:4\n", "    t  3,097  2,107   1:1\n", "    u  2,557     11 232:1\n", "    v  1,032      6 172:1\n", "    w  1,561     42  37:1\n", "    x     51     68   1:1\n", "    y    207  8,086  1:39\n", "    z    169     21   8:1\n" ] } ], "source": [ "words = W.nonsubwords\n", "starts = Counter(w[0] for w in words)\n", "ends = Counter(w[-1] for w in words)\n", "\n", "def ratio(L: str) -> str:\n", "    \"\"\"Approximate ratio of words that start with L to words that end with L.\"\"\"\n", "    s, e = starts[L], ends[L]\n", "    return f'{round(s/e)}:1' if (s > e and e != 0) else f'1:{round(e/s)}'\n", "\n", "print('Letter Starts Ends Ratio')\n", "print('------ ------ ------ -----')\n", "for L in sorted(starts):\n", "    print(f'{L:>5} {starts[L]:6,d} {ends[L]:6,d} {ratio(L):>5}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes, there is a problem: there are many more words that start with `b`, `f`, `p`, `u`, and `v` than end with those letters. In the other direction, 45% of all words end in `s`, but only a quarter of that number start with `s`. The start:end ratio for `y` is 1:39." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.451257202317166" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ends['s'] / len(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: what are the most common words in path *P*?** \n", "\n", "These will be bridge words. What do they have in common?" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('sap', 2447),\n", " ('so', 2208),\n", " ('of', 1975),\n", " ('lyre', 1666),\n", " ('sun', 1514),\n", " ('sin', 1422),\n", " ('sic', 1399),\n", " ('sam', 1089),\n", " ('yaw', 972),\n", " ('sob', 933),\n", " ('go', 835),\n", " ('lye', 707)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(step.word for step in P).most_common(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, bridging away from `s` is a big concern (half of the top dozen bridges). Also, `lyre` and `lye` bridge away from an adverb ending, `ly` (as can `yaw`)." 
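One way to see why these particular little words keep recurring: a useful bridge word *starts* with a letter that many words end with, and *ends* with a letter that many words start with. A rough scoring sketch (the counts are taken from the Starts/Ends table above, but the scoring function itself is my own illustration, not part of the program):

```python
from collections import Counter

# Counts copied from the Starts/Ends table above, for a few letters.
starts = Counter({'s': 7388, 'p': 4977, 'o': 1797, 'n': 1542})
ends = Counter({'s': 29056, 'y': 8086, 'p': 123, 'o': 113})

def bridge_score(w: str) -> int:
    """Rough usefulness of w as a bridge: how often we might need to leave
    w's first letter, times how often we can continue from its last letter."""
    return ends[w[0]] * starts[w[-1]]

# 'sap' leaves the very common ending 's' and lands on the common start 'p';
# its reversal 'pas' would be far less useful.
assert bridge_score('sap') > bridge_score('pas')
```

This is consistent with the observation that half of the top dozen bridge words start with `s`.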
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: What is the distribution of word lengths?** " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAk0AAAGwCAYAAAC0HlECAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjYsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvq6yFwwAAAAlwSFlzAAAPYQAAD2EBqD+naQAANx1JREFUeJzt3Xt0FPXdx/HPkoRwMSwESJbVcFEjEBMBg4UAJaFAgBIi0hY1bYpHCvSAYAREKFIjSkCUS0uqAvIIFZA+5yjWFt0SLMQiV4NRQQRRkIAJsRA2BDEJyTx/+DjHJVxmccNuwvt1zpzD/OY7s9+Zs8jH387O2gzDMAQAAIDLauDvBgAAAOoCQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwINjfDdQn1dXV+uqrrxQWFiabzebvdgAAgAWGYejMmTNyOp1q0ODS80mEJh/66quvFBUV5e82AADAVSgoKNBNN910ye2EJh8KCwuT9N1Fb9asmZ+7AQAAVpSWlioqKsr8d/xSCE0+9P1Hcs2aNSM0AQBQx1zp1hpuBAcAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABggV9D07vvvqthw4bJ6XTKZrPpjTfeMLdVVlbqscceU1xcnJo2bSqn06nf/va3+uqrrzyOUV5erokTJ6pVq1Zq2rSpUlNTdezYMY+akpISpaeny263y263Kz09XadPn/aoOXr0qIYNG6amTZuqVatWmjRpkioqKmrr1AEAQB3j19B09uxZdenSRdnZ2TW2ffPNN9qzZ49mzZqlPXv26PXXX9fBgweVmprqUZeRkaH169dr3bp12rp1q8rKypSSkqKqqiqzJi0tTfn5+XK5XHK5XMrPz1d6erq5vaqqSkOHDtXZs2e1detWrVu3Tq+99pqmTJlSeycPAADqFiNASDLWr19/2Zpdu3YZkowvv/zSMAzDOH36tBESEmKsW7fOrDl+/LjRoEEDw+VyGYZhGJ988okhydixY4dZs337dkOS8emnnxqGYRhvvfWW0aBBA+P48eNmzauvvmqEhoYabrfb8jm43W5Dklf7AAAA/7L673eduqfJ7XbLZrOpefPmkqS8vDxVVlYqOTnZrHE6nYqNjdW2bdskSdu3b5fdblePHj3Mmp49e8put3vUxMbGyul0mjWDBg1SeXm58vLyLtlPeXm5SktLPRYAAFA/1ZnQ9O2332r69OlKS0tTs2bNJElFRUVq2LChWrRo4VEbGRmpoqIisyYiIqLG8SIiIjxqIiMjPba3aNFCDRs2NGsuZu7cueZ9Una7XVFRUT/qHAEAQOAK9ncDVlRWVuq+++5TdXW1nn/++SvWG4Yhm81mrv/wzz+m5kIzZszQ5MmTzfXS0lKCUx3VfvqGq9rvyLyhPj0GACBwBfxMU2VlpUaOHKnDhw8rJyfHnGWSJIfDoYqKCpWUlHjsU1xcbM4cORwOnThxosZxv/76a4+aC2eUSkpKVFlZWWMG6odCQ0PVrFkzjwUAANRPAR2avg9Mn332mTZt2qSWLVt6bI+Pj1dISIhycnLMscLCQu3du1e9evWSJCUkJMjtdmvXrl1mzc6dO+V2uz1q9u7dq8LCQrNm48aNCg0NVXx8fG2eIgAAqCP8+vFcWVmZDh06ZK4fPnxY+fn5Cg8Pl9Pp1C9/+Uvt2bNH//znP1VVVWXOBoWHh6thw4ay2+0aPXq0pkyZopYtWyo
8PFxTp05VXFycBgwYIEnq3LmzBg8erDFjxmjp0qWSpLFjxyolJUUdO3aUJCUnJysmJkbp6el69tlnderUKU2dOlVjxoxh9ggAAEjyc2h6//331a9fP3P9+/uDRo0apczMTL355puSpK5du3rst3nzZiUlJUmSFi1apODgYI0cOVLnzp1T//79tXLlSgUFBZn1a9as0aRJk8xv2aWmpno8GyooKEgbNmzQ+PHj1bt3bzVu3FhpaWl67rnnauO0AQBAHWQzDMPwdxP1RWlpqex2u9xuNzNUdQw3ggPA9cvqv98BfU8TAABAoCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFvg1NL377rsaNmyYnE6nbDab3njjDY/thmEoMzNTTqdTjRs3VlJSkvbt2+dRU15erokTJ6pVq1Zq2rSpUlNTdezYMY+akpISpaeny263y263Kz09XadPn/aoOXr0qIYNG6amTZuqVatWmjRpkioqKmrjtAEAQB3k19B09uxZdenSRdnZ2RfdPn/+fC1cuFDZ2dnavXu3HA6HBg4cqDNnzpg1GRkZWr9+vdatW6etW7eqrKxMKSkpqqqqMmvS0tKUn58vl8sll8ul/Px8paenm9urqqo0dOhQnT17Vlu3btW6dev02muvacqUKbV38gAAoE4J9ueLDxkyREOGDLnoNsMwtHjxYs2cOVMjRoyQJK1atUqRkZFau3atxo0bJ7fbrRUrVuiVV17RgAEDJEmrV69WVFSUNm3apEGDBmn//v1yuVzasWOHevToIUlavny5EhISdODAAXXs2FEbN27UJ598ooKCAjmdTknSggUL9MADD2jOnDlq1qzZNbgaAAAgkAXsPU2HDx9WUVGRkpOTzbHQ0FAlJiZq27ZtkqS8vDxVVlZ61DidTsXGxpo127dvl91uNwOTJPXs2VN2u92jJjY21gxMkjRo0CCVl5crLy/vkj2Wl5ertLTUYwEAAPVTwIamoqIiSVJkZKTHeGRkpLmtqKhIDRs2VIsWLS5bExERUeP4ERERHjUXvk6LFi3UsGFDs+Zi5s6da94nZbfbFRUV5eVZAgCAuiJgQ9P3bDabx7phGDXGLnRhzcXqr6bmQjNmzJDb7TaXgoKCy/YFAADqroANTQ6HQ5JqzPQUFxebs0IOh0MVFRUqKSm5bM2JEydqHP/rr7/2qLnwdUpKSlRZWVljBuqHQkND1axZM48FAADUTwEbmjp06CCHw6GcnBxzrKKiQrm5uerVq5ckKT4+XiEhIR41hYWF2rt3r1mTkJAgt9utXbt2mTU7d+6U2+32qNm7d68KCwvNmo0bNyo0NFTx8fG1ep4AAKBu8Ou358rKynTo0CFz/fDhw8rPz1d4eLjatm2rjIwMZWVlKTo6WtHR0crKylKTJk2UlpYmSbLb7Ro9erSmTJmili1bKjw8XFOnTlVcXJz5bbrOnTtr8ODBGjNmjJYuXSpJGjt2rFJSUtSxY0dJUnJysmJiYpSenq5nn31Wp06d0tSpUzVmzBhmjwAAgCQ/h6b3339f/fr1M9cnT54sSRo1apRWrlypadOm6dy5cxo/frxKSkrUo0cPbdy4UWFhYeY+ixYtUnBwsEaOHKlz586pf//+WrlypYKCgsyaNWvWaNKkSea37FJ
TUz2eDRUUFKQNGzZo/Pjx6t27txo3bqy0tDQ999xztX0JAABAHWEzDMPwdxP1RWlpqex2u9xuNzNUdUz76Ruuar8j84b69BgAgGvP6r/fAXtPEwAAQCAhNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWBPu7AeDHaD99w1Xtd2TeUB93AgCo75hpAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAv49hwQYPhGIAAEJmaaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAU+CU2nT5/2xWEAAAAClteh6ZlnntHf/vY3c33kyJFq2bKlbrzxRn344Yc+bQ4AACBQeB2ali5dqqioKElSTk6OcnJy9Pbbb2vIkCF69NFHfd4gAABAIPA6NBUWFpqh6Z///KdGjhyp5ORkTZs2Tbt37/Zpc+fPn9fjjz+uDh06qHHjxrr55ps1e/ZsVVdXmzWGYSgzM1NOp1ONGzdWUlKS9u3b53Gc8vJyTZw4Ua1atVLTpk2VmpqqY8eOedSUlJQoPT1ddrtddrtd6enpfOwIAABMXoemFi1aqKCgQJLkcrk0YMAASd+Fl6qqKp8298wzz+jFF19Udna29u/fr/nz5+vZZ5/VkiVLzJr58+dr4cKFys7O1u7du+VwODRw4ECdOXPGrMnIyND69eu1bt06bd26VWVlZUpJSfHoNy0tTfn5+XK5XHK5XMrPz1d6erpPzwcAANRdXv/23IgRI5SWlqbo6GidPHlSQ4YMkSTl5+fr1ltv9Wlz27dv1913362hQ7/7Ta327dvr1Vdf1fvvvy/pu6C2ePFizZw5UyNGjJAkrVq1SpGRkVq7dq3GjRsnt9utFStW6JVXXjED3urVqxUVFaVNmzZp0KBB2r9/v1wul3bs2KEePXpIkpYvX66EhAQdOHBAHTt2vGh/5eXlKi8vN9dLS0t9ev4AACBweD3TtGjRIj300EOKiYlRTk6ObrjhBknffWw3fvx4nzbXp08fvfPOOzp48KAk6cMPP9TWrVv185//XJJ0+PBhFRUVKTk52dwnNDRUiYmJ2rZtmyQpLy9PlZWVHjVOp1OxsbFmzfbt22W3283AJEk9e/aU3W43ay5m7ty55sd5drvd/NgSAADUP17PNIWEhGjq1Kk1xjMyMnzRj4fHHntMbrdbnTp1UlBQkKqqqjRnzhzdf//9kqSioiJJUmRkpMd+kZGR+vLLL82ahg0bqkWLFjVqvt+/qKhIERERNV4/IiLCrLmYGTNmaPLkyeZ6aWkpwQkAgHrKUmh68803LR8wNTX1qpu50N/+9jetXr1aa9eu1e233678/HxlZGTI6XRq1KhRZp3NZvPYzzCMGmMXurDmYvVXOk5oaKhCQ0Otng4AAKjDLIWm4cOHe6zbbDYZhuGx/j1f3gz+6KOPavr06brvvvskSXFxcfryyy81d+5cjRo1Sg6HQ9J3M0Vt2rQx9ysuLjZnnxwOhyoqKlRSUuIx21RcXKxevXqZNSdOnKjx+l9//XWNWSwAAHB9snRPU3V1tbls3LhRXbt21dtvv63Tp0/L7Xbrrbfe0p133imXy+XT5r755hs1aODZYlBQkPnIgQ4dOsjhcCgnJ8fcXlFRodzcXDMQxcfHKyQkxKOmsLBQe/fuNWsSEhLkdru1a9cus2bnzp1yu91mDQAAuL55fU9TRkaGXnzxRfXp08ccGzRokJo0aaKxY8dq//79Pmtu2LBhmjNnjtq2bavbb79dH3zwgRYuXKgHH3xQ0nczXBk
ZGcrKylJ0dLSio6OVlZWlJk2aKC0tTZJkt9s1evRoTZkyRS1btlR4eLimTp2quLg489t0nTt31uDBgzVmzBgtXbpUkjR27FilpKRc8ptzAADg+uJ1aPr8889lt9trjNvtdh05csQXPZmWLFmiWbNmafz48SouLpbT6dS4ceP0xz/+0ayZNm2azp07p/Hjx6ukpEQ9evTQxo0bFRYWZtYsWrRIwcHBGjlypM6dO6f+/ftr5cqVCgoKMmvWrFmjSZMmmd+yS01NVXZ2tk/PBwAA1F0244c3J1nQt29fhYSEaPXq1eZ9REVFRUpPTzc/GrtelZaWym63y+12q1mzZv5u57rQfvqGq9rvyLyhPj9OIPUCALDO6r/fXj+nacWKFSouLla7du1066236tZbb1Xbtm1VWFioFStW/KimAQAAApXXH89FR0frww8/1KZNm/Tpp5/KMAzFxMRowIABV/yaPwAAQF3lVWg6f/68GjVqpPz8fCUnJ3s8ZRsAAKA+8+rjueDgYLVr187nP8wLAAAQ6Ly+p+nxxx/XjBkzdOrUqdroBwAAICB5fU/Tn//8Zx06dEhOp1Pt2rVT06ZNPbbv2bPHZ80BAAAECq9D04U/qQIAAHA98Do0PfHEE7XRBwAAQEDzOjR9Ly8vT/v375fNZlNMTIy6devmy74AAAACitehqbi4WPfdd5+2bNmi5s2byzAMud1u9evXT+vWrVPr1q1ro08AAAC/8vrbcxMnTlRpaan27dunU6dOqaSkRHv37lVpaakmTZpUGz0CAAD4ndczTS6XS5s2bVLnzp3NsZiYGP3lL3/hYZcAAKDe8nqmqbq6WiEhITXGQ0JCVF1d7ZOmAAAAAo3XoelnP/uZHn74YX311Vfm2PHjx/XII4+of//+Pm0OAAAgUHgdmrKzs3XmzBm1b99et9xyi2699VZ16NBBZ86c0ZIlS2qjRwAAAL/z+p6mqKgo7dmzRzk5Ofr0009lGIZiYmI0YMCA2ugPAAAgIHgdmr755hs1adJEAwcO1MCBA2ujJwAAgIDjdWhq3ry5unfvrqSkJCUlJal37941fn8OAACgvvH6nqbc3FylpqZqz549+uUvf6kWLVqoZ8+emj59ut5+++3a6BEAAMDvvA5NCQkJmj59ulwul0pKSvTuu++qU6dOWrBggVJSUmqjRwAAAL+7qt+e+/TTT7Vlyxbl5uZqy5Ytqqys1LBhw5SYmOjr/gAAAAKC16HJ4XCosrJSP/vZz5SUlKQ//OEPiouLq43eAAAAAobXH885HA6VlZXp6NGjOnr0qI4dO6aysrLa6A0AACBgeB2a8vPzdeLECc2cOVPnz5/XrFmz1Lp1a/Xo0UPTp0+vjR4BAAD87qruaWrevLlSU1PVp08f9e7dW3//+9+1du1avf/++5o3b56vewQAAPA7r0PT+vXrtWXLFm3ZskX79u1Ty5Yt9dOf/lSLFi1Sv379aqNHAAAAv/M6NI0bN059+/bVmDFjlJSUpNjY2NroCwAAIKB4HZqKi4trow8AAICA5vWN4AAAANcjQhMAAIAFhCYAAAALLIWmjz76SNXV1bXdCwAAQMCyFJq6deum//73v5Kkm2++WSdPnqzVpgAAAAKNpdDUvHlzHT58WJJ05MgRZp0AAMB1x9IjB37xi18oMTFRbdq0kc1mU/fu3RUUFHTR2i+++MKnDQIAAAQCS6Fp2bJlGjFihA4dOqRJkyZpzJgxCgsLq+3eAAAAAoblh1sOHjxYkpSXl6eHH36Y0AQAAK4rXj8R/OWXXzb/fOzYMdlsNt14440+bQoAACDQeP2cpurqas2ePVt2u13t2rVT27Zt1bx5cz311FPcIA4AAOotr2eaZs6cqRUrVmjevHnq3bu3DMPQe++9p8zMTH377beaM2dObfQJAADgV16HplWrVumll15SamqqOdalSxfdeOONGj9+PKEJAADUS15/PHfq1Cl16tSpxninTp106tQpnzQFAAAQaLw
OTV26dFF2dnaN8ezsbHXp0sUnTQEAAAQarz+emz9/voYOHapNmzYpISFBNptN27ZtU0FBgd56663a6BEAAMDvvJ5pSkxM1MGDB3XPPffo9OnTOnXqlEaMGKEDBw7opz/9aW30CAAA4HdezzRJktPp5IZvAABwXfF6pgkAAOB6RGgCAACw4Ko+nruWjh8/rscee0xvv/22zp07p9tuu00rVqxQfHy8JMkwDD355JNatmyZSkpK1KNHD/3lL3/R7bffbh6jvLxcU6dO1auvvqpz586pf//+ev7553XTTTeZNSUlJZo0aZLefPNNSVJqaqqWLFmi5s2bX9PzBXyh/fQNV7XfkXlDfdwJANQfXs00GYahL7/8UufOnautfjyUlJSod+/eCgkJ0dtvv61PPvlECxYs8Agy8+fP18KFC5Wdna3du3fL4XBo4MCBOnPmjFmTkZGh9evXa926ddq6davKysqUkpKiqqoqsyYtLU35+flyuVxyuVzKz89Xenr6NTlPAAAQ+LyaaTIMQ9HR0dq3b5+io6NrqyfTM888o6ioKI8fCW7fvr1HP4sXL9bMmTM1YsQISd89sTwyMlJr167VuHHj5Ha7tWLFCr3yyisaMGCAJGn16tWKiorSpk2bNGjQIO3fv18ul0s7duxQjx49JEnLly9XQkKCDhw4oI4dO160v/LycpWXl5vrpaWlvr4EAAAgQHg109SgQQNFR0fr5MmTtdWPhzfffFPdu3fXr371K0VERKhbt25avny5uf3w4cMqKipScnKyORYaGqrExERt27ZNkpSXl6fKykqPGqfTqdjYWLNm+/btstvtZmCSpJ49e8put5s1FzN37lzZ7XZziYqK8tm5AwCAwOL1jeDz58/Xo48+qr1799ZGPx6++OILvfDCC4qOjta//vUv/f73v9ekSZP017/+VZJUVFQkSYqMjPTYLzIy0txWVFSkhg0bqkWLFpetiYiIqPH6ERERZs3FzJgxQ26321wKCgqu/mQBAEBA8/pG8N/85jf65ptv1KVLFzVs2FCNGzf22O7L35+rrq5W9+7dlZWVJUnq1q2b9u3bpxdeeEG//e1vzTqbzeaxn2EYNcYudGHNxeqvdJzQ0FCFhoZaOhcAAFC3eR2aFi9eXAttXFybNm0UExPjMda5c2e99tprkiSHwyHpu5miNm3amDXFxcXm7JPD4VBFRYVKSko8ZpuKi4vVq1cvs+bEiRM1Xv/rr7+uMYsFAACuT16HplGjRtVGHxfVu3dvHThwwGPs4MGDateunSSpQ4cOcjgcysnJUbdu3SRJFRUVys3N1TPPPCNJio+PV0hIiHJycjRy5EhJUmFhofbu3av58+dLkhISEuR2u7Vr1y795Cc/kSTt3LlTbrfbDFYAAOD6dlXPafr888/18ssv6/PPP9ef/vQnRUREyOVyKSoqyuP5SD/WI488ol69eikrK0sjR47Url27tGzZMi1btkzSdx+pZWRkKCsrS9HR0YqOjlZWVpaaNGmitLQ0SZLdbtfo0aM1ZcoUtWzZUuHh4Zo6dari4uLMb9N17txZgwcP1pgxY7R06VJJ0tixY5WSknLJb84BAIDri9c3gufm5iouLk47d+7U66+/rrKyMknSRx99pCeeeMKnzd11111av369Xn31VcXGxuqpp57S4sWL9etf/9qsmTZtmjIyMjR+/Hh1795dx48f18aNGxUWFmbWLFq0SMOHD9fIkSPVu3dvNWnSRP/4xz8UFBRk1qxZs0ZxcXFKTk5WcnKy7rjjDr3yyis+PR8AAFB3eT3TNH36dD399NOaPHmyRzDp16+f/vSnP/m0OUlKSUlRSkrKJbfbbDZlZmYqMzPzkjWNGjXSkiVLtGTJkkvWhIeHa/Xq1T+mVQAAUI95PdP08ccf65577qkx3rp162v2/CYAAIBrzevQ1Lx5cxUWFtYY/+CDD3TjjTf6pCkAAIBA43VoSktL02OPPaaioiLZbDZVV1frvffe09S
pUz2enQQAAFCfeB2a5syZo7Zt2+rGG29UWVmZYmJi1LdvX/Xq1UuPP/54bfQIAADgd17fCB4SEqI1a9Zo9uzZ+uCDD1RdXa1u3bpdkx/wBQAA8Jerek6TJN1yyy26+eabJV38J0gAAADqE68/npOkFStWKDY2Vo0aNVKjRo0UGxurl156yde9AQAABAyvZ5pmzZqlRYsWaeLEiUpISJAkbd++XY888oiOHDmip59+2udNAgAA+JvXoemFF17Q8uXLdf/995tjqampuuOOOzRx4kRCEwAAqJe8/niuqqpK3bt3rzEeHx+v8+fP+6QpAACAQON1aPrNb36jF154ocb4smXLPH4TDgAAoD6x9PHc5MmTzT/bbDa99NJL2rhxo3r27ClJ2rFjhwoKCni4JQAAqLcshaYPPvjAYz0+Pl6S9Pnnn0v67nfnWrdurX379vm4PQAAgMBgKTRt3ry5tvsAAAAIaFf1nCYAAIDrjdePHPj222+1ZMkSbd68WcXFxaqurvbYvmfPHp81BwAAECi8Dk0PPvigcnJy9Mtf/lI/+clP+AkVAABwXfA6NG3YsEFvvfWWevfuXRv9AAAABCSv72m68cYbFRYWVhu9AAAABCyvQ9OCBQv02GOP6csvv6yNfgAAAAKS1x/Pde/eXd9++61uvvlmNWnSRCEhIR7bT5065bPmAAAAAoXXoen+++/X8ePHlZWVpcjISG4EBwAA1wWvQ9O2bdu0fft2denSpTb6AQAACEhe39PUqVMnnTt3rjZ6AQAACFheh6Z58+ZpypQp2rJli06ePKnS0lKPBQAAoD7y+uO5wYMHS5L69+/vMW4Yhmw2m6qqqnzTGQAAQADxOjTx470AAOB65HVoSkxMrI0+AAAAAprXoendd9+97Pa+fftedTMAAACByuvQlJSUVGPsh89q4p4mAABQH3n97bmSkhKPpbi4WC6XS3fddZc2btxYGz0CAAD4ndczTXa7vcbYwIEDFRoaqkceeUR5eXk+aQwAACCQeD3TdCmtW7fWgQMHfHU4AACAgOL1TNNHH33ksW4YhgoLCzVv3jx+WgUAANRbXoemrl27ymazyTAMj/GePXvqf/7nf3zWGAAAQCDxOjQdPnzYY71BgwZq3bq1GjVq5LOmAAAAAo3Xoaldu3a10QcAAEBA8zo0SdI777yjd955R8XFxaqurvbYxkd0AACgPvI6ND355JOaPXu2unfvrjZt2ng82BIAAKC+8jo0vfjii1q5cqXS09Nrox8AAaL99A1Xtd+ReUN93AkABAavQ1NFRYV69epVG73gOsM/ygCAusTrh1v+7ne/09q1a2ujFwAAgIDl9UzTt99+q2XLlmnTpk264447FBIS4rF94cKFPmsOAAAgUFzVE8G7du0qSdq7d6/HNm4KBwAA9ZXXoWnz5s210QcAAEBA89kP9gIAANRnhCYAAAAL6lRomjt3rmw2mzIyMswxwzCUmZkpp9Opxo0bKykpSfv27fPYr7y8XBMnTlSrVq3UtGlTpaam6tixYx41JSUlSk9Pl91ul91uV3p6uk6fPn0NzgoAANQFdSY07d69W8uWLdMdd9zhMT5//nwtXLhQ2dnZ2r17txwOhwYOHKgzZ86YNRkZGVq/fr3WrVunrVu3qqysTCkpKaqqqjJr0tLSlJ+fL5fLJZfLpfz8fB7gCQAATHUiNJWVlenXv/61li9frhYtWpjjhmFo8eLFmjlzpkaMGKHY2FitWrVK33zzjfksKbfbrRUrVmjBggUaMGCAunXrptWrV+vjjz/Wpk2bJEn79++Xy+XSSy+9pISEBCUkJGj58uX65z//qQMHDvjlnAEAQGCpE6FpwoQJGjp0qAYMGOAxfvjwYRUVFSk5OdkcCw0NVWJiorZt2yZJysvLU2VlpUeN0+lUbGysWbN9+3bZ7Xb16NHDrOnZs6fsdrtZczHl5eUqLS31WAAAQP3k9SMHrrV169Zpz5492r17d41tRUVFkqTIyEi
P8cjISH355ZdmTcOGDT1mqL6v+X7/oqIiRURE1Dh+RESEWXMxc+fO1ZNPPundCQEAgDopoGeaCgoK9PDDD2v16tVq1KjRJesufKimYRhXfNDmhTUXq7/ScWbMmCG3220uBQUFl31NAABQdwV0aMrLy1NxcbHi4+MVHBys4OBg5ebm6s9//rOCg4PNGaYLZ4OKi4vNbQ6HQxUVFSopKblszYkTJ2q8/tdff11jFuuHQkND1axZM48FAADUTwEdmvr376+PP/5Y+fn55tK9e3f9+te/Vn5+vm6++WY5HA7l5OSY+1RUVCg3N1e9evWSJMXHxyskJMSjprCwUHv37jVrEhIS5Ha7tWvXLrNm586dcrvdZg0AALi+BfQ9TWFhYYqNjfUYa9q0qVq2bGmOZ2RkKCsrS9HR0YqOjlZWVpaaNGmitLQ0SZLdbtfo0aM1ZcoUtWzZUuHh4Zo6dari4uLMG8s7d+6swYMHa8yYMVq6dKkkaezYsUpJSVHHjh2v4RkDAIBAFdChyYpp06bp3LlzGj9+vEpKStSjRw9t3LhRYWFhZs2iRYsUHByskSNH6ty5c+rfv79WrlypoKAgs2bNmjWaNGmS+S271NRUZWdnX/PzAQAAganOhaYtW7Z4rNtsNmVmZiozM/OS+zRq1EhLlizRkiVLLlkTHh6u1atX+6hLAABQ3wT0PU0AAACBgtAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgQbC/GwBQv7WfvuGq9jsyb6iPOwGAH4eZJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsCCgQ9PcuXN11113KSwsTBERERo+fLgOHDjgUWMYhjIzM+V0OtW4cWMlJSVp3759HjXl5eWaOHGiWrVqpaZNmyo1NVXHjh3zqCkpKVF6errsdrvsdrvS09N1+vTp2j5FAABQRwR0aMrNzdWECRO0Y8cO5eTk6Pz580pOTtbZs2fNmvnz52vhwoXKzs7W7t275XA4NHDgQJ05c8asycjI0Pr167Vu3Tpt3bpVZWVlSklJUVVVlVmTlpam/Px8uVwuuVwu5efnKz09/ZqeLwAACFzB/m7gclwul8f6yy+/rIiICOXl5alv374yDEOLFy/WzJkzNWLECEnSqlWrFBkZqbVr12rcuHFyu91asWKFXnnlFQ0YMECStHr1akVFRWnTpk0aNGiQ9u/fL5fLpR07dqhHjx6SpOXLlyshIUEHDhxQx44dr+2JAwCAgBPQM00XcrvdkqTw8HBJ0uHDh1VUVKTk5GSzJjQ0VImJidq2bZskKS8vT5WVlR41TqdTsbGxZs327dtlt9vNwCRJPXv2lN1uN2supry8XKWlpR4LAACon+pMaDIMQ5MnT1afPn0UGxsrSSoqKpIkRUZGetRGRkaa24qKitSwYUO1aNHisjURERE1XjMiIsKsuZi5c+ea90DZ7XZFRUVd/QkCAICAVmdC00MPPaSPPvpIr776ao1tNpvNY90wjBpjF7qw5mL1VzrOjBkz5Ha7zaWgoOBKpwEAAOqoOhGaJk6cqDfffFObN2/WTTfdZI47HA5JqjEbVFxcbM4+ORwOVVRUqKSk5LI1J06cqPG6X3/9dY1ZrB8KDQ1Vs2bNPBYAAFA/BXRoMgxDDz30kF5//XX9+9//VocOHTy2d+jQQQ6HQzk5OeZYRUWFcnNz1atXL0lSfHy
8QkJCPGoKCwu1d+9esyYhIUFut1u7du0ya3bu3Cm3223WAACA61tAf3tuwoQJWrt2rf7+978rLCzMnFGy2+1q3LixbDabMjIylJWVpejoaEVHRysrK0tNmjRRWlqaWTt69GhNmTJFLVu2VHh4uKZOnaq4uDjz23SdO3fW4MGDNWbMGC1dulSSNHbsWKWkpPDNOQAAICnAQ9MLL7wgSUpKSvIYf/nll/XAAw9IkqZNm6Zz585p/PjxKikpUY8ePbRx40aFhYWZ9YsWLVJwcLBGjhypc+fOqX///lq5cqWCgoLMmjVr1mjSpEnmt+xSU1OVnZ1duycIAADqjIAOTYZhXLHGZrMpMzNTmZmZl6xp1KiRlixZoiVLllyyJjw8XKtXr76aNq877advuKr9jswb6uNOAAC4dgI6NAGARFAHEBgC+kZwAACAQEFoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABcH+bgAArpX20zdc1X5H5g31cScA6iJmmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAAC4L93QAA1CXtp2+4qv2OzBvq404AXGvMNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAW8Jym6wzPmAECA38XgbqHmSYAAAALCE0XeP7559WhQwc1atRI8fHx+s9//uPvlgAAQADg47kf+Nvf/qaMjAw9//zz6t27t5YuXaohQ4bok08+Udu2bf3dHgB44CM+4NpipukHFi5cqNGjR+t3v/udOnfurMWLFysqKkovvPCCv1sDAAB+xkzT/6uoqFBeXp6mT5/uMZ6cnKxt27ZddJ/y8nKVl5eb6263W5JUWlpae43+SNXl31zVfj88J18cg14Cv5f6dj70culeYp/411UdZ++Tg3x6DF8eB/DG938nDMO4fKEBwzAM4/jx44Yk47333vMYnzNnjnHbbbdddJ8nnnjCkMTCwsLCwsJSD5aCgoLLZgVmmi5gs9k81g3DqDH2vRkzZmjy5MnmenV1tU6dOqWWLVtecp/6rLS0VFFRUSooKFCzZs383U69w/WtPVzb2sO1rV1cX98wDENnzpyR0+m8bB2h6f+1atVKQUFBKioq8hgvLi5WZGTkRfcJDQ1VaGiox1jz5s1rq8U6o1mzZvzlrUVc39rDta09XNvaxfX98ex2+xVruBH8/zVs2FDx8fHKycnxGM/JyVGvXr381BUAAAgUzDT9wOTJk5Wenq7u3bsrISFBy5Yt09GjR/X73//e360BAAA/IzT9wL333quTJ09q9uzZKiwsVGxsrN566y21a9fO363VCaGhoXriiSdqfGQJ3+D61h6ube3h2tYuru+1ZTOMK32/DgAAANzTBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITfjRMjMzZbPZPBaHw+Hvtuqkd999V8OGDZPT6ZTNZtMbb7zhsd0wDGVmZsrpdKpx48ZKSkrSvn37/NNsHXSl6/vAAw/UeC/37NnTP83WIXPnztVdd92lsLAwRUREaPjw4Tpw4IBHDe/dq2fl+vLevTYITfCJ22+/XYWFheby8ccf+7ulOuns2bPq0qWLsrOzL7p9/vz5WrhwobKzs7V79245HA4NHDhQZ86cucad1k1Xur6SNHjwYI/38ltvvXUNO6ybcnNzNWHCBO3YsUM5OTk6f/68kpOTdfbsWbOG9+7Vs3J9Jd6714QPfusW17knnnjC6NKli7/bqHckGevXrzfXq6urDYfDYcybN88c+/bbbw273W68+OKLfui
wbrvw+hqGYYwaNcq4++67/dJPfVJcXGxIMnJzcw3D4L3raxdeX8PgvXutMNMEn/jss8/kdDrVoUMH3Xffffriiy/83VK9c/jwYRUVFSk5OdkcCw0NVWJiorZt2+bHzuqXLVu2KCIiQrfddpvGjBmj4uJif7dU57jdbklSeHi4JN67vnbh9f0e793aR2jCj9ajRw/99a9/1b/+9S8tX75cRUVF6tWrl06ePOnv1uqV739M+sIfkI6MjKzxQ9O4OkOGDNGaNWv073//WwsWLNDu3bv1s5/9TOXl5f5urc4wDEOTJ09Wnz59FBsbK4n3ri9d7PpKvHevFX5GBT/akCFDzD/HxcUpISFBt9xyi1atWqXJkyf7sbP6yWazeawbhlFjDFfn3nvvNf8cGxur7t27q127dtqwYYNGjBjhx87qjoceekgfffSRtm7dWmMb790f71LXl/futcFME3yuadOmiouL02effebvVuqV77+ReOH/mRcXF9f4P3j4Rps2bdSuXTveyxZNnDhRb775pjZv3qybbrrJHOe96xuXur4Xw3u3dhCa4HPl5eXav3+/2rRp4+9W6pUOHTrI4XAoJyfHHKuoqFBubq569erlx87qr5MnT6qgoID38hUYhqGHHnpIr7/+uv7973+rQ4cOHtt57/44V7q+F8N7t3bw8Rx+tKlTp2rYsGFq27atiouL9fTTT6u0tFSjRo3yd2t1TllZmQ4dOmSuHz58WPn5+QoPD1fbtm2VkZGhrKwsRUdHKzo6WllZWWrSpInS0tL82HXdcbnrGx4erszMTP3iF79QmzZtdOTIEf3hD39Qq1atdM899/ix68A3YcIErV27Vn//+98VFhZmzijZ7XY1btxYNpuN9+6PcKXrW1ZWxnv3WvHnV/dQP9x7771GmzZtjJCQEMPpdBojRoww9u3b5++26qTNmzcbkmoso0aNMgzju69uP/HEE4bD4TBCQ0ONvn37Gh9//LF/m65DLnd9v/nmGyM5Odlo3bq1ERISYrRt29YYNWqUcfToUX+3HfAudk0lGS+//LJZw3v36l3p+vLevXZshmEY1zKkAQAA1EXc0wQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAE4Lq2cuVKNW/e/JLbjxw5IpvNpvz8/GvW0+U88MADGj58uL/bAK5LhCYACECBFtYAEJoAXCcqKir83QKAOo7QBMDv/vGPf6h58+aqrq6WJOXn58tms+nRRx81a8aNG6f777/fXH/ttdd0++23KzQ0VO3bt9eCBQs8jtm+fXs9/fTTeuCBB2S32zVmzBhJ330c17ZtWzVp0kT33HOPTp486XW/n3zyiX7+85/rhhtuUGRkpNLT0/Xf//7X3J6UlKRJkyZp2rRpCg8Pl8PhUGZmpscxPv30U/Xp00eNGjVSTEyMNm3aJJvNpjfeeEOS1KFDB0lSt27dZLPZlJSU5LH/c889pzZt2qhly5aaMGGCKisrvT4PAN4hNAHwu759++rMmTP64IMPJEm5ublq1aqVcnNzzZotW7YoMTFRkpSXl6eRI0fqvvvu08cff6zMzEzNmjVLK1eu9Djus88+q9jYWOXl5WnWrFnauXOnHnzwQY0fP175+fnq16+fnn76aa96LSwsVGJiorp27ar3339fLpdLJ06c0MiRIz3qVq1apaZNm2rnzp2aP3++Zs+erZycHElSdXW1hg8friZNmmjnzp1atmyZZs6c6bH/rl27JEmbNm1SYWGhXn/9dXPb5s2b9fnnn2vz5s1atWqVVq5cWePcAdQCAwACwJ133mk899xzhmEYxvDhw405c+YYDRs2NEpLS43CwkJDkrF//37DMAwjLS3NGDhwoMf+jz76qBETE2Out2vXzhg+fLhHzf33328MHjzYY+zee+817Hb7Jfs6fPiwIcn44IMPDMMwjFmzZhnJyckeNQUFBYYk48CBA4ZhGEZiYqLRp08
fj5q77rrLeOyxxwzDMIy3337bCA4ONgoLC83tOTk5hiRj/fr1F33d740aNcpo166dcf78eXPsV7/6lXHvvfde8hwA+AYzTQACQlJSkrZs2SLDMPSf//xHd999t2JjY7V161Zt3rxZkZGR6tSpkyRp//796t27t8f+vXv31meffaaqqipzrHv37h41+/fvV0JCgsfYhetXkpeXp82bN+uGG24wl+/7+vzzz826O+64w2O/Nm3aqLi4WJJ04MABRUVFyeFwmNt/8pOfWO7h9ttvV1BQ0EWPDaD2BPu7AQCQvgtNK1as0IcffqgGDRooJiZGiYmJys3NVUlJifnRnCQZhiGbzeaxv2EYNY7ZtGnTK9Z4q7q6WsOGDdMzzzxTY1ubNm3MP4eEhHhss9ls5j1bF+vfG5c7NoDaQ2gCEBC+v69p8eLFSkxMlM1mU2JioubOnauSkhI9/PDDZm1MTIy2bt3qsf+2bdt02223eczAXCgmJkY7duzwGLtw/UruvPNOvfbaa2rfvr2Cg6/uP6GdOnXS0aNHdeLECUVGRkqSdu/e7VHTsGFDSfKYOQPgX3w8ByAg2O12de3aVatXrza/Kda3b1/t2bNHBw8e9Pj22JQpU/TOO+/oqaee0sGDB7Vq1SplZ2dr6tSpl32NSZMmyeVyaf78+Tp48KCys7Plcrm86nPChAk6deqU7r//fu3atUtffPGFNm7cqAcffNBywBk4cKBuueUWjRo1Sh999JHee+8980bw72egIiIi1LhxY/NGc7fb7VWfAHyP0AQgYPTr109VVVVmQGrRooViYmLUunVrde7c2ay788479b//+79at26dYmNj9cc//lGzZ8/WAw88cNnj9+zZUy+99JKWLFmirl27auPGjXr88ce96tHpdOq9995TVVWVBg0apNjYWD388MOy2+1q0MDaf1KDgoL0xhtvqKysTHfddZd+97vfmX00atRIkhQcHKw///nPWrp0qZxOp+6++26v+gTgezbDFx/yAwB+lPfee099+vTRoUOHdMstt/i7HQAXQWgCAD9Yv369brjhBkVHR+vQoUN6+OGH1aJFixr3agEIHNwIDgB+cObMGU2bNk0FBQVq1aqVBgwYUOOp5gACCzNNAAAAFnAjOAAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMCC/wOhpaaD6+JNigAAAABJRU5ErkJggg==", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "counts = Counter(map(len, words)) # Counter of word lengths\n", "plt.bar(list(counts.keys()), list(counts.values()))\n", "plt.xlabel('word length'); plt.ylabel('number of words');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: What is the longest word?** " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'antidisestablishmentarianism'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(W, key=len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: What is the distribution of letters in the Wordset?**" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('e', 68038),\n", " ('s', 60080),\n", " ('i', 53340),\n", " ('a', 43177),\n", " ('n', 42145),\n", " ('r', 41794),\n", " ('t', 38093),\n", " ('o', 35027),\n", " ('l', 32356),\n", " ('c', 23100),\n", " ('d', 22448),\n", " ('u', 19898),\n", " ('g', 17815),\n", " ('p', 16128),\n", " ('m', 16062),\n", " ('h', 12673),\n", " ('y', 11889),\n", " ('b', 11581),\n", " ('f', 7885),\n", " ('v', 5982),\n", " ('k', 4892),\n", " ('w', 4880),\n", " ('z', 2703),\n", " ('x', 1677),\n", " ('j', 1076),\n", " ('q', 1066)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(L for w in words for L in w).most_common() # Counter of letters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: How many bridges are there?** " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "70300" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make a list of all bridges, B\n", "B = sorted(W.bridges[suf][pre] for suf in W.bridges for pre in W.bridges[suf])\n", 
"len(B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ones with the most excess:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Bridge(excess=5, steps=[Step(overlap=1, word='yeti'), Step(overlap=1, word='iraq')]),\n", " Bridge(excess=5, steps=[Step(overlap=1, word='zircon'), Step(overlap=3, word='conj')]),\n", " Bridge(excess=5, steps=[Step(overlap=1, word='zuni'), Step(overlap=1, word='iraq')]),\n", " Bridge(excess=6, steps=[Step(overlap=1, word='blini'), Step(overlap=1, word='iraq')]),\n", " Bridge(excess=6, steps=[Step(overlap=1, word='gorki'), Step(overlap=1, word='iraq')]),\n", " Bridge(excess=6, steps=[Step(overlap=1, word='jinni'), Step(overlap=1, word='iraq')]),\n", " Bridge(excess=6, steps=[Step(overlap=1, word='okapi'), Step(overlap=1, word='iraq')]),\n", " Bridge(excess=6, steps=[Step(overlap=1, word='vedic'), Step(overlap=1, word='conj')]),\n", " Bridge(excess=6, steps=[Step(overlap=1, word='xenic'), Step(overlap=1, word='conj')]),\n", " Bridge(excess=8, steps=[Step(overlap=1, word='xenic'), Step(overlap=1, word='colloq')])]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B[-10:] # Sample every 2000th bridge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every 4000th bridge:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Bridge(excess=0, steps=[Step(overlap=1, word='aahed')]),\n", " Bridge(excess=0, steps=[Step(overlap=1, word='groin')]),\n", " Bridge(excess=0, steps=[Step(overlap=1, word='quiver')]),\n", " Bridge(excess=0, steps=[Step(overlap=2, word='avast')]),\n", " Bridge(excess=0, steps=[Step(overlap=2, word='hover')]),\n", " Bridge(excess=0, steps=[Step(overlap=2, word='rucks')]),\n", " Bridge(excess=0, steps=[Step(overlap=3, word='bored')]),\n", " Bridge(excess=0, steps=[Step(overlap=3, word='keltic')]),\n", " Bridge(excess=0, 
steps=[Step(overlap=3, word='spats')]),\n", " Bridge(excess=0, steps=[Step(overlap=4, word='earns')]),\n", " Bridge(excess=0, steps=[Step(overlap=4, word='sable')]),\n", " Bridge(excess=1, steps=[Step(overlap=1, word='bitsy')]),\n", " Bridge(excess=1, steps=[Step(overlap=1, word='nags')]),\n", " Bridge(excess=1, steps=[Step(overlap=2, word='cairo')]),\n", " Bridge(excess=1, steps=[Step(overlap=2, word='riven')]),\n", " Bridge(excess=1, steps=[Step(overlap=3, word='howdah')]),\n", " Bridge(excess=1, steps=[Step(overlap=4, word='kazoos')]),\n", " Bridge(excess=2, steps=[Step(overlap=2, word='octopi')])]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B[::4000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: How many excess letters do the bridges have?** " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({0: 43411, 1: 21306, 2: 4869, 3: 634, 4: 52, 5: 21, 6: 6, 8: 1})" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Counter of bridge excess letters\n", "BC = Counter(b.excess for b in B)\n", "BC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the bridges have 0 or 1 excess letter, so we're doing pretty well. 
The mean is under 1/2 letter:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.47372688477951636" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from statistics import mean\n", "\n", "mean(BC.elements())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: How many 1-step and 2-step bridges are there?**" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({1: 70155, 2: 145})" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(len(b.steps) for b in B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are only 145 2-step bridges; we might as well see them all:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a-f: are + ref (excess 2)\n", "a-g: at + tag (excess 2)\n", "a-j: ash + hadj (excess 4)\n", "a-q: air + iraq (excess 3)\n", "a-v: at + tv (excess 1)\n", "b-c: be + etc (excess 2)\n", "b-j: bacon + conj (excess 4)\n", "b-q: blini + iraq (excess 6)\n", "b-v: bot + tv (excess 2)\n", "c-v: cut + tv (excess 2)\n", "d-f: do + of (excess 1)\n", "d-j: doc + conj (excess 4)\n", "d-q: deli + iraq (excess 5)\n", "d-r: do + or (excess 1)\n", "d-u: dry + you (excess 3)\n", "d-v: dot + tv (excess 2)\n", "d-z: do + oyez (excess 3)\n", "e-j: econ + conj (excess 3)\n", "e-p: era + rap (excess 2)\n", "e-q: emir + iraq (excess 4)\n", "e-v: eat + tv (excess 2)\n", "e-w: era + raw (excess 2)\n", "e-z: era + razz (excess 3)\n", "f-c: far + arc (excess 2)\n", "f-j: fish + hadj (excess 5)\n", "f-q: fir + iraq (excess 3)\n", "f-v: fat + tv (excess 2)\n", "g-c: go + orc (excess 2)\n", "g-f: go + of (excess 1)\n", "g-j: gush + hadj (excess 5)\n", "g-q: gorki + iraq (excess 6)\n", "g-r: go + or (excess 1)\n", "g-v: git + tv (excess 2)\n", "h-c: 
he + etc (excess 2)\n", "h-q: hair + iraq (excess 4)\n", "h-u: hem + emu (excess 2)\n", "h-v: hut + tv (excess 2)\n", "i-c: is + sac (excess 2)\n", "i-g: is + sag (excess 2)\n", "i-j: icon + conj (excess 3)\n", "i-o: is + so (excess 1)\n", "i-u: if + flu (excess 2)\n", "i-v: it + tv (excess 1)\n", "i-w: is + saw (excess 2)\n", "i-z: if + fez (excess 2)\n", "j-c: jet + etc (excess 2)\n", "j-q: jinni + iraq (excess 6)\n", "j-v: jet + tv (excess 2)\n", "k-c: kudo + doc (excess 3)\n", "k-j: kasha + hadj (excess 5)\n", "k-m: keel + elm (excess 3)\n", "k-q: kafir + iraq (excess 5)\n", "l-c: let + etc (excess 2)\n", "l-j: loco + conj (excess 4)\n", "l-q: lira + iraq (excess 3)\n", "l-v: lot + tv (excess 2)\n", "l-z: levi + viz (excess 3)\n", "m-j: mac + conj (excess 4)\n", "m-q: magi + iraq (excess 5)\n", "m-z: mech + chez (excess 4)\n", "n-f: no + of (excess 1)\n", "n-j: narco + conj (excess 5)\n", "n-q: noir + iraq (excess 4)\n", "n-v: net + tv (excess 2)\n", "o-i: of + fbi (excess 2)\n", "o-j: ooh + hadj (excess 4)\n", "o-p: of + fop (excess 2)\n", "o-q: okapi + iraq (excess 6)\n", "o-u: of + flu (excess 2)\n", "o-v: oft + tv (excess 2)\n", "o-w: one + new (excess 2)\n", "p-j: poco + conj (excess 4)\n", "p-q: pair + iraq (excess 4)\n", "p-v: pet + tv (excess 2)\n", "q-b: quem + mob (excess 4)\n", "q-j: qoph + hadj (excess 5)\n", "q-v: quit + tv (excess 3)\n", "q-w: quem + mow (excess 4)\n", "q-x: quem + mix (excess 4)\n", "r-j: recon + conj (excess 4)\n", "r-q: rani + iraq (excess 5)\n", "r-u: ref + flu (excess 3)\n", "r-v: rut + tv (excess 2)\n", "s-f: so + of (excess 1)\n", "s-j: shad + hadj (excess 3)\n", "s-q: sir + iraq (excess 3)\n", "t-f: to + of (excess 1)\n", "t-j: taco + conj (excess 4)\n", "t-q: tapir + iraq (excess 5)\n", "t-z: tv + viz (excess 2)\n", "u-b: us + sob (excess 2)\n", "u-f: ufo + of (excess 2)\n", "u-g: ufo + fog (excess 2)\n", "u-j: ugh + hadj (excess 4)\n", "u-q: ugli + iraq (excess 5)\n", "u-w: use + sew (excess 2)\n", "u-z: us + sitz 
(excess 3)\n", "v-c: vet + etc (excess 2)\n", "v-f: veto + of (excess 3)\n", "v-j: vedic + conj (excess 6)\n", "v-k: via + ark (excess 3)\n", "v-m: via + am (excess 2)\n", "v-q: vizir + iraq (excess 5)\n", "w-c: wet + etc (excess 2)\n", "w-j: with + hadj (excess 5)\n", "w-q: weir + iraq (excess 4)\n", "w-u: we + emu (excess 2)\n", "w-v: wit + tv (excess 2)\n", "x-b: xmas + sob (excess 4)\n", "x-f: xmas + massif (excess 5)\n", "x-g: xmas + sag (excess 4)\n", "x-h: xmas + mash (excess 3)\n", "x-i: xmas + ski (excess 4)\n", "x-j: xenic + conj (excess 6)\n", "x-k: xmas + mask (excess 3)\n", "x-l: xmas + sell (excess 5)\n", "x-o: xmas + so (excess 3)\n", "x-p: xmas + asp (excess 3)\n", "x-q: xenic + colloq (excess 8)\n", "x-t: xmas + mast (excess 3)\n", "x-u: xylem + emu (excess 4)\n", "x-v: xmas + shiv (excess 5)\n", "x-w: xmas + saw (excess 4)\n", "x-y: xmas + massy (excess 4)\n", "x-z: xmas + sitz (excess 5)\n", "y-b: yaw + web (excess 3)\n", "y-c: yet + etc (excess 2)\n", "y-f: yeti + if (excess 3)\n", "y-j: yeah + hadj (excess 5)\n", "y-o: yes + so (excess 2)\n", "y-q: yeti + iraq (excess 5)\n", "y-v: yet + tv (excess 2)\n", "y-x: yaw + wax (excess 3)\n", "y-z: yaw + whiz (excess 4)\n", "z-b: zip + pub (excess 3)\n", "z-d: zen + end (excess 2)\n", "z-f: zoo + of (excess 2)\n", "z-h: zoo + ooh (excess 2)\n", "z-j: zircon + conj (excess 5)\n", "z-k: zoo + oak (excess 3)\n", "z-q: zuni + iraq (excess 5)\n", "z-r: zoo + or (excess 2)\n", "z-v: zest + tv (excess 3)\n", "z-w: zip + paw (excess 3)\n", "z-x: zip + pox (excess 3)\n" ] } ], "source": [ "for A, B in letter_pairs:\n", " bridge = W.bridges[A][B]\n", " steps = bridge.steps\n", " if len(steps) == 2:\n", " print(f'{A}-{B}: {steps[0].word:6} + {steps[1].word:6} (excess {bridge.excess})')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question: What strange letter combinations are there?** Let's look at two-letter suffixes or prefixes that only appear in one or two nonsubwords. 
" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'pf': {'pfennigs'},\n", " 'ez': {'ezekiel'},\n", " 'fb': {'fbi'},\n", " 'mc': {'mcdonald'},\n", " 'll': {'llamas', 'llanos'},\n", " 'hd': {'hdqrs'},\n", " 'ik': {'ikebanas', 'ikons'},\n", " 'ek': {'ekistics'},\n", " 'ym': {'ymca'},\n", " 'jn': {'jnanas'},\n", " 'qa': {'qaids', 'qatar'},\n", " 'ay': {'ayahs', 'ayatollahs'},\n", " 'oj': {'ojibwas'},\n", " 'ee': {'eelgrasses', 'eelworm'},\n", " 'bw': {'bwanas'},\n", " 'xm': {'xmases'},\n", " 'gw': {'gweducks', 'gweducs'},\n", " 'yc': {'ycleped', 'yclept'},\n", " 'sf': {'sforzatos'},\n", " 'xi': {'xiphoids', 'xiphosuran'},\n", " 'dv': {'dvorak'},\n", " 'ct': {'ctrl'},\n", " 'fj': {'fjords'},\n", " 'gj': {'gjetosts'},\n", " 'dn': {'dnieper'},\n", " 'zw': {'zwiebacks'},\n", " 'iv': {'ivories', 'ivory'},\n", " 'qo': {'qophs'},\n", " 'ip': {'ipecacs'},\n", " 'if': {'iffiness'},\n", " 'tc': {'tchaikovsky'},\n", " 'ie': {'ieee'},\n", " 'kw': {'kwachas', 'kwashiorkor'},\n", " 'zl': {'zlotys'},\n", " 'wu': {'wurzel'},\n", " 'uf': {'ufos'},\n", " 'aj': {'ajar'}}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "startswith = multimap((w[:2], w) for w in words)\n", "\n", "{pre: startswith[pre] # Rare two-letter suffixes\n", " for pre in startswith if len(startswith[pre]) <= 2}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two-letter prefixes definitely include some strange words!" 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'su': {'shiatsu'},\n", " 'ko': {'gingko', 'stinko'},\n", " 'nx': {'bronx', 'meninx'},\n", " 'oz': {'kolkhoz'},\n", " 'oi': {'hanoi', 'polloi'},\n", " 'rb': {'cowherb'},\n", " 'ug': {'bedrug', 'sparkplug'},\n", " 'lm': {'stockholm', 'unhelm'},\n", " 'gn': {'champaign'},\n", " 'nc': {'dezinc', 'quidnunc'},\n", " 'ud': {'aloud', 'overproud'},\n", " 'xo': {'convexo'},\n", " 'nu': {'vishnu'},\n", " 'xe': {'deluxe', 'maxixe'},\n", " 'td': {'retd'},\n", " 'hn': {'mendelssohn'},\n", " 'ui': {'maqui', 'prosequi'},\n", " 'mt': {'daydreamt', 'undreamt'},\n", " 'bm': {'ibm', 'icbm'},\n", " 'nz': {'franz'},\n", " 'hr': {'kieselguhr'},\n", " 'zo': {'diazo', 'palazzo'},\n", " 'dn': {'haydn'},\n", " 'lu': {'honolulu'},\n", " 'yx': {'styx'},\n", " 'ku': {'haiku'},\n", " 'zm': {'transcendentalizm'},\n", " 'mb': {'clomb', 'whitecomb'},\n", " 'sr': {'ussr'},\n", " 'ou': {'thankyou'},\n", " 'ep': {'asleep', 'shlep'},\n", " 'tl': {'peyotl', 'shtetl'},\n", " 'uc': {'caoutchouc'},\n", " 'dt': {'rembrandt'},\n", " 'ru': {'nehru'},\n", " 'wa': {'kiowa', 'okinawa'},\n", " 'xs': {'duplexs'},\n", " 'ab': {'skylab'},\n", " 'fa': {'khalifa'},\n", " 'rf': {'waldorf', 'windsurf'},\n", " 'ho': {'groucho'},\n", " 'oe': {'monroe'},\n", " 'ln': {'lincoln'},\n", " 'cd': {'recd'},\n", " 'ua': {'joshua'},\n", " 'hm': {'microhm'},\n", " 'gm': {'apophthegm'},\n", " 'ao': {'chiao', 'ciao'},\n", " 'ec': {'filespec', 'quebec'},\n", " 'mp': {'prestamp'},\n", " 'aa': {'markkaa'},\n", " 'vt': {'govt'},\n", " 'eh': {'mikveh', 'yahweh'},\n", " 'ob': {'blowjob'},\n", " 'ef': {'unicef'},\n", " 'ji': {'fiji'},\n", " 'za': {'organza'},\n", " 'we': {'zimbabwe'},\n", " 'vo': {'concavo'},\n", " 'sz': {'grosz'},\n", " 'hu': {'buchu'},\n", " 'pa': {'tampa'},\n", " 'tu': {'impromptu'},\n", " 'ai': {'bonsai'},\n", " 'zt': {'liszt'},\n", " 'po': {'troppo'},\n", " 'ub': {'beelzebub'}}" ] }, "execution_count": 36, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "endswith = multimap((w[-2:], w) for w in words)\n", "\n", "{suf: endswith[suf] # Rare two-letter suffixes\n", " for suf in endswith if len(endswith[suf]) <= 2}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list of two-letter suffixes is mostly picking out proper names and pointing out flaws in the word list. For example, lots of words end in `ab`: blab, cab, crab, dab, gab, jab, lab, boabab, kebab, taxicab, backstab, etc. But must of them are subwords of plural forms; only `skylab` made it into the word list in singular form but not plural." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:base] *", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.9" } }, "nbformat": 4, "nbformat_minor": 4 }