Anigrams notebook

2022-09-30 11:57:26 -07:00 · 2022-09-30 11:57:26 -07:00 · 5a80aac81d
commit 5a80aac81d
parent 850d297630
1 changed files with 356 additions and 0 deletions
--- a/ipynb/Anigrams.ipynb
+++ b/ipynb/Anigrams.ipynb
@ -0,0 +1,356 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<div style=\"text-align: right\" align=\"right\"><i>Peter Norvig, Sept 2022</i></div>\n",
+    "\n",
+    "# Anigrams Puzzle\n",
+    "\n",
+    "From 538's [**The Riddler** on Sept 16, 2022](https://fivethirtyeight.com/features/can-you-build-the-biggest-anigram/) comes this puzzle:\n",
+    "    \n",
+    "> In the game of [**Anigrams**](http://anigrams.us), you **unscramble** successively larger, nested collections of letters to create a valid **chain** of six English **words** between four and nine letters in length. For example, a **chain** of five words (sadly, less than the six needed for a valid game of Anigrams) can be constructed using the following sequence, with each **term** after the first including **one additional letter** than the previous **term**:\n",
+    ">\n",
+    ">- DEIR (which unscramble to make the words DIRE, IRED or RIDE)\n",
+    ">- DEIR**D** (DRIED or REDID)\n",
+    ">- DEIRD**L** (DIRLED, DREIDL or RIDDLE)\n",
+    ">- DEIRDL**R** (RIDDLER)\n",
+    ">- DEIRDLR**S** (RIDDLERS)\n",
+    ">\n",
+    ">\n",
+    "> What is the **longest chain** of such nested anagrams you can create, **starting** with four letters?\n",
+    ">\n",
+    "> *Extra credit:* **How many possible games** of Anigrams are there? That is, how many valid sets are there of four initial letters, and then five more **letters added** one at a time in an ordered sequence, that result in a sequence of six valid anagrams? (Note: Swapping the order of the first four letters does not result in a distinct game.)\n",
+    "\n",
+    "I'll solve this puzzle with a Python program. First some generally useful imports:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import Set, List\n",
+    "import functools "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Basic Concepts\n",
+    "\n",
+    "The puzzle description mentions a bunch of concepts: **word**, **chain**, **term**, **starting**, **unscramble**,  **longest chain**, and **one additional letter**. To solve the puzzle, I'll carefully define each concept, in prose and in Python. Once I have a collection of concepts, I can see how they fit together to solve the puzzle. First the data types:\n",
+    "\n",
+    "- **Word**: an English language word, e.g. `'ride'`.\n",
+    "- **Chain**: a list of words, each one adding a letter, e.g. `['ride', 'redid', 'riddle']`. \n",
+    "- **Term**: a scrambled word, e.g. `'deir'`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "Word  = str        # A word in the English word list, e.g. 'ride'.\n",
+    "Chain = List[Word] # A list of words, each one adding a letter, e.g. ['ride', 'redid', 'riddle']. \n",
+    "Term  = str        # A scrambled word, e.g. 'deir'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then some functions and constants:\n",
+    "\n",
+    "- **scramble**: I'll define `scramble(word)` to return one of the *n*! ways to scramble an *n*-letter word. I'll arbitrarily choose alphabetical order.\n",
+    "- **terms**: I'll define `all_terms` as a dict of `{Term: Word}` pairs, e.g.  `{'deir': 'ride', 'ddeir': 'redid', ...}`.\n",
+    "- **starting**: I'll define `starting_terms` to be the set of 4-letter terms from `all_terms`, e.g. `{'deir', 'acef', 'eirt', ...]`.\n",
+    "- **unscramble**: I'll define `unscramble(term)` to return the corresponding word. A term might unscramble to several words; I'll arbitrarily pick one.\n",
+    "- **longest chain**: I'll define`longest(chains)` to return the chain with the most words.\n",
+    "- **one additional letter**: I'll define `add_one_letter(term)` to return the set of valid terms that can be formed by adding one letter to `term`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def scramble(word: Word) -> Term: \n",
+    "    \"\"\"A canonical scrambling of the letters in `word` (into alphabetical order).\"\"\"\n",
+    "    return ''.join(sorted(word))\n",
+    "\n",
+    "all_terms = {scramble(w): w for w in open('enable1.txt').read().split()} # {Term: Word}\n",
+    "\n",
+    "starting_terms = {term for term in all_terms if len(term) == 4} # 4-letter terms\n",
+    "\n",
+    "unscramble = all_terms.get # e.g., unscramble('deir') = 'ride'\n",
+    "\n",
+    "def longest(chains) -> Chain: return max(chains, key=len, default=[])\n",
+    "\n",
+    "def add_one_letter(term: Term) -> Set[Term]:\n",
+    "    \"\"\"The set of valid Terms (from `all_terms`) that can be formed by adding one letter to `term`.\"\"\"\n",
+    "    terms = (scramble(term + L) for L in 'abcdefghijklmnopqrstuvwxyz')\n",
+    "    return {t for t in terms if t in all_terms} "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The following assertions serve as unit tests and also as examples of usage:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert scramble('ride') == 'deir'\n",
+    "assert unscramble('deir') == 'ride'\n",
+    "assert scramble('dire') == scramble('ired') == scramble('ride')\n",
+    "assert unscramble('billowy') == scramble('billowy') == 'billowy'\n",
+    "assert longest(([1, 2, 3], [1, 2], [1, 2, 3, 4], [3, 2, 1])) == [1, 2, 3, 4]\n",
+    "assert longest([]) == [] # The longest of no chains is the empty chain\n",
+    "assert add_one_letter('abck') == {'aabck', 'abckl', 'abcks'}\n",
+    "assert add_one_letter(scramble('back')) == {scramble('aback'), scramble('black'), scramble('backs')}\n",
+    "assert add_one_letter(scramble('riddler')) == {scramble('riddlers')}\n",
+    "assert add_one_letter(scramble('riddlers')) == set()\n",
+    "assert len(all_terms) == 156473\n",
+    "assert len(starting_terms) == 2674"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Solution\n",
+    "\n",
+    "We are now ready to compute \"the longest chain ... starting with four letters.\" First we define `longest_starting_from(term)` to compute the longest chain starting from a term, then we apply that to every `starting_term`, and take the `longest`. To compute `longest_starting_from(term)`, recursively apply `longest_starting_from` to every one-letter addition to `term`, and then return the longest of those chains, with the unscrambling of `term` prepended. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 1.35 s, sys: 9.73 ms, total: 1.36 s\n",
+      "Wall time: 1.36 s\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "['nine',\n",
+       " 'inane',\n",
+       " 'eonian',\n",
+       " 'enation',\n",
+       " 'nominate',\n",
+       " 'nominates',\n",
+       " 'antinomies',\n",
+       " 'nitrosamine',\n",
+       " 'terminations',\n",
+       " 'antimodernist',\n",
+       " 'determinations',\n",
+       " 'underestimation',\n",
+       " 'underestimations']"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "cache = functools.lru_cache(None) # Used as a function decorator to avoid re-computation\n",
+    "\n",
+    "@cache\n",
+    "def longest_starting_from(term: Term) -> Chain:\n",
+    "    \"\"\"The longest chain starting with `term`. Equal to the unscrambling of `term`,\n",
+    "    followed by the longest chain starting from a one-letter addition to `term` (if there are any).\"\"\"\n",
+    "    chains = map(longest_starting_from, add_one_letter(term))\n",
+    "    return [unscramble(term), *longest(chains)]\n",
+    "\n",
+    "%time longest(map(longest_starting_from, starting_terms)) # A solution"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that the longest chain gets up to a 16-letter word. There are other chains that are equally long, and it turns out that all of them end with either `indeterminations` or `underestimations`. \n",
+    "\n",
+    "*Note*: Typically a recursive function has the form \"if terminal_condition return something else recurse.\" The function `longest_starting_from` does not need an explicit check for a terminal condition, because when `add_one_letter(term)` returns the empty set, the function `longest` deals with it (via the `default=[]` keyword)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extra Credit\n",
+    "\n",
+    "How many possible games of Anigrams are there? A game is a chain that starts with a 4-letter word and ends with a 9-letter word. Swapping the order of letters in any word \"does not result in a distinct game.\" (I don't have to worry about that because my `unscramble(term)` always picks the same word, given a term.)\n",
+    "\n",
+    "The number of games starting from a given term is exactly 1 if the term has 9 letters, and is otherwise the sum of the number of games starting from each one-letter addition to term. To find the total number of possible games, just sum up the `games_starting_from` for each 4-letter term:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@cache\n",
+    "def games_starting_from(term: Term, goal_length=9) -> int:\n",
+    "    \"\"\"How many games start with `term` and end with a word of `goal_length` letters?\"\"\"\n",
+    "    if len(term) == goal_length:\n",
+    "        return 1\n",
+    "    else:\n",
+    "        return sum(games_starting_from(t, goal_length) for t in add_one_letter(term))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 816 ms, sys: 4.69 ms, total: 821 ms\n",
+      "Wall time: 820 ms\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "4510515"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%time sum(map(games_starting_from, starting_terms))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are 4,510,515 total possible games. \n",
+    "\n",
+    "## More Tests, Examples, and Notes\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert games_starting_from('beep') == 1\n",
+    "assert longest_starting_from('beep') == ['beep', 'plebe', 'beleap', 'beleapt', 'bedplate', 'bedplates']\n",
+    "assert max([games_starting_from(term), term] for term in starting_terms) == [64422, 'aers']\n",
+    "for word in 'alerts alters artels estral laster ratels salter slater staler stelar talers'.split():\n",
+    "    assert scramble(word) == 'aelrst' \n",
+    "for (term, word) in all_terms.items():\n",
+    "    assert scramble(word) == term\n",
+    "    assert unscramble(term) == word\n",
+    "    assert scramble(unscramble(term)) == term\n",
+    "    assert unscramble(scramble(word)) == word"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The words `'a'` and `'I'` are not in `enable1.txt`, because one-letter words are not valid Scrabble™️ words. We could add them, and then find the longest chain of all:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['a',\n",
+       " 'as',\n",
+       " 'sea',\n",
+       " 'teas',\n",
+       " 'teams',\n",
+       " 'stamen',\n",
+       " 'tameins',\n",
+       " 'mannites',\n",
+       " 'nominates',\n",
+       " 'antinomies',\n",
+       " 'nitrosamine',\n",
+       " 'terminations',\n",
+       " 'antimodernist',\n",
+       " 'determinations',\n",
+       " 'underestimation',\n",
+       " 'underestimations']"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_terms.update({'a': 'a', 'i': 'i'})\n",
+    "longest(map(longest_starting_from, all_terms))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Note*: The puzzle description uses the phrase \"collection of letters\" and the word \"anagram\" as synonyms for \"term\". I chose to use \"term\" throughout because \"collection of letters\" is too long, and \"anagram\" (although more descriptive than the bland \"term\") is not quite right: \"ride\" and \"dire\" are indeed anagrams of each other, but \"deir\" is not an anagram, because it is not an English word. Rather, \"deir\" is a *scrambled collection of letters that correspond to a word*. Also, I like the fact that \"term\" and \"word\" have the same number of letters.\n",
+    "\n",
+    "*Note*: This whole notebook only takes about 3 seconds to run. The use of the `@cache` decorator means that the complexity of the whole algorithm is *O*(*T*), where *T* is the number of terms. Without the `@cache`, many chains would have to be re-computed multiple times. For example, starting at `'ride'` you can add one letter to get to either `'riced'` or `'rimed'`. From either of those you can add one letter to get to `'dermic'`, and from there there's a whole subtree of possibilities to explore. With `@cache`, that subtree is only computed once. Without `@cache` it is computed twice, and lots of other subtrees are computed multiple times, making the overall run time about ten-fold slower.\n",
+    "\n",
+    "*Note*: I could have represented a chain as a list of scrambled terms, rather than a list of words, but I figure if you want the solution, you'd rather see words, not scrambles. I also could have represented a chain as an initial set of four letters plus the letters that get added, in order. For example, `'deir' + 'd' + 'l' + 'r' + 's' == 'deirdlrs'`. That would have been a more compact representation than a list, but messier to deal with. For example, the word `'dermic'` would be reached by two paths, `'deircm'` and `'deirmc'`, and we would lose the advantage of `@cache` unless we took some convoluted steps to translate back and forth among the chain format, the scrambled word format, and the unscrambled word format.\n",
+    "\n",
+    "*Note*: Because the order of iteration over sets is not fixed in Python, different runs of this notebook can produce different results.\n",
+    "\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}