diff --git a/ipynb/SpellingBee.ipynb b/ipynb/SpellingBee.ipynb index 7f42b25..bc7fd35 100644 --- a/ipynb/SpellingBee.ipynb +++ b/ipynb/SpellingBee.ipynb @@ -19,7 +19,7 @@ "> 2. The word must include the central letter.\n", "> 3. The word cannot include any letter beyond the seven given letters.\n", ">\n", - ">Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, etc. Words that use all seven letters in the honeycomb are known as **pangrams** and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 8 + 7 = 15 points.\n", + ">Note that letters can be repeated. For example, GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, etc. Words that use all seven letters in the honeycomb are known as **pangrams** and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 8 + 7 = 15 points.\n", ">\n", "> ***Which seven-letter honeycomb results in the highest possible score?*** To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.\n", ">\n", @@ -27,7 +27,7 @@ "\n", "\n", "\n", - "Since the referenced [word list](https://norvig.com/ngrams/enable1.txt) came from [***my*** web site](https://norvig.com/ngrams), I felt somewhat compelled to solve this one. (Note it is a standard Scrabble® word list that I happen to host a copy of; I didn't curate it.) \n", + "Since the referenced word list came from [***my*** web site](https://norvig.com/ngrams), I felt compelled to solve this puzzle. 
(Note that it is a standard public-domain Scrabble® word list that I happen to host a copy of; I didn't curate it; Mendel Cooper and Alan Beale did.) \n", "\n", "I'll show you how I address the problem. First some imports, then we'll work through 10 steps." ] @@ -38,24 +38,26 @@ "metadata": {}, "outputs": [], "source": [ - "from collections import Counter, defaultdict\n", + "from collections import defaultdict\n", "from dataclasses import dataclass\n", "from itertools import combinations\n", - "from typing import List, Set, Dict, Tuple" + "from typing import List, Set, Dict, Tuple, Iterable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 1: Words, Word Scores, and Pangrams\n", + "# 1: Letters, Lettersets, Words, and Pangrams\n", "\n", - "Let's start by defining some basic terms:\n", + "Let's start by defining the most basic terms:\n", "\n", - "- **valid word**: a string of at least 4 letters ('A' to 'Z' but not 'S'), and not more than 7 distinct letters.\n", - "- **word list**: a list of valid words.\n", - "- **pangram**: a word with exactly 7 distinct letters.\n", - "- **word score**: 1 for a four letter word, or the length of the word for longer words, plus 7 for a pangram." + "- **Letter**: the valid letters are uppercase 'A' to 'Z', but not 'S'.\n", + "- **Letterset**: the set of distinct letters in a word.\n", + "- **Word**: a string of letters.\n", + "- **valid word**: a word of at least 4 letters, all valid, and not more than 7 distinct letters.\n", + "- **pangram**: a valid word with exactly 7 distinct letters.\n", + "- **word list**: a list of valid words." 
] }, { @@ -64,26 +66,31 @@ "metadata": {}, "outputs": [], "source": [ - "def is_valid(word) -> bool:\n", - " \"\"\"Is word 4 or more letters, no 'S', and no more than 7 distinct letters?\"\"\"\n", - " return len(word) >= 4 and 'S' not in word and len(set(word)) <= 7\n", + "letters = set('ABCDEFGHIJKLMNOPQR' + 'TUVWXYZ')\n", + "Letter = str\n", + "Letterset = str\n", + "Word = str \n", "\n", - "def word_list(text) -> List[str]: \n", - " \"\"\"All the valid words in text (uppercased).\"\"\"\n", - " return [w for w in text.upper().split() if is_valid(w)]\n", + "def letterset(word) -> Letterset:\n", + " \"\"\"The set of distinct letters in a word.\"\"\"\n", + " return ''.join(sorted(set(word)))\n", + "\n", + "def is_valid(word) -> bool:\n", + " \"\"\"Is word 4 or more valid letters and no more than 7 distinct letters?\"\"\"\n", + " return len(word) >= 4 and letters.issuperset(word) and len(set(word)) <= 7 \n", "\n", "def is_pangram(word) -> bool: return len(set(word)) == 7\n", "\n", - "def word_score(word) -> int: \n", - " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n", - " return 1 if len(word) == 4 else len(word) + 7 * is_pangram(word)" + "def word_list(text: str) -> List[Word]: \n", + " \"\"\"All the valid words in a text (uppercased).\"\"\"\n", + " return [w for w in text.upper().split() if is_valid(w)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "I'll make a mini word list to experiment with: " + "Here's a mini word list to experiment with:" ] }, { @@ -111,7 +118,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that `em` and `gem` are too short, `gems` has an `s` which is not allowed, and `amalgamation` has too many distinct letters (8). We're left with six valid words out of the ten candidate words. Here are examples of the other two functions in action:" + "Note that `em` and `gem` are too short, `gems` has an `s`, and `amalgamation` has 8 distinct letters. 
We're left with six valid words out of the ten candidate words. The pangrams are:" ] }, { @@ -131,7 +138,19 @@ } ], "source": [ - "{w for w in mini if is_pangram(w)}" + "set(filter(is_pangram, mini))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Why did I choose to represent a `Letterset` as a sorted string (and not a `set`)? Because:\n", + "- A `set` can't be the key of a dict.\n", + "- A `frozenset` can be a key, and would be a reasonable choice for `Letterset`, but it:\n", + " - Takes up more memory than a `str`.\n", + " - Is verbose and hard to read when debugging: `frozenset({'A', 'G', 'L', 'M'})`\n", + "- A `str` of distinct letters in sorted order fixes all these issues." ] }, { @@ -142,12 +161,7 @@ { "data": { "text/plain": [ - "{'AMALGAM': 7,\n", - " 'CACCIATORE': 17,\n", - " 'EROTICA': 14,\n", - " 'GAME': 1,\n", - " 'GLAM': 1,\n", - " 'MEGAPLEX': 15}" + "'AGLM'" ] }, "execution_count": 5, @@ -156,16 +170,20 @@ } ], "source": [ - "{w: word_score(w) for w in mini}" + "assert letterset('AMALGAM') == letterset('GLAM')\n", + "\n", + "letterset('AMALGAM')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 2: Honeycombs and Lettersets\n", + "# 2: Honeycombs\n", "\n", - "A honeycomb lattice consists of (1) a set of seven distinct letters and (2) the one distinguished center letter:\n" + "A honeycomb lattice consists of:\n", + "- A set of seven distinct letters\n", + "- The one distinguished center letter" ] }, { @@ -174,17 +192,11 @@ "metadata": {}, "outputs": [], "source": [ - "Letter = Letterset = str # Types\n", - "\n", "@dataclass(frozen=True, order=True)\n", "class Honeycomb:\n", " \"\"\"A Honeycomb lattice, with 7 letters, 1 of which is the center.\"\"\"\n", - " letters: Letterset # 7 letters\n", - " center: Letter # 1 letter\n", - " \n", - "def letterset(word) -> Letterset:\n", - " \"\"\"The set of letters in a word, represented as a sorted str.\"\"\"\n", - " return ''.join(sorted(set(word)))" + " letters: 
Letterset \n", + " center: Letter " ] }, { @@ -212,11 +224,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The type `Letter` is a `str` of 1 letter and `Letterset` is an unordered collection of letters, which I will represent as a sorted `str`. Why not a Python `set` or `frozenset`? Because a `str` takes up less space in memory, and its printed representation is easier to read when debugging. Compare:\n", - "- `frozenset({'A', 'E', 'G', 'L', 'M', 'P', 'X'})`\n", - "- `'AEGLMPX'`\n", + "# 3: Scoring\n", "\n", - "Why sorted? So that equal lettersets are equal:" + "- The **word score** is 1 point for a 4-letter word, or the word length for longer words, plus 7 bonus points for a pangram.\n", + "- The **game score** for a honeycomb is the sum of the word scores for the words that the honeycomb can make. \n", + "- A honeycomb **can make** a word if:\n", + " - the word contains the honeycomb's center, and\n", + " - every letter in the word is in the honeycomb. " ] }, { @@ -225,36 +239,49 @@ "metadata": {}, "outputs": [], "source": [ - "assert letterset('EROTICA') == letterset('CACCIATORE') == 'ACEIORT'" + "def word_score(word) -> int: \n", + " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n", + " return 1 if len(word) == 4 else (len(word) + 7 * is_pangram(word))\n", + "\n", + "def game_score(honeycomb, wordlist) -> int:\n", + " \"\"\"The total score for this honeycomb.\"\"\"\n", + " return sum(word_score(w) for w in wordlist if can_make(honeycomb, w))\n", + "\n", + "def can_make(honeycomb, word) -> bool:\n", + " \"\"\"Can the honeycomb make this word?\"\"\"\n", + " return honeycomb.center in word and all(L in honeycomb.letters for L in word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 3: Game Score\n", - "\n", - "The **game score** for a honeycomb is the sum of the word scores for all the words that the honeycomb **can make**. 
\n", - "\n", - "A honeycomb can make a word if\n", - "(1) the word contains the honeycomb's center, and\n", - "(2) every letter in the word is in the honeycomb. " + "The word scores, game score (on `hc`), and makeable words for `mini` are as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'AMALGAM': 7,\n", + " 'CACCIATORE': 17,\n", + " 'EROTICA': 14,\n", + " 'GAME': 1,\n", + " 'GLAM': 1,\n", + " 'MEGAPLEX': 15}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "def game_score(honeycomb, wordlist) -> int:\n", - " \"\"\"The total score for this honeycomb.\"\"\"\n", - " return sum(word_score(w) \n", - " for w in wordlist if can_make(honeycomb, w))\n", - "\n", - "def can_make(honeycomb, word) -> bool:\n", - " \"\"\"Can the honeycomb make this word?\"\"\"\n", - " return honeycomb.center in word and all(L in honeycomb.letters for L in word)" + "{w: word_score(w) for w in mini}" ] }, { @@ -265,7 +292,7 @@ { "data": { "text/plain": [ - "{'AMALGAM': 7, 'GAME': 1, 'GLAM': 1, 'MEGAPLEX': 15}" + "24" ] }, "execution_count": 10, @@ -274,28 +301,39 @@ } ], "source": [ - "{w: word_score(w) for w in mini if can_make(hc, w)}" + "game_score(hc, mini) # 7 + 1 + 1 + 15" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'AMALGAM', 'GAME', 'GLAM', 'MEGAPLEX'}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "assert game_score(hc, mini) == 24 == sum(_.values())" + "{w for w in mini if can_make(hc, w)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 4: Top Honeycomb\n", + "# 4: Top Honeycomb on Mini Word List\n", "\n", - "The strategy for finding the top (highest-scoring) honeycomb is:\n", - " - Compile a list of valid candidate honeycombs.\n", + "A simple strategy for 
finding the top (highest-game-score) honeycomb is:\n", + " - Compile a list of all valid candidate honeycombs.\n", " - For each honeycomb, compute the game score.\n", - " - Return a (score, honeycomb) tuple with the highest score." + " - Return a (score, honeycomb) tuple with the maximum score." ] }, { @@ -304,19 +342,17 @@ "metadata": {}, "outputs": [], "source": [ - "def top_honeycomb(words) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", - " return max((game_score(h, words), h) \n", - " for h in candidate_honeycombs(words))" + "def top_honeycomb(wordlist) -> Tuple[int, Honeycomb]: \n", + " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", + " return max((game_score(h, wordlist), h) \n", + " for h in candidate_honeycombs(wordlist))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "What are the possible candidate honeycombs? We can put any letter (except 'S') in the center, then any 6 remaining letters around the outside; this gives 25 × (24 choose 6) = 3,364,900 possible honeycombs. It would take hours to apply `game_score` to all of these.\n", - "\n", - "Fortunately, we can use the constraint that a valid honeycomb **must make at least one pangram**. So the letters of any valid honeycomb must ***be*** the letterset of some pangram (and the center can be any of those letters):" + "What are the possible candidate honeycombs? We could try all letters in all slots, but that's a lot of honeycombs. Fortunately, we can use the constraint that a valid honeycomb **must make at least one pangram**. 
So the letters of any valid honeycomb must ***be*** the letterset of some pangram (and the center can be any of those letters):" ] }, { @@ -325,15 +361,15 @@ "metadata": {}, "outputs": [], "source": [ - "def candidate_honeycombs(words) -> List[Honeycomb]:\n", + "def candidate_honeycombs(wordlist) -> List[Honeycomb]:\n", " \"\"\"Valid honeycombs have pangram letters, with any center.\"\"\"\n", " return [Honeycomb(letters, center) \n", - " for letters in pangram_lettersets(words)\n", + " for letters in pangram_lettersets(wordlist)\n", " for center in letters]\n", "\n", - "def pangram_lettersets(words) -> Set[Letterset]:\n", - " \"\"\"All lettersets from the pangram words.\"\"\"\n", - " return {letterset(w) for w in words if is_pangram(w)}" + "def pangram_lettersets(wordlist) -> Set[Letterset]:\n", + " \"\"\"All lettersets from the pangram words in wordlist.\"\"\"\n", + " return {letterset(w) for w in wordlist if is_pangram(w)}" ] }, { @@ -364,20 +400,20 @@ { "data": { "text/plain": [ - "[Honeycomb(letters='AEGLMPX', center='A'),\n", - " Honeycomb(letters='AEGLMPX', center='E'),\n", - " Honeycomb(letters='AEGLMPX', center='G'),\n", - " Honeycomb(letters='AEGLMPX', center='L'),\n", - " Honeycomb(letters='AEGLMPX', center='M'),\n", - " Honeycomb(letters='AEGLMPX', center='P'),\n", - " Honeycomb(letters='AEGLMPX', center='X'),\n", - " Honeycomb(letters='ACEIORT', center='A'),\n", + "[Honeycomb(letters='ACEIORT', center='A'),\n", " Honeycomb(letters='ACEIORT', center='C'),\n", " Honeycomb(letters='ACEIORT', center='E'),\n", " Honeycomb(letters='ACEIORT', center='I'),\n", " Honeycomb(letters='ACEIORT', center='O'),\n", " Honeycomb(letters='ACEIORT', center='R'),\n", - " Honeycomb(letters='ACEIORT', center='T')]" + " Honeycomb(letters='ACEIORT', center='T'),\n", + " Honeycomb(letters='AEGLMPX', center='A'),\n", + " Honeycomb(letters='AEGLMPX', center='E'),\n", + " Honeycomb(letters='AEGLMPX', center='G'),\n", + " Honeycomb(letters='AEGLMPX', center='L'),\n", + " 
Honeycomb(letters='AEGLMPX', center='M'),\n", + " Honeycomb(letters='AEGLMPX', center='P'),\n", + " Honeycomb(letters='AEGLMPX', center='X')]" ] }, "execution_count": 15, @@ -393,7 +429,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now we're ready to find the highest-scoring honeycomb with the mini word list:" + "Now we're ready to find the highest-scoring honeycomb with the `mini` word list:" ] }, { @@ -420,11 +456,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**It works.** But that's just the mini word list. \n", + "**The program works.** But that's just the mini word list. \n", "\n", - "# 5: The enable1 Word List\n", + "# 5: The Full Word List\n", "\n", - "Here's the real word list, `enable1.txt`, and some counts derived from it:" + "Here's the full-scale word list, `enable1.txt`:" ] }, { @@ -445,30 +481,24 @@ "aalii\n", "aaliis\n", "aals\n", - "aardvark\n" + "aardvark\n", + " 172820 enable1.txt\n" ] } ], "source": [ - "! [ -e enable1.txt ] || curl -O http://norvig.com/ngrams/enable1.txt\n", - "! head enable1.txt" + "! [ -e enable1.txt ] || curl -O http://norvig.com/ngrams/enable1.txt\n", + "! head enable1.txt\n", + "! wc -w enable1.txt" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " 172820 enable1.txt\n" - ] - } - ], + "outputs": [], "source": [ - "! 
wc -w enable1.txt" + "enable1 = word_list(open('enable1.txt').read())" ] }, { @@ -477,108 +507,46 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "44585" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Some counts for 'enable1.txt':\n", + " 172,820 total words\n", + " 44,585 valid Spelling Bee words\n", + " 14,741 pangram words\n", + " 7,986 distinct pangram lettersets\n", + " 55,902 candidate pangram-containing honeycombs\n", + "3,364,900 or 25 × (24 choose 6) possible honeycombs (98% invalid, non-pangram)\n" + ] } ], "source": [ - "enable1 = word_list(open('enable1.txt').read())\n", - "\n", - "len(enable1)" + "print(f\"\"\"Some counts for 'enable1.txt':\n", + "{172820:9,d} total words\n", + "{len(enable1):9,d} valid Spelling Bee words\n", + "{sum(map(is_pangram, enable1)):9,d} pangram words\n", + "{len(pangram_lettersets(enable1)):9,d} distinct pangram lettersets\n", + "{len(candidate_honeycombs(enable1)):9,d} candidate pangram-containing honeycombs\n", + "{25*24*23*22*21*20*19//720:9,d} or 25 × (24 choose 6) possible honeycombs (98% invalid, non-pangram)\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How long will it take to run `top_honeycomb(enable1)`? 
Most of the computation time is in `game_score`, which is called once for each of the 55,902 candidate honeycombs and on each call looks at all 44,585 valid words, so let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "14741" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len([w for w in enable1 if is_pangram(w)])" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "7986" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(pangram_lettersets(enable1))" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "55902" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(candidate_honeycombs(enable1))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To summarize, there are:\n", - "\n", - "- 172,820 words in the `enable1` word list\n", - "- 44,585 valid Spelling Bee words\n", - "- 14,741 pangram words \n", - "- 7,986 distinct pangram lettersets\n", - "- 55,902 (7 × 7,986) candidate pangram-containing honeycombs\n", - "- out of 3,364,900 theoretically possible honeycombs\n", - "\n", - "How long will it take to run `top_honeycomb(enable1)`? Most of the computation time is in `game_score` (each call has to look at all 44,585 valid words), so let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "8.48 ms ± 29.1 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" + "8.47 ms ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], @@ -590,47 +558,51 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Roughly 9 milliseconds on my computer (this may vary). How many seconds to run `game_score` for all 55,902 valid honeycombs?" + "Roughly 8 or 9 milliseconds for one honeycomb. For all 55,902 valid honeycombs (in minutes):" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "503.118" + "8.385299999999999" ] }, - "execution_count": 24, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "55902 * 9/1000" + ".009 * 55902 / 60" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "About 500 seconds, or 8 minutes. I could run `top_honeycomb(enable1)`, take a coffee break, come back, and declare victory. \n", + "About 8 or 9 minutes. I could run `top_honeycomb(enable1)`, get a coffee, come back, and declare victory. \n", "\n", - "But I think that a puzzle like this deserves a more elegant solution. I'd like to get the run time under a minute (as is suggested in [Project Euler](https://projecteuler.net/)), and I have an idea how to do it. \n", + "But I think that a puzzle like this deserves a more elegant solution. And I have an idea. \n", "\n", "# 6: Faster Algorithm: Points Table\n", "\n", - "Here's my plan:\n", + "Here's my idea:\n", "\n", - "1. Keep the same strategy of trying every pangram letterset, but do some precomputation that will make `game_score` much faster.\n", - "1. The precomputation is: compute the `letterset` and `word_score` for each word in the word list, and make a table of `{letterset: total_points}` giving the total number of word score points for all the words that correspond to each letterset. I call this a **points table**.\n", - "3. 
These calculations are independent of the honeycomb, so they need be done only once, not 55,902 times. \n", - "4. Every word that a honeycomb can make is formed from a **letter subset** of the honeycomb's 7 letters. A valid letter subset must include the center letter, and may include any non-empty subset of the other 6 letters, so there are $2^6 - 1 = 63$ valid letter subsets. \n", - "4. `game_score2` considers each of the 63 letter subsets of a honeycomb, and sums the point table entry for each one. \n", - "5. Thus, `game_score2` iterates over just 63 letter subsets; a big optimization over `game_score`, which iterated over 44,585 words.\n", + "1. Try every pangram letterset, but do some precomputation to make `game_score` much faster:\n", + " - Compute the `letterset` and `word_score` for each word in the word list.\n", + " - Make a table of `{letterset: total_points}` giving the total points of all words with a given letterset. \n", + " - I call this a **points table**.\n", + " - These calculations are independent of the honeycomb, so are done once, not 55,902 times. \n", + "2. `game_score2` considers every letter subset of a honeycomb, and sums the points table entries. \n", + " - Every word that a honeycomb can make is formed from a **letter subset** of the honeycomb's 7 letters. \n", + " - A letter subset must include the center letter, and may include any non-empty subset of the other 6 letters.\n", + " - So there are $2^6 - 1 = 63$ valid letter subsets. 
\n", + " - Thus, `game_score2` iterates over just 63 letter subsets; much fewer than 44,585 valid words.\n", "\n", "\n", "Here's the code:" @@ -638,22 +610,22 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 22, "metadata": {}, "outputs": [], "source": [ - "PointsTable = Dict[Letterset, int] # How many points does a letterset score?\n", + "PointsTable = Dict[Letterset, int] # How many total points does a letterset score?\n", "\n", - "def top_honeycomb2(words) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", - " points_table = tabulate_points(words)\n", + "def top_honeycomb2(wordlist) -> Tuple[int, Honeycomb]: \n", + " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", + " points_table = tabulate_points(wordlist)\n", " return max((game_score2(h, points_table), h) \n", - " for h in candidate_honeycombs(words))\n", + " for h in candidate_honeycombs(wordlist))\n", "\n", - "def tabulate_points(words) -> PointsTable:\n", - " \"\"\"Return a Counter of {letterset: points} from words.\"\"\"\n", - " table = Counter()\n", - " for w in words:\n", + "def tabulate_points(wordlist) -> PointsTable:\n", + " \"\"\"A table of {letterset: points} from words.\"\"\"\n", + " table = defaultdict(int)\n", + " for w in wordlist:\n", " table[letterset(w)] += word_score(w)\n", " return table\n", "\n", @@ -673,14 +645,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's get a feel for how this works. \n", - "\n", - "First consider `letter_subsets`. A 4-letter honeycomb makes $2^3 - 1= 7$ subsets; 7-letter honeycombs make $2^6 - 1= 63$ subsets:" + "Let's get a feel for how this works. 
First, a 4-letter honeycomb has 7 letter subsets and a 7-letter honeycomb has 63:" ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 23, "metadata": {}, "outputs": [ { @@ -689,7 +659,7 @@ "['AG', 'LG', 'MG', 'ALG', 'AMG', 'LMG', 'ALMG']" ] }, - "execution_count": 26, + "execution_count": 23, "metadata": {}, "output_type": "execute_result" } @@ -698,53 +668,67 @@ "letter_subsets(Honeycomb('ALMG', 'G')) " ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's eminded ourselves what `mini` is, and compute `tabulate_points(mini)`:" - ] - }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 24, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "mini = ['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX']\n" - ] - }, { "data": { "text/plain": [ - "Counter({'AGLM': 8, 'ACEIORT': 31, 'AEGM': 1, 'AEGLMPX': 15})" + "63" ] }, - "execution_count": 27, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "print('mini =', mini)\n", - "tabulate_points(mini)" + "len(letter_subsets(Honeycomb(letterset('MEGAPLEX'), 'G')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The letterset `'AGLM'` gets 8 points, 7 for AMALGAM and 1 for GLAM. `'ACEIORT'` gets 31 points, 17 for CACCIATORE and 14 for EROTICA. The other lettersets represent one word each. 
\n", - "\n", - "Let's make sure we haven't broken the `top_honeycomb` function:" + "Now let's look at the `mini` word list and the points table for it:" ] }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX'],\n", + " defaultdict(int, {'AGLM': 8, 'ACEIORT': 31, 'AEGM': 1, 'AEGLMPX': 15}))" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "points_table = tabulate_points(mini)\n", + "mini, points_table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The letterset `'AGLM'` gets 8 points (7 for AMALGAM and 1 for GLAM). `'ACEIORT'` gets 31 points (17 for CACCIATORE and 14 for EROTICA). `'AEGM'` gets 1 for GAME and `'AEGLMPX'` gets 15 for MEGAPLEX. The other 59 lettersets have no words, no points.\n", + "\n", + "Let's make sure the new `top_honeycomb2` function works as well as the old one:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ @@ -762,20 +746,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Finally, the solution to the puzzle on the real word list:" + "We can now solve the puzzle on the real word list:" ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 1.65 s, sys: 3.78 ms, total: 1.65 s\n", - "Wall time: 1.65 s\n" + "CPU times: user 1.43 s, sys: 4.11 ms, total: 1.44 s\n", + "Wall time: 1.44 s\n" ] }, { @@ -784,7 +768,7 @@ "(3898, Honeycomb(letters='AEGINRT', center='R'))" ] }, - "execution_count": 29, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } @@ -799,19 +783,19 @@ "source": [ "**Wow! 
3898 is a high score!** And the whole computation took **less than 2 seconds**!\n", "\n", - "We can see that `game_score2` is about 300 times faster than `game_score`:" + "We can see that `game_score2` is about 400 times faster than `game_score`:" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "26.4 µs ± 90.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + "21.3 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], @@ -827,14 +811,14 @@ "source": [ "# 8: Even Faster Algorithm: Branch and Bound\n", "\n", - "A run time of less than 2 seconds is pretty good! But I'm not ready to stop now.\n", + "A run time of less than 2 seconds is pretty good! But I think I can do even better.\n", "\n", - "Consider the word **JUKEBOX**. It is a pangram, but what with the **J**, **K**, and **X**, it is a low-scoring honeycomb, regardless of what center is used:" + "Consider **JUKEBOX**. 
It is a pangram, but with **J**, **K**, and **X**, it scores poorly, regardless of the center:" ] }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 29, "metadata": {}, "outputs": [ { @@ -849,7 +833,7 @@ " Honeycomb(letters='BEJKOUX', center='X'): 15}" ] }, - "execution_count": 31, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" } @@ -869,45 +853,45 @@ "- For each pangram letterset, ask \"if we weren't required to use the center letter, what would this letterset score?\"\n", "- Check if that score (which is an upper bound of the score using any one center letter) is higher than the top score so far.\n", "- If yes, then try it with all seven centers; if not then discard it without trying any centers.\n", - "- This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: prune a whole **branch** (of 7 honeycombs) if an upper **bound** can't beat the top score.\n", + " - This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: prune a whole **branch** (of 7 honeycombs) if an upper **bound** can't beat the top score found so far.\n", "\n", - "To compute the score of a honeycomb with no center, it turns out I can just call `game_score2` on `Honeycomb(letters, '')`. This works because of a quirk of Python: `game_score2` checks if `honeycomb.center in letters`; normally in Python the expression `x in y` means \"is `x` a member of the collection `y`\", but when `y` is a string it means \"is `x` a substring of `y`\", and the empty string is a substring of every string. (If I had represented a letterset as a Python `set`, this wouldn't work.)\n", + "*Note*: To represent a honeycomb with no center, I can just use `Honeycomb(letters, '')`. 
This works because of a quirk of Python: `game_score2` checks if `honeycomb.center in letters`; normally in Python the expression `e in s` means \"*is* `e` *an element of the collection* `s`\", but when `s` is a string it means \"*is* `e` *a substring of* `s`\", and the empty string is a substring of every string. (If I had represented a `Letterset` as a Python `set`, this wouldn't work.)\n", "\n", - "Thus, I can rewrite `top_honeycomb` as follows:" + "Thus, I can rewrite `top_honeycomb` in this more efficient form:" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def top_honeycomb3(words) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", + " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", " points_table = tabulate_points(words)\n", - " top_score, top = -1, None\n", - " pangrams = (s for s in points_table if len(s) == 7)\n", + " top_score, top_honeycomb = -1, None\n", + " pangrams = [s for s in points_table if len(s) == 7]\n", " for p in pangrams:\n", " if game_score2(Honeycomb(p, ''), points_table) > top_score:\n", " for center in p:\n", " honeycomb = Honeycomb(p, center)\n", " score = game_score2(honeycomb, points_table)\n", " if score > top_score:\n", - " top_score, top = score, honeycomb\n", - " return top_score, top" + " top_score, top_honeycomb = score, honeycomb\n", + " return top_score, top_honeycomb" ] }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 360 ms, sys: 2 ms, total: 362 ms\n", - "Wall time: 361 ms\n" + "CPU times: user 309 ms, sys: 1.67 ms, total: 311 ms\n", + "Wall time: 310 ms\n" ] }, { @@ -916,7 +900,7 @@ "(3898, Honeycomb(letters='AEGINRT', center='R'))" ] }, - "execution_count": 33, + "execution_count": 31, "metadata": {}, "output_type": 
"execute_result" } @@ -929,126 +913,48 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Awesome! We get the same answer, and the computation is about 5 times faster; about **1/3 second**.\n", + "Awesome! We get the correct answer, and it runs four times faster.\n", "\n", - "For how many pangram lettersets did we have to check all 7 centers? We can find out by copy-pasting `top_honeycomb3` and annotating it to keep a `COUNT` of the number of pangrams that are checked, and to print a line of output when a honeycomb (either with or without a center letter) outscores the top score." + "How many honeycombs does `top_honeycomb3` examine? We can use `functools.lru_cache` to make `Honeycomb` keep track:" ] }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 32, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ADFLORW scores 333 with center ∅ (pangram 1 (1/7986))\n", - " scores 265 with center A \n", - "ABDENOR scores 1856 with center ∅ (pangram 2 (2/7986))\n", - " scores 1148 with center A \n", - " scores 1476 with center D \n", - " scores 1578 with center E \n", - "ABEIRTV scores 1585 with center ∅ (pangram 3 (4/7986))\n", - "ABDEORT scores 2434 with center ∅ (pangram 4 (28/7986))\n", - " scores 1679 with center A \n", - " scores 2134 with center E \n", - "ABCDERT scores 2254 with center ∅ (pangram 5 (35/7986))\n", - "ACELNRT scores 2529 with center ∅ (pangram 6 (46/7986))\n", - " scores 2158 with center A \n", - " scores 2225 with center E \n", - "ACDELRT scores 2746 with center ∅ (pangram 7 (47/7986))\n", - " scores 2273 with center A \n", - " scores 2608 with center E \n", - "ACENORT scores 2799 with center ∅ (pangram 8 (50/7986))\n", - "ACEIPRT scores 2653 with center ∅ (pangram 9 (57/7986))\n", - "ACDEIRT scores 3407 with center ∅ (pangram 10 (71/7986))\n", - " scores 3023 with center E \n", - "ACEINRT scores 3575 with center ∅ (pangram 11 (77/7986))\n", - "ADEOPRT scores 3031 with center ∅ 
(pangram 12 (157/7986))\n", - "AEGINRT scores 4688 with center ∅ (pangram 13 (178/7986))\n", - " scores 3372 with center A \n", - " scores 3769 with center E \n", - " scores 3782 with center N \n", - " scores 3898 with center R \n", - "ADEINRT scores 4020 with center ∅ (pangram 14 (419/7986))\n" - ] - }, { "data": { "text/plain": [ - "(3898, Honeycomb(letters='AEGINRT', center='R'))" + "CacheInfo(hits=0, misses=8084, maxsize=None, currsize=8084)" ] }, - "execution_count": 34, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def top_honeycomb3_annotated(words) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb. Print stuff.\"\"\"\n", - " points_table = tabulate_points(words)\n", - " top_score, top = -1, None\n", - " pangrams = [s for s in points_table if len(s) == 7]\n", - " COUNT = 0\n", - " for i, p in enumerate(pangrams, 1):\n", - " if game_score2(Honeycomb(p, ''), points_table) > top_score:\n", - " COUNT +=1; \n", - " print(f'{p} scores {game_score2(Honeycomb(p, \"\"), points_table):4}',\n", - " f'with center ∅ (pangram {COUNT:2} ({i}/{len(pangrams)}))')\n", - " for center in p:\n", - " honeycomb = Honeycomb(p, center)\n", - " score = game_score2(honeycomb, points_table)\n", - " if score > top_score:\n", - " top_score, top = score, honeycomb\n", - " print(f'{\" \":8}scores {top_score:4} with center {top.center} ')\n", - " return top_score, top\n", - "\n", - "top_honeycomb3_annotated(enable1)" + "import functools\n", + "Honeycomb = functools.lru_cache(None)(Honeycomb)\n", + "top_honeycomb3(enable1)\n", + "Honeycomb.cache_info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Only 14 pangram lettersets had to have all 7 centers checked. 
We were lucky that the 4th pangram out of 7986, ABDEORT, happened to be aa good one, scoring 2134 points (with center E), setting a high score so that most of the remaining pangrams only needed to be checked with an empty center. The total number of calls to `game_score2` is:" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "8084" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(pangram_lettersets(enable1)) + 14 * 7" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "8,084 is a big improvement over 55,902.\n", + "This says that `top_honeycomb3` examined 8,084 honeycombs; an almost 7-fold improvement over the 55,902 examined by `top_honeycomb2`.\n", "\n", "# 9: Fancy Report\n", "\n", - "I'd like to see the actual words that each honeycomb can make, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this turned out to be a lot more complicated than I expected. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)." + "I'd like to see the actual words that each honeycomb can make, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this report turned out to be a lot more complicated than I anticipated. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)." 
] }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 33, "metadata": {}, "outputs": [], "source": [ @@ -1067,7 +973,7 @@ " subsets = letter_subsets(honeycomb)\n", " nwords = sum(len(bins[s]) for s in subsets)\n", " print(f'{adj}{honeycomb} scores {Ns(score, \"point\")} on {Ns(nwords, \"word\")}',\n", - " f'from a {len(words)} word list:\\n')\n", + " f'from a {len(words):,d} word list:\\n')\n", " for s in sorted(subsets, key=lambda s: (-len(s), s)):\n", " if bins[s]:\n", " pts = sum(word_score(w) for w in bins[s])\n", @@ -1080,7 +986,7 @@ "def Ns(n, noun):\n", " \"\"\"A string with `n` followed by the plural or singular of noun:\n", " Ns(3, 'bear') => '3 bears'; Ns(1, 'world') => '1 world'\"\"\" \n", - " return f\"{n:d} {noun}{' ' if n == 1 else 's'}\"\n", + " return f\"{n:,d} {noun}{' ' if n == 1 else 's'}\"\n", "\n", "def group_by(items, key):\n", " \"Group items into bins of a dict, each bin keyed by key(item).\"\n", @@ -1090,9 +996,16 @@ " return bins" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here are reports for the mini and full word lists:" + ] + }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -1108,19 +1021,19 @@ } ], "source": [ - "report(Honeycomb('AEGLMPX', 'G'), mini)" + "report(hc, mini)" ] }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Top Honeycomb(letters='AEGINRT', center='R') scores 3898 points on 537 words from a 44585 word list:\n", + "Top Honeycomb(letters='AEGINRT', center='R') scores 3,898 points on 537 words from a 44,585 word list:\n", "\n", "AEGINRT 832 points 50 pangrams AERATING(15) AGGREGATING(18) ARGENTINE(16) ARGENTITE(16) ENTERTAINING(19)\n", " ENTRAINING(17) ENTREATING(17) GARNIERITE(17) GARTERING(16) GENERATING(17) GNATTIER(15) GRANITE(14)\n", @@ -1233,16 +1146,16 @@ }, { "cell_type": "code", - 
"execution_count": 39, + "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Top Honeycomb(letters='AEINRST', center='E') scores 8681 points on 1179 words from a 98141 word list:\n", + "Top Honeycomb(letters='AEINRST', center='E') scores 8,681 points on 1,179 words from a 98,141 word list:\n", "\n", - "AEINRST 1381 points 86 pangrams ANESTRI(14) ANTISERA(15) ANTISTRESS(17) ANTSIER(14) ARENITES(15) ARSENITE(15)\n", + "AEINRST 1,381 points 86 pangrams ANESTRI(14) ANTISERA(15) ANTISTRESS(17) ANTSIER(14) ARENITES(15) ARSENITE(15)\n", " ARSENITES(16) ARTINESS(15) ARTINESSES(17) ATTAINERS(16) ENTERTAINERS(19) ENTERTAINS(17) ENTRAINERS(17)\n", " ENTRAINS(15) ENTREATIES(17) ERRANTRIES(17) INERTIAS(15) INSTANTER(16) INTENERATES(18) INTERSTATE(17)\n", " INTERSTATES(18) INTERSTRAIN(18) INTERSTRAINS(19) INTRASTATE(17) INTREATS(15) IRATENESS(16)\n", @@ -1420,8 +1333,8 @@ } ], "source": [ - "enable1s = [w for w in open('enable1.txt').read().upper().split()\n", - " if len(w) >= 4 and len(set(w)) <= 7]\n", + "letters.add('S') # Make 'S' a legal letter\n", + "enable1s = word_list(open('enable1.txt').read())\n", "\n", "report(words=enable1s)" ] @@ -1432,14 +1345,14 @@ "source": [ "Allowing 'S' words more than doubles the score!\n", "\n", - "Here are pictures for the highest-scoring honeycombs, with and without an S:\n", + "Here are the highest-scoring honeycombs (with and without an S) with their stats and mnemonics:\n", "\n", "\n", "
\n", " 537 words                         1,179 words \n", "
50 pangrams                         86 pangrams\n", "
3,898 points                         8,681 points\n", - "
\n", + "
  RETAINING                         ENTERTAINERS\n", "
" ] }, @@ -1449,22 +1362,22 @@ "source": [ "# Summary\n", "\n", - "This notebook showed how to find the highest-scoring honeycomb. Four ideas led to four approaches:\n", + "This notebook showed how to find the highest-scoring honeycomb, with a baseline and three key ideas:\n", "\n", - "1. **Brute Force Enumeration**: Compute the game score for every possible honeycomb; return the highest-scoring.\n", - "2. **Pangram Lettersets**: Compute the game score for just the honeycombs that are pangram lettersets (with all possible centers).\n", - "3. **Points Table**: Precompute the score for each letterset; then for each candidate honeycomb, sum the scores of the 63 letter subsets.\n", + "1. **Brute Force Enumeration**: Compute game score for every honeycomb; return the highest-scoring.\n", + "2. **Pangram Lettersets**: Compute game score for honeycombs that are pangram lettersets (with all 7 centers).\n", + "3. **Points Table**: Precompute score for each letterset; for each candidate honeycomb, sum 63 letter subset scores.\n", "4. **Branch and Bound**: Try all 7 centers only for lettersets that score better than the top score so far.\n", "\n", - "These ideas led to substantial improvements (gain) in number of honeycombs processed, `game_score` run time, and total run time:\n", + "The key ideas paid off in efficiency improvements:\n", "\n", "\n", - "|Approach|Honeycombs|Gain|`game_score` Time|Gain|Total Run Time|Gain|\n", + "|Approach|Honeycombs|Reduction|`game_score` Time|Speedup|Overall Time|Overall Speedup|\n", "|--------|----------|--------|----|---|---|---|\n", - "|**1. Brute Force Enumeration**|3,364,900|——|9000 microseconds|——|8.5 hours|——|\n", - "|**2. Pangram Lettersets**|55,902|60×|9000 microseconds|——|500 seconds|60×|\n", - "|**3. Points Table**|55,902|60×|25 microseconds|360×|1.6 seconds|20,000×|\n", - "|**4. Branch and Bound**|8,084 |400×|25 microseconds|360×|0.36 seconds|80,000×|\n", + "|1. 
**Brute Force Enumeration**|3,364,900|——|9000 microseconds|——|8.5 hours (est.)|——|\n", + "|2. **Pangram Lettersets**|55,902|60×|9000 microseconds|——|500 sec (est.)|60×|\n", + "|3. **Points Table**|55,902|——|22 microseconds|400×|1.5 seconds|20,000×|\n", + "|4. **Branch and Bound**|8,084 |7×|22 microseconds|——|0.31 seconds|100,000×|\n", "\n" ] }
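The branch-and-bound idea summarized in the table can be illustrated with a small self-contained sketch. This is not the notebook's code: the names `points_table_sketch`, `subset_score`, and `top_honeycomb_sketch` are hypothetical, a honeycomb is represented as a plain `(letters, center)` tuple, and a center of `None` stands for "no center required" (rather than relying on the empty-string substring quirk). It assumes the words passed in are already valid words.

```python
from collections import defaultdict
from itertools import combinations

def word_score(word):
    """1 point for a four-letter word, else its length; 7 bonus points for a pangram."""
    bonus = 7 if len(set(word)) == 7 else 0
    return (1 if len(word) == 4 else len(word)) + bonus

def points_table_sketch(words):
    """Map each letterset (a sorted string of distinct letters) to its total word score."""
    table = defaultdict(int)
    for w in words:
        table[''.join(sorted(set(w)))] += word_score(w)
    return table

def subset_score(letters, center, table):
    """Sum the table over all non-empty subsets of `letters`.
    If `center` is None, every subset counts (this is the upper bound);
    otherwise only subsets containing `center` count."""
    return sum(table.get(''.join(subset), 0)
               for n in range(1, len(letters) + 1)
               for subset in combinations(letters, n)
               if center is None or center in subset)

def top_honeycomb_sketch(words):
    """Return (score, (letters, center), evaluations) using branch and bound."""
    table = points_table_sketch(words)
    best_score, best, evaluations = -1, None, 0
    for letters in [s for s in table if len(s) == 7]:
        evaluations += 1  # one evaluation for the center-free upper bound
        if subset_score(letters, None, table) > best_score:
            for center in letters:  # bound is promising: branch on all 7 centers
                evaluations += 1
                score = subset_score(letters, center, table)
                if score > best_score:
                    best_score, best = score, (letters, center)
        # otherwise the whole branch of 7 honeycombs is pruned
    return best_score, best, evaluations
```

On a toy list such as `['CACCIATORE', 'EROTICA', 'GAME', 'AMALGAM', 'GLAM', 'MEGAPLEX']`, the second pangram letterset has an upper bound below the score already found, so its seven centers are never tried.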