From 1ba5935e125686e6cf6c20cecb5eee86278e5fc7 Mon Sep 17 00:00:00 2001 From: Peter Norvig Date: Tue, 9 Nov 2021 13:38:46 -0800 Subject: [PATCH] Add files via upload --- ipynb/SpellingBee.ipynb | 867 ++++++++++++++++++++-------------------- 1 file changed, 427 insertions(+), 440 deletions(-) diff --git a/ipynb/SpellingBee.ipynb b/ipynb/SpellingBee.ipynb index 015d807..a57a1ba 100644 --- a/ipynb/SpellingBee.ipynb +++ b/ipynb/SpellingBee.ipynb @@ -12,7 +12,7 @@ "\n", "> In this game, seven letters are arranged in a **honeycomb** lattice, with one letter in the center. Here’s the lattice from Dec. 24, 2019:\n", "> \n", - "> \n", + "> \n", "> \n", "> The goal is to identify as many words as possible that meet the following criteria:\n", "> 1. The word must be at least four letters long.\n", @@ -23,13 +23,12 @@ ">\n", "> ***Which seven-letter honeycomb results in the highest possible score?*** To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.\n", ">\n", - "> For consistency, please use [this word list](https://norvig.com/ngrams/enable1.txt) to check your game score.\n", + "> For consistency, please use [this word list](https://norvig.com/ngrams/enable1.txt) to check your score.\n", "\n", "\n", + "Since the referenced word list came from [***my*** web site](https://norvig.com/ngrams), I felt compelled to solve this puzzle. (Note the word list is a standard public domain Scrabble® dictionary that I happen to host a copy of; I didn't curate it, Mendel Cooper and Alan Beale did.) \n", "\n", - "Since the referenced word list came from [***my*** web site](https://norvig.com/ngrams), I felt compelled to solve this puzzle. (Note it is a standard public domain Scrabble® word list that I happen to host a copy of; I didn't curate it, Mendel Cooper and Alan Beale did.) \n", - "\n", - "I'll show you how I address the problem. 
First some imports, then we'll work through 10 steps." + "I'll show you how I address the problem. First some imports:" ] }, { @@ -48,16 +47,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# 1: Letters, Lettersets, Words, and Pangrams\n", + "# Letters, Words, Lettersets, and Pangrams\n", "\n", "Let's start by defining the most basic terms:\n", "\n", - "- **Letter**: the valid letters are uppercase 'A' to 'Z', but not 'S'.\n", - "- **Letterset**: the set of distinct letters in a word.\n", + "- **Valid letter**: the valid letters are uppercase 'A' to 'Z', but not 'S'.\n", "- **Word**: A string of letters.\n", - "- **valid word**: a word of at least 4 letters, all valid, and not more than 7 distinct letters.\n", - "- **pangram**: a valid word with exactly 7 distinct letters.\n", - "- **word list**: a list of valid words." + "- **Word list**: a list of valid words.\n", + "- **Valid word**: a word of at least 4 valid letters and not more than 7 distinct letters.\n", + "- **Letterset**: the set of distinct letters in a word; e.g. letterset('BOOBOO') = 'BO'.\n", + "- **Pangram**: a valid word with exactly 7 distinct letters.\n", + "- **Pangram letterset**: the letterset for a pangram." 
] }, { @@ -66,24 +66,38 @@ "metadata": {}, "outputs": [], "source": [ - "letters = set('ABCDEFGHIJKLMNOPQR' + 'TUVWXYZ')\n", - "Letter = str\n", - "Letterset = str\n", - "Word = str \n", + "valid_letters = set('ABCDEFGHIJKLMNOPQR' + 'TUVWXYZ')\n", + "Letter = str # A string of one letter\n", + "Word = str # A string of multiple letters\n", + "Letterset = str # A sorted string of distinct letters\n", + "\n", + "def word_list(text: str) -> List[Word]: \n", + " \"\"\"All the valid words in a text.\"\"\"\n", + " return [w for w in text.upper().split() if is_valid(w)]\n", + "\n", + "def is_valid(word) -> bool: \n", + " return len(word) >= 4 and valid_letters.issuperset(word) and len(set(word)) <= 7\n", "\n", "def letterset(word) -> Letterset:\n", - " \"\"\"The set of distinct letters in a word.\"\"\"\n", + " \"\"\"The set of distinct letters in a word, represented as a sorted str.\"\"\"\n", " return ''.join(sorted(set(word)))\n", "\n", - "def is_valid(word) -> bool:\n", - " \"\"\"Is word 4 or more valid letters and no more than 7 distinct letters?\"\"\"\n", - " return len(word) >= 4 and letters.issuperset(word) and len(set(word)) <= 7 \n", - "\n", "def is_pangram(word) -> bool: return len(set(word)) == 7\n", "\n", - "def word_list(text: str) -> List[Word]: \n", - " \"\"\"All the valid words in a text (uppercased).\"\"\"\n", - " return [w for w in text.upper().split() if is_valid(w)]" + "def pangram_lettersets(wordlist) -> Set[Letterset]:\n", + " \"\"\"All lettersets from the pangram words in wordlist.\"\"\"\n", + " return {letterset(w) for w in wordlist if is_pangram(w)}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Why did I represent a `Letterset` as a sorted string of distinct letters, and not a `set`? 
Because:\n", + "- A `set` can't be the key of a dict.\n", + "- A `frozenset` could be a key, and would be a reasonable choice for `Letterset`, but it:\n", + " - Takes up more memory than a `str`.\n", + " - Is harder to read when debugging: `frozenset({'A', 'G', 'L', 'M'})` versus `'AGLM'`." ] }, { @@ -118,7 +132,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that `em` and `gem` are too short, `gems` has an `s`, and `amalgamation` has 8 distinct letters. We're left with six valid words out of the ten candidate words. The pangrams are:" + "Note that `em` and `gem` are too short, `gems` has an `s`, and `amalgamation` has 8 distinct letters. We're left with six valid words out of the ten candidate words. The pangrams and their lettersets are as follows; there are three pangrams but only two pangram lettersets because CACCIATORE and EROTICA have the same letterset:" ] }, { @@ -129,7 +143,7 @@ { "data": { "text/plain": [ - "{'CACCIATORE', 'EROTICA', 'MEGAPLEX'}" + "{'CACCIATORE': 'ACEIORT', 'EROTICA': 'ACEIORT', 'MEGAPLEX': 'AEGLMPX'}" ] }, "execution_count": 4, @@ -138,112 +152,42 @@ } ], "source": [ - "set(filter(is_pangram, mini))" + "assert len(pangram_lettersets(mini)) == 2\n", + "\n", + "{w: letterset(w) for w in mini if is_pangram(w)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Why did I choose to represent a `Letterset` as a sorted string (and not a `set`)? Because:\n", - "- A `set` can't be the key of a dict.\n", - "- A `frozenset` can be a key, and would be a reasonable choice for `Letterset`, but it:\n", - " - Takes up more memory than a `str`.\n", - " - Is verbose and hard to read when debugging: `frozenset({'A', 'G', 'L', 'M'})`\n", - "- A `str` of distinct letters in sorted order fixes all these issues." 
+ "# Honeycombs and Scoring\n", + "\n", + "- A **honeycomb** lattice consists of two attributes:\n", + " - A letterset of seven distinct letters\n", + " - A single distinguished center letter\n", + "- The **word score** is 1 point for a 4-letter word, or the word length for longer words, plus 7 bonus points for a pangram.\n", + "- The **total score** for a honeycomb is the sum of the word scores for the words that the honeycomb **can make**. \n", + "- A honeycomb **can make** a word if the word contains the honeycomb's center, and every letter in the word is in the honeycomb. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'AGLM'" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "assert letterset('AMALGAM') == letterset('GLAM')\n", - "\n", - "letterset('AMALGAM')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 2: Honeycombs\n", - "\n", - "A honeycomb lattice consists of:\n", - "- A set of seven distinct letters\n", - "- The one distinguished center letter" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, "outputs": [], "source": [ "@dataclass(frozen=True, order=True)\n", "class Honeycomb:\n", " \"\"\"A Honeycomb lattice, with 7 letters, 1 of which is the center.\"\"\"\n", " letters: Letterset \n", - " center: Letter " - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Honeycomb(letters='AEGLMPX', center='G')" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "hc = Honeycomb(letterset('MEGAPLEX'), 'G')\n", - "hc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 3: Scoring\n", - "\n", - "- The **word score** is 1 point for a 4-letter word, or the word length for longer words, plus 7 bonus points for a pangram.\n", 
- "- The **game score** for a honeycomb is the sum of the word scores for the words that the honeycomb can make. \n", - "- A honeycomb **can make** a word if:\n", - " - the word contains the honeycomb's center, and\n", - " - every letter in the word is in the honeycomb. " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ + " center: Letter\n", + " \n", "def word_score(word) -> int: \n", " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n", " return 1 if len(word) == 4 else (len(word) + 7 * is_pangram(word))\n", "\n", - "def game_score(honeycomb, wordlist) -> int:\n", + "def total_score(honeycomb, wordlist) -> int:\n", " \"\"\"The total score for this honeycomb.\"\"\"\n", " return sum(word_score(w) for w in wordlist if can_make(honeycomb, w))\n", "\n", @@ -256,12 +200,40 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The word scores, game score (on `hc`), and makeable words for `mini` are as follows:" + "Here is the honeycomb from the diagram at the top of the notebook:" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Honeycomb(letters='AEGLMPX', center='G')" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "hc = Honeycomb(letterset('LAPGEMX'), 'G')\n", + "hc" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The word scores, makeable words, and total score for this honeycomb on the `mini` word list are as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -275,7 +247,7 @@ " 'MEGAPLEX': 15}" ] }, - "execution_count": 9, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -286,27 +258,7 @@ }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "24" - ] - }, 
- "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "game_score(hc, mini) # 7 + 1 + 1 + 15" - ] - }, - { - "cell_type": "code", - "execution_count": 11, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -315,7 +267,7 @@ "{'AMALGAM', 'GAME', 'GLAM', 'MEGAPLEX'}" ] }, - "execution_count": 11, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -324,27 +276,47 @@ "{w for w in mini if can_make(hc, w)}" ] }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "24" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "total_score(hc, mini) # 7 + 1 + 1 + 15" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 4: Top Honeycomb on Mini Word List\n", + "# Finding the Top-Scoring Honeycomb\n", "\n", - "A simple strategy for finding the top (highest-game-score) honeycomb is:\n", + "A simple strategy for finding the top (highest total score) honeycomb is:\n", " - Compile a list of all valid candidate honeycombs.\n", - " - For each honeycomb, compute the game score.\n", - " - Return a (score, honeycomb) tuple with the maximum score." + " - For each honeycomb, compute the total score.\n", + " - Return a (score, honeycomb) tuple for a honeycomb with the maximum score." ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def top_honeycomb(wordlist) -> Tuple[int, Honeycomb]: \n", " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", - " return max((game_score(h, wordlist), h) \n", + " return max((total_score(h, wordlist), h) \n", " for h in candidate_honeycombs(wordlist))" ] }, @@ -352,12 +324,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "What are the possible candidate honeycombs? 
We could try all letters in all slots, but that's a lot of honeycombs. Fortunately, we can use the constraint that a valid honeycomb **must make at least one pangram**. So the letters of any valid honeycomb must ***be*** the letterset of some pangram (and the center can be any of those letters):" + "What are the possible candidate honeycombs? We could try all letters in all slots, but that's a **lot** of honeycombs:\n", + "- The center can be any valid letter (25 choices, because 'S' is not allowed).\n", + "- The outside can be any six of the remaining 24 letters.\n", + "- All together, that's 25 × (24 choose 6) = 3,364,900 candidate honeycombs.\n", + "\n", + "Fortunately, we can use the constraint that **there must be at least one pangram** in a valid honeycomb. So the letters of any valid honeycomb must ***be*** the letterset of some pangram (and the center can be any one of the seven letters):" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -365,36 +342,12 @@ " \"\"\"Valid honeycombs have pangram letters, with any center.\"\"\"\n", " return [Honeycomb(letters, center) \n", " for letters in pangram_lettersets(wordlist)\n", - " for center in letters]\n", - "\n", - "def pangram_lettersets(wordlist) -> Set[Letterset]:\n", - " \"\"\"All lettersets from the pangram words in wordlist.\"\"\"\n", - " return {letterset(w) for w in wordlist if is_pangram(w)}" + " for center in letters]" ] }, { "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'ACEIORT', 'AEGLMPX'}" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pangram_lettersets(mini)" - ] - }, - { - "cell_type": "code", - "execution_count": 15, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -416,25 +369,25 @@ " Honeycomb(letters='AEGLMPX', center='X')]" ] }, - "execution_count": 15, + "execution_count": 
12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "candidate_honeycombs(mini) # 2×7 of them" + "candidate_honeycombs(mini) # 7 candidates for each of the 2 pangram lettersets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now we're ready to find the highest-scoring honeycomb with the `mini` word list:" + "Now we're ready to find the highest-scoring honeycomb with respect to the `mini` word list:" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -443,7 +396,7 @@ "(31, Honeycomb(letters='ACEIORT', center='T'))" ] }, - "execution_count": 16, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -456,16 +409,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**The program works.** But that's just the mini word list. \n", + "The program appears to work. But that's just the mini word list. \n", "\n", - "# 5: The Full Word List\n", + "# Big Word List\n", "\n", - "Here's the full-scale word list, `enable1.txt`:" + "Here's the big word list:" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -494,47 +447,156 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ - "enable1 = word_list(open('enable1.txt').read())" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Some counts for 'enable1.txt':\n", - " 172,820 total words\n", - " 44,585 valid Spelling Bee words\n", - " 14,741 pangram words\n", - " 7,986 distinct pangram lettersets\n", - " 55,902 candidate pangram-containing honeycombs\n", - "3,364,900 or 25 × (24 choose 6) possible honeycombs (98% invalid, non-pangram)\n" - ] - } - ], - "source": [ - "print(f\"\"\"Some counts for 'enable1.txt':\n", - "{172820:9,d} total words\n", - "{len(enable1):9,d} valid 
Spelling Bee words\n", - "{sum(map(is_pangram, enable1)):9,d} pangram words\n", - "{len(pangram_lettersets(enable1)):9,d} distinct pangram lettersets\n", - "{len(candidate_honeycombs(enable1)):9,d} candidate pangram-containing honeycombs\n", - "{25*24*23*22*21*20*19//720:9,d} or 25 × (24 choose 6) possible honeycombs (98% invalid, non-pangram)\"\"\")" + "file = 'enable1.txt'\n", + "big = word_list(open(file).read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "How long will it take to run `top_honeycomb(enable1)`? Most of the computation time is in `game_score`, which is called once for each of the 44,585 valid words, so let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:" + "Here are some statistics:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "172,820 total words\n", + " 44,585 valid Spelling Bee words\n", + " 14,741 pangram words\n", + " 7,986 distinct pangram lettersets\n", + " 55,902 candidate honeycombs\n" + ] + } + ], + "source": [ + "print(f\"\"\"172,820 total words\n", + "{len(big):7,d} valid Spelling Bee words\n", + "{sum(map(is_pangram, big)):7,d} pangram words\n", + "{len(pangram_lettersets(big)):7,d} distinct pangram lettersets\n", + "{len(candidate_honeycombs(big)):7,d} candidate honeycombs\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How long will it take to run `top_honeycomb(big)`? Most of the computation time is in `total_score`, which is called once for each of the 55,902 candidate honeycombs, so let's estimate the total time by first checking how long it takes to compute the total score of a single honeycomb:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8.6 ms ± 60.3 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%timeit total_score(hc, big)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Roughly 9 milliseconds for one honeycomb. For all 55,902 valid honeycombs:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "503.11799999999994" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + ".009 * 55902" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "About 500 seconds, which is under 9 minutes. I could run `top_honeycomb(big)`, get a coffee, come back, and declare victory. \n", + "\n", + "But I think that a puzzle like this deserves a more elegant solution. \n", + "\n", + "# Faster Scoring: Points Table\n", + "\n", + "Here's an idea to make `total_score` faster by doing some precomputation:\n", + "\n", + "1. Do the following computation only once:\n", + " - Compute the `letterset` and `word_score` for each word in the word list. \n", + " - Make a table of `{letterset: sum_of_word_scores}` giving the total score for each letterset. \n", + " - I call this a **points table**.\n", + "2. For each honeycomb, do the following:\n", + " - Consider every **letter subset** of the honeycomb's 7 letters that includes the center letter.\n", + " - Sum the points table entries for each of these letter subsets.\n", + "\n", + "The resulting algorithm, `fast_total_score`, iterates over just 2<sup>6</sup> – 1 = 63 letter subsets; much fewer than 44,585 valid words. 
The function `top_honeycomb2` creates the points table and calls `fast_total_score`:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "def top_honeycomb2(wordlist) -> Tuple[int, Honeycomb]: \n", + " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", + " table = PointsTable(wordlist)\n", + " return max((fast_total_score(h, table), h) \n", + " for h in candidate_honeycombs(wordlist))\n", + "\n", + "class PointsTable(dict):\n", + " \"\"\"A table of {letterset: points} from words.\"\"\"\n", + " def __init__(self, wordlist):\n", + " for w in wordlist:\n", + " self[letterset(w)] += word_score(w)\n", + " def __missing__(self, key): return 0\n", + "\n", + "def letter_subsets(honeycomb) -> List[Letterset]:\n", + " \"\"\"The 63 subsets of the letters in the honeycomb, each including the center letter.\"\"\"\n", + " return [letters \n", + " for n in range(2, 8) \n", + " for letters in map(''.join, combinations(honeycomb.letters, n))\n", + " if honeycomb.center in letters]\n", + "\n", + "def fast_total_score(honeycomb, points_table) -> int:\n", + " \"\"\"The total score for this honeycomb, using a points table.\"\"\"\n", + " return sum(points_table[s] for s in letter_subsets(honeycomb))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is the points table for the mini word list:" ] }, { @@ -543,22 +605,33 @@ "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "8.47 ms ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" - ] + "data": { + "text/plain": [ + "{'AGLM': 8, 'ACEIORT': 31, 'AEGM': 1, 'AEGLMPX': 15}" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "%timeit game_score(hc, enable1)" + "table = PointsTable(mini)\n", + "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Roughly 8 or 9 milliseconds for one honeycomb. 
For all 55,902 valid honeycombs (in minutes):" + "The letterset `'AGLM'` gets 8 points (7 for AMALGAM and 1 for GLAM). `'ACEIORT'` gets 31 points (17 for CACCIATORE and 14 for EROTICA). `'AEGM'` gets 1 for GAME and `'AEGLMPX'` gets 15 for MEGAPLEX. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is the honeycomb `hc` again, and its 63 letter subsets:" ] }, { @@ -569,7 +642,7 @@ { "data": { "text/plain": [ - "8.385299999999999" + "Honeycomb(letters='AEGLMPX', center='G')" ] }, "execution_count": 21, @@ -578,188 +651,67 @@ } ], "source": [ - ".009 * 55902 / 60" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "About 8 or 9 minutes. I could run `top_honeycomb(enable1)`, get a coffee, come back, and declare victory. \n", - "\n", - "But I think that a puzzle like this deserves a more elegant solution. And I have an idea. \n", - "\n", - "# 6: Faster Algorithm: Points Table\n", - "\n", - "Here's my idea:\n", - "\n", - "1. Try every pangram letterset, but do some precomputation to make `game_score` much faster:\n", - " - Compute the `letterset` and `word_score` for each word in the word list. \n", - " - Make a table of `{letterset: total_points}` giving the total points of all words with a given letterset. \n", - " - I call this a **points table**.\n", - " - These calculations are independent of the honeycomb, so are done once, not 55,902 times. \n", - "2. `game_score2` considers every letter subset of a honeycomb, and sums the point table entries. \n", - " - Every word that a honeycomb can make is formed from a **letter subset** of the honeycomb's 7 letters. \n", - " - A letter subset must include the center letter, and may include any non-empty subset of the other 6 letters.\n", - " - So there are 2<sup>6</sup> – 1 = 63 valid letter subsets. 
\n", - " - Thus, `game_score2` iterates over just 63 letter subsets; much fewer than 44,585 valid words.\n", - "\n", - "\n", - "Here's the code:" + "hc" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['AG', 'EG', 'GL', 'GM', 'GP', 'GX', 'AEG', 'AGL', 'AGM', 'AGP', 'AGX', 'EGL', 'EGM', 'EGP', 'EGX', 'GLM', 'GLP', 'GLX', 'GMP', 'GMX', 'GPX', 'AEGL', 'AEGM', 'AEGP', 'AEGX', 'AGLM', 'AGLP', 'AGLX', 'AGMP', 'AGMX', 'AGPX', 'EGLM', 'EGLP', 'EGLX', 'EGMP', 'EGMX', 'EGPX', 'GLMP', 'GLMX', 'GLPX', 'GMPX', 'AEGLM', 'AEGLP', 'AEGLX', 'AEGMP', 'AEGMX', 'AEGPX', 'AGLMP', 'AGLMX', 'AGLPX', 'AGMPX', 'EGLMP', 'EGLMX', 'EGLPX', 'EGMPX', 'GLMPX', 'AEGLMP', 'AEGLMX', 'AEGLPX', 'AEGMPX', 'AGLMPX', 'EGLMPX', 'AEGLMPX']\n" + ] + } + ], "source": [ - "PointsTable = Dict[Letterset, int] # How many total points does a letterset score?\n", - "\n", - "def top_honeycomb2(wordlist) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", - " points_table = tabulate_points(wordlist)\n", - " return max((game_score2(h, points_table), h) \n", - " for h in candidate_honeycombs(wordlist))\n", - "\n", - "def tabulate_points(wordlist) -> PointsTable:\n", - " \"\"\"A table of {letterset: points} from words.\"\"\"\n", - " table = defaultdict(int)\n", - " for w in wordlist:\n", - " table[letterset(w)] += word_score(w)\n", - " return table\n", - "\n", - "def letter_subsets(honeycomb) -> List[Letterset]:\n", - " \"\"\"The 63 subsets of the letters in the honeycomb, each including the center letter.\"\"\"\n", - " return [letters \n", - " for n in range(2, 8) \n", - " for letters in map(''.join, combinations(honeycomb.letters, n))\n", - " if honeycomb.center in letters]\n", - "\n", - "def game_score2(honeycomb, points_table) -> int:\n", - " \"\"\"The total score for this honeycomb, using a points table.\"\"\"\n", - " return 
sum(points_table[s] for s in letter_subsets(honeycomb))" + "print(letter_subsets(hc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's get a feel for how this works. First, a 4-letter honeycomb has 7 letter subsets and a 7-letter honeycomb has 63:" + "The total from `fast_total_score` is the sum of its letter subsets (only 3 of which are in `PointsTable(mini)`):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['AG', 'LG', 'MG', 'ALG', 'AMG', 'LMG', 'ALMG']" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "letter_subsets(Honeycomb('ALMG', 'G')) " + "assert fast_total_score(hc, table) == 24 == table['AGLM'] + table['AEGM'] + table['AEGLMPX']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Finding the Top-Scoring Honeycomb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now solve the puzzle on the big word list:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "63" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(letter_subsets(Honeycomb(letterset('MEGAPLEX'), 'G')))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's look at the `mini` word list and the points table for it:" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX'],\n", - " defaultdict(int, {'AGLM': 8, 'ACEIORT': 31, 'AEGM': 1, 'AEGLMPX': 15}))" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "points_table = tabulate_points(mini)\n", - "mini, points_table" - ] - }, - { - "cell_type": "markdown", - 
"metadata": {}, - "source": [ - "The letterset `'AGLM'` gets 8 points (7 for AMALGAM and 1 for GLAM). `'ACEIORT'` gets 31 points (17 for CACCIATORE and 14 for EROTICA). `'AEGM'` gets 1 for GAME and `'AEGLMPX'` gets 15 for MEGAPLEX. \n", - "\n", - "Let's make sure the new `top_honeycomb2` function works as well as the old one:" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [], - "source": [ - "assert top_honeycomb(mini) == top_honeycomb2(mini)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 7: The Solution" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can now solve the puzzle on the real word list:" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 1.43 s, sys: 4.11 ms, total: 1.44 s\n", - "Wall time: 1.44 s\n" + "CPU times: user 1.59 s, sys: 4.08 ms, total: 1.59 s\n", + "Wall time: 1.59 s\n" ] }, { @@ -768,57 +720,36 @@ "(3898, Honeycomb(letters='AEGINRT', center='R'))" ] }, - "execution_count": 27, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "%time top_honeycomb2(enable1)" + "%time top_honeycomb2(big)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Wow! 3898 is a high score!** And the whole computation took **less than 2 seconds**!\n", - "\n", - "We can see that `game_score2` is about 400 times faster than `game_score`:" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "21.3 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" - ] - } - ], - "source": [ - "points_table = tabulate_points(enable1)\n", - "\n", - "%timeit game_score2(hc, points_table)" + "**Wow! 3898 is a high score!** And the whole computation took **less than 2 seconds**!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 8: Even Faster Algorithm: Branch and Bound\n", + "# Scoring Fewer Honeycombs: Branch and Bound\n", "\n", - "A run time of less than 2 seconds is pretty good! But I think I can do even better.\n", + "A run time of less than 2 seconds to find the top honeycomb is pretty good! Can we do even better?\n", "\n", - "Consider **JUKEBOX**. It is a pangram, but with **J**, **K**, and **X**, it scores poorly, regardless of the center:" + "The program would run faster if we scored fewer honeycombs. But if we want to be guaranteed of finding the top-scoring honeycomb, how can we skip any? Consider the pangram **JUKEBOX**. With the unusual letters **J**, **K**, and **X**, it scores poorly, regardless of the choice of center:" ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -833,49 +764,50 @@ " Honeycomb(letters='BEJKOUX', center='X'): 15}" ] }, - "execution_count": 29, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "honeycombs = [Honeycomb(letterset('JUKEBOX'), C) for C in 'JUKEBOX']\n", + "jk = [Honeycomb(letterset('JUKEBOX'), C) for C in 'JUKEBOX']\n", "\n", - "{h: game_score(h, enable1) for h in honeycombs}" + "{h: total_score(h, big) for h in jk}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "It would be great if we could determine that **JUKEBOX** is not a top honeycomb in one call to `game_score2`, rather than seven calls. 
My idea:\n", + "We might be able to dismiss **JUKEBOX** in one call to `fast_total_score`, rather than seven, with this approach:\n", "- Keep track of the top score found so far.\n", "- For each pangram letterset, ask \"if we weren't required to use the center letter, what would this letterset score?\"\n", - "- Check if that score (which is an upper bound of the score using any one center letter) is higher than the top score so far.\n", - "- If yes, then try it with all seven centers; if not then discard it without trying any centers.\n", - " - This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: prune a whole **branch** (of 7 honeycombs) if an upper **bound** can't beat the top score found so far.\n", + "- Check if that score (which is an **upper bound** of the score using any one center letter) is higher than the top score so far.\n", + " - If yes, then try the pangram letterset with all seven centers; \n", + " - If not then dismiss it without trying any centers.\n", + "- This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: prune a **branch** of 7 honeycombs if an upper **bound** can't beat the top score.\n", "\n", - "*Note*: To represent a honeycomb with no center, I can just use `Honeycomb(letters, '')`. This works because of a quirk of Python: `game_score2` checks if `honeycomb.center in letters`; normally in Python the expression `e in s` means \"*is* `e` *an element of the collection* `s`\", but when `s` is a string it means \"*is* `e` *a substring of* `s`\", and the empty string is a substring of every string. (If I had represented a `Letterset` as a Python `set`, this wouldn't work.)\n", + "*Note*: To represent a honeycomb with no center, I can just use `Honeycomb(p, '')`. 
This works because of a quirk of Python: `letter_subsets` checks if `honeycomb.center in letters`; normally in Python the expression `e in s` means \"*is* `e` *an element of the collection* `s`\", but when `s` is a string it means \"*is* `e` *a substring of* `s`\", and the empty string is a substring of every string. \n", "\n", - "Thus, I can rewrite `top_honeycomb` in this more efficient form:" + "I can rewrite `top_honeycomb` as follows:" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def top_honeycomb3(words) -> Tuple[int, Honeycomb]: \n", " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n", - " points_table = tabulate_points(words)\n", + " table = PointsTable(words)\n", " top_score, top_honeycomb = -1, None\n", - " pangrams = [s for s in points_table if len(s) == 7]\n", + " pangrams = [s for s in table if len(s) == 7]\n", " for p in pangrams:\n", - " if game_score2(Honeycomb(p, ''), points_table) > top_score:\n", + " if fast_total_score(Honeycomb(p, ''), table) > top_score:\n", " for center in p:\n", " honeycomb = Honeycomb(p, center)\n", - " score = game_score2(honeycomb, points_table)\n", + " score = fast_total_score(honeycomb, table)\n", " if score > top_score:\n", " top_score, top_honeycomb = score, honeycomb\n", " return top_score, top_honeycomb" @@ -883,15 +815,15 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 309 ms, sys: 1.67 ms, total: 311 ms\n", - "Wall time: 310 ms\n" + "CPU times: user 353 ms, sys: 1.61 ms, total: 354 ms\n", + "Wall time: 354 ms\n" ] }, { @@ -900,27 +832,29 @@ "(3898, Honeycomb(letters='AEGINRT', center='R'))" ] }, - "execution_count": 31, + "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "%time top_honeycomb3(enable1)" + "%time top_honeycomb3(big)" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ - "Awesome! We get the correct answer, and it runs four times faster.\n", + "Awesome! We get the correct answer, and it runs four times faster than `top_honeycomb2`.\n", + "\n", + "# Statistics for top_honeycomb3 and fast_total_score\n", "\n", "How many honeycombs does `top_honeycomb3` examine? We can use `functools.lru_cache` to make `Honeycomb` keep track:" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 28, "metadata": {}, "outputs": [ { @@ -929,7 +863,7 @@ "CacheInfo(hits=0, misses=8084, maxsize=None, currsize=8084)" ] }, - "execution_count": 32, + "execution_count": 28, "metadata": {}, "output_type": "execute_result" } @@ -937,7 +871,7 @@ "source": [ "import functools\n", "Honeycomb = functools.lru_cache(None)(Honeycomb)\n", - "top_honeycomb3(enable1)\n", + "top_honeycomb3(big)\n", "Honeycomb.cache_info()" ] }, @@ -945,22 +879,55 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "`top_honeycomb3` examined 8,084 honeycombs; an almost 7-fold reduction from the 55,902 examined by `top_honeycomb2`.\n", + "`top_honeycomb3` examined 8,084 honeycombs; a 6.9× reduction from the 55,902 examined by `top_honeycomb2`. Since there are 7,986 pangram lettersets, that means we had to look at all 7 centers for only (8084-7986)/7 = 14 of them.\n", "\n", - "# 9: Fancy Report\n", - "\n", - "I'd like to see the actual words that each honeycomb can make, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. This report turned out to be a lot more complicated than I anticipated. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)." + "How much faster is `fast_total_score` than `total_score` (which takes about 9 milliseconds per honeycomb)?" 
] }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "26.3 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "table = PointsTable(big)\n", + "\n", + "%timeit fast_total_score(hc, table)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that `fast_total_score` is about 300 times faster." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fancy Report\n", + "\n", + "I'd like to see the actual words that each honeycomb can make, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. This report turned out to be more complicated than I anticipated. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)." + ] + }, + { + "cell_type": "code", + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from textwrap import fill\n", "\n", - "def report(honeycomb=None, words=enable1):\n", + "def report(honeycomb=None, words=big):\n", " \"\"\"Print stats, words, and word scores for the given honeycomb (or the top\n", " honeycomb if no honeycomb is given) over the given word list.\"\"\"\n", " bins = group_by(words, key=letterset)\n", @@ -969,7 +936,7 @@ " score, honeycomb = top_honeycomb3(words)\n", " else:\n", " adj = \"\"\n", - " score = game_score(honeycomb, words)\n", + " score = total_score(honeycomb, words)\n", " subsets = letter_subsets(honeycomb)\n", " nwords = sum(len(bins[s]) for s in subsets)\n", " print(f'{adj}{honeycomb} scores {Ns(score, \"point\")} on {Ns(nwords, \"word\")}',\n", @@ -1005,7 +972,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 31, "metadata": {}, "outputs": [ { @@ -1026,7 +993,26 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 
34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Top Honeycomb(letters='ACEIORT', center='A') scores 31 points on 2 words from a 6 word list:\n", + "\n", + "ACEIORT 31 points 2 pangrams CACCIATORE(17) EROTICA(14)\n" + ] + } + ], + "source": [ + "report(words=mini)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, "metadata": {}, "outputs": [ { @@ -1132,21 +1118,21 @@ } ], "source": [ - "report(words=enable1)" + "report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 10: 'S' Words\n", + "# 'S' Words\n", "\n", "What if we allowed honeycombs and words to have an 'S' in them? I'll make a new word list, and report on it:" ] }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 33, "metadata": {}, "outputs": [ { @@ -1333,10 +1319,11 @@ } ], "source": [ - "letters.add('S') # Make 'S' a legal letter\n", - "enable1s = word_list(open('enable1.txt').read())\n", + "valid_letters.add('S') # Make 'S' a legal letter\n", "\n", - "report(words=enable1s)" + "big_s = word_list(open(file).read())\n", + "\n", + "report(words=big_s)" ] }, { @@ -1362,22 +1349,22 @@ "source": [ "# Summary\n", "\n", - "This notebook showed how to find the highest-scoring honeycomb, with a baseline and three key ideas:\n", + "This notebook showed how to find the highest-scoring honeycomb, starting from a brute-force baseline approach (which we didn't actually run) and modifying the approach with three key improvements:\n", "\n", - "1. **Brute Force Enumeration**: Compute game score for every honeycomb; return the highest-scoring.\n", - "2. **Pangram Lettersets**: Compute game score for honeycombs that are pangram lettersets (with all 7 centers).\n", + "1. **Brute Force Enumeration**: Compute total score for every honeycomb; return the highest-scoring.\n", + "2. **Pangram Lettersets**: Compute total score for honeycombs that are pangram lettersets (with all 7 centers).\n", "3. 
**Points Table**: Precompute score for each letterset; for each candidate honeycomb, sum 63 letter subset scores.\n", "4. **Branch and Bound**: Try all 7 centers only for lettersets that score better than the top score so far.\n", "\n", "The key ideas paid off in efficiency improvements:\n", "\n", "\n", - "|Approach|Honeycombs|Reduction|`game_score` Time|Speedup|Overall Time|Overall Speedup|\n", + "|Approach|Honeycombs|Reduction|`total_score` Time|Speedup|Overall Time|Overall Speedup|\n", "|--------|----------|--------|----|---|---|---|\n", "|1. **Brute Force Enumeration**|3,364,900|——|9000 microseconds|——|8.5 hours (est.)|——|\n", "|2. **Pangram Lettersets**|55,902|60×|9000 microseconds|——|500 sec (est.)|60×|\n", - "|3. **Points Table**|55,902|——|22 microseconds|400×|1.5 seconds|20,000×|\n", - "|4. **Branch and Bound**|8,084 |7×|22 microseconds|——|0.31 seconds|100,000×|\n", + "|3. **Points Table**|55,902|——|26 microseconds|300×|1.6 seconds|20,000×|\n", + "|4. **Branch and Bound**|8,084 |7×|26 microseconds|——|0.35 seconds|90,000×|\n", "\n" ] }