From cee3abc829b95e7a0890a9e51fcfe96c37ed3245 Mon Sep 17 00:00:00 2001 From: Peter Norvig Date: Sat, 25 Jul 2020 00:09:11 -0700 Subject: [PATCH] Add files via upload --- ipynb/SpellingBee.ipynb | 860 ++++++++++++++++++++++++---------------- 1 file changed, 513 insertions(+), 347 deletions(-) diff --git a/ipynb/SpellingBee.ipynb b/ipynb/SpellingBee.ipynb index eefa54c..03edcf0 100644 --- a/ipynb/SpellingBee.ipynb +++ b/ipynb/SpellingBee.ipynb @@ -10,7 +10,7 @@ "\n", "The [3 Jan. 2020 edition of the 538 Riddler](https://fivethirtyeight.com/features/can-you-solve-the-vexing-vexillology/) concerns the popular NYTimes [Spelling Bee](https://www.nytimes.com/puzzles/spelling-bee) puzzle:\n", "\n", - "> In this game, seven letters are arranged in a **honeycomb lattice**, with one letter in the center. Here’s the lattice from December 24, 2019:\n", + "> In this game, seven letters are arranged in a **honeycomb** lattice, with one letter in the center. Here’s the lattice from Dec. 24, 2019:\n", "> \n", "> \n", "> \n", @@ -19,7 +19,7 @@ "> 2. The word must include the central letter.\n", "> 3. The word cannot include any letter beyond the seven given letters.\n", ">\n", - ">Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.\n", + ">Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. 
Words that use all of the seven letters in the honeycomb are known as \"**pangrams**\" and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.\n", ">\n", "> ***Which seven-letter honeycomb results in the highest possible game score?*** To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.\n", ">\n", @@ -27,7 +27,7 @@ "\n", "\n", "\n", - "Since the referenced [word list](https://norvig.com/ngrams/enable1.txt) came from *my* web site (I didn't make up the list; it is a standard Scrabble word list that I happen to host a copy of), I felt somewhat compelled to solve this one. " + "Since the referenced [word list](https://norvig.com/ngrams/enable1.txt) came from *my* web site, I felt somewhat compelled to solve this one. (Note I didn't make up the word list; it is a standard Scrabble word list that I happen to host a copy of.) 
I'll show you how I address the problem, step by step:" ] }, { @@ -38,10 +38,10 @@ "\n", "Let's start by defining some basics:\n", "\n", - "- A **valid word** is a string of (uppercase) letters, at least 4 letters, with no 's', and not more than 7 distinct letters.\n", + "- A **valid word** is a string of at least 4 letters, with no 'S', and not more than 7 distinct letters.\n", "- A **word list** is, well, a list of words.\n", - "- The **word score** is 1 for a four letter word, or the length of the word plus a bonus of 7 for a pangram.\n", - "- A **pangram** is a word with exactly 7 distinct letters.\n" + "- A **pangram** is a word with exactly 7 distinct letters.\n", + "- The **word score** is 1 for a four letter word, or the length of the word for longer words, plus a bonus of 7 for a pangram.\n" ] }, { @@ -64,18 +64,17 @@ "Word = str # Type for a word\n", "\n", "def valid_words(text) -> List[Word]:\n", - " \"\"\"A list of valid space-separated words in a string. Valid words \n", - " have at least 4 letters, no 'S', and no more than 7 distinct letters.\"\"\"\n", + " \"\"\"Words with at least 4 letters, no 'S', and no more than 7 distinct letters.\"\"\"\n", " return [w for w in text.upper().split() \n", " if len(w) >= 4 and 'S' not in w and len(set(w)) <= 7]\n", "\n", + "def is_pangram(word) -> bool: \n", + " \"\"\"Does a word have 7 distinct letters (some maybe more than once)?\"\"\"\n", + " return len(set(word)) == 7\n", + "\n", "def word_score(word) -> int: \n", " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n", - " return 1 if (len(word) == 4) else len(word) + 7 * is_pangram(word)\n", - "\n", - "def is_pangram(word) -> bool: \n", - " \"\"\"Does a word use all 7 letters (some maybe more than once)?\"\"\"\n", - " return len(set(word)) == 7" + " return 1 if (len(word) == 4) else len(word) + 7 * is_pangram(word)" ] }, { @@ -89,7 +88,18 @@ "cell_type": "code", "execution_count": 3, "metadata": {}, - "outputs": [], + "outputs": [ + { + 
"data": { + "text/plain": [ + "['GAME', 'AMALGAM', 'GLAM', 'MEGAPLEX', 'CACCIATORE', 'EROTICA']" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "mini = valid_words('game amalgam amalgamation glam gem gems em megaplex cacciatore erotica')\n", "mini" @@ -99,35 +109,62 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that `gem` and `em` are too short, `gems` has an `s` which is not allowed, and `amalgamation` has too many distinct letters (8). We're left with six valid words out of the original ten. Here are examples of the other two functions in action:" + "Note that `gem` and `em` are too short, `gems` has an `s` which is not allowed, and `amalgamation` has too many distinct letters (8). We're left with six valid words out of the ten candidate words. Here are examples of the other two functions in action:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'CACCIATORE', 'EROTICA', 'MEGAPLEX'}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "{w: word_score(w) for w in mini}" + "{w for w in mini if is_pangram(w)}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'GAME': 1,\n", + " 'AMALGAM': 7,\n", + " 'GLAM': 1,\n", + " 'MEGAPLEX': 15,\n", + " 'CACCIATORE': 17,\n", + " 'EROTICA': 14}" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "{w for w in mini if is_pangram(w)}" + "{w: word_score(w) for w in mini}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Step 2: Honeycombs, Lettersets, and Game Scores\n", + "# Step 2: Honeycombs and Game Scores\n", "\n", - "A honeycomb lattice can be considered as a **set** of letters. The order of the letters doesn't matter; all that matters is:\n", - " 1. 
The set of seven letters in the honeycomb.\n", + "In a honeycomb the order of the letters doesn't matter; all that matters is:\n", + " 1. The seven distinct letters in the honeycomb.\n", " 2. The one distinguished center letter.\n", " \n", "Thus, we can represent a honeycomb as follows:\n", @@ -138,20 +175,30 @@ "cell_type": "code", "execution_count": 6, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "Honeycomb(letters='AEGLMPX', center='G')" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "Honeycomb = namedtuple('Honeycomb', 'letters, center')" + "Honeycomb = namedtuple('Honeycomb', 'letters, center')\n", + "\n", + "hc = Honeycomb('AEGLMPX', 'G')\n", + "hc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "I will represent a set of letters (from either a honeycomb or a word) as a sorted string. Why not a Python `set` or `frozenset`? Because a string takes up less space in memory, and its printed representation is more succint and easier to read when debugging. Compare:\n", - "- `frozenset({'A', 'C', 'E', 'I', 'O', 'R', 'T'})`\n", - "- `'ACEIORT'`\n", - "\n", - "I'll use the name `Letterset` for the type, and `letterset` for the function that converts a word to a set of letters:" + "The **game score** for a honeycomb is the sum of the word scores for all the words that the honeycomb can make. How do we know if a honeycomb can make a word? It can if (1) the word contains the honeycomb's center and (2) every letter in the word is in the honeycomb. 
" ] }, { @@ -160,7 +207,108 @@ "metadata": {}, "outputs": [], "source": [ - "Letterset = str # Type for sets of letters, like \"AGLM\"\n", + "def game_score(honeycomb, words) -> int:\n", + " \"\"\"The total score for this honeycomb.\"\"\"\n", + " return sum(word_score(w) for w in words if can_make(honeycomb, w))\n", + "\n", + "def can_make(honeycomb, word) -> bool:\n", + " \"\"\"Can the honeycomb make this word?\"\"\"\n", + " letters, center = honeycomb\n", + " return center in word and all(L in letters for L in word)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "24" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "game_score(hc, mini)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'GAME': 1, 'AMALGAM': 7, 'GLAM': 1, 'MEGAPLEX': 15}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{w: word_score(w) for w in mini if can_make(hc, w)}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 3: Best Honeycomb\n", + "\n", + "\n", + "How many possible honeycombs are there? We can put any letter in the center, then any 6 letters around the outside; since there is no 'S', this gives a total of 25 × (24 choose 6) = 3,364,900 possible honeycombs. We could conceivably ask for the game score of every one of them and pick the best; that would probably take hours of computation (not seconds, and not days).\n", + "\n", + "However, a key constraint of the game is that **there must be at least one pangram** in the set of words that a valid honeycomb can make. That means that a valid honeycomb must ***be*** the set of seven letters in one of the pangram words in the word list, with any of the seven letters as the center. 
My approach to find the best (highest scoring) honeycomb is:\n", + "\n", + " * Go through all the words and find all the valid honeycombs: the 7-letter pangram letter sets, with any of the 7 letters as center.\n", + " * Compute the game score for each valid honeycomb and return a honeycomb with maximal game score." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "def best_honeycomb(words) -> Honeycomb: \n", + " \"\"\"Return a honeycomb with highest game score on these words.\"\"\"\n", + " return max(valid_honeycombs(words), \n", + " key=lambda h: game_score(h, words))\n", + "\n", + "def valid_honeycombs(words) -> List[Honeycomb]:\n", + " \"\"\"Valid Honeycombs are the pangram lettersets, with any center.\"\"\"\n", + " pangram_lettersets = {letterset(w) for w in words if is_pangram(w)}\n", + " return [Honeycomb(letters, center) \n", + " for letters in pangram_lettersets \n", + " for center in letters]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I will represent a set of letters as a sorted string of distinct letters. Why not a Python `set` (or `frozenset` if we want it to be the key of a dict)? Because a string takes up less space in memory, and its printed representation is easier to read when debugging. Furthermore, with sets of no more than 7 letters, the time to test for set membership will be small either way. 
Compare:\n", + "- `frozenset({'A', 'E', 'G', 'L', 'M', 'P', 'X'})`\n", + "- `'AEGLMPX'`\n", + "\n", + "I'll use the name `letterset` for the function that converts a word to a set of letters, and `Letterset` for the resulting type:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "Letterset = str # Type for a set of letters, like \"AGLM\"\n", "\n", "def letterset(word) -> Letterset:\n", " \"\"\"The set of letters in a word, represented as a sorted str.\"\"\"\n", @@ -169,17 +317,7 @@ }, { "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "honeycomb = Honeycomb(letterset('AEGLMPX'), 'G')\n", - "honeycomb" - ] - }, - { - "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -193,7 +331,7 @@ " 'EROTICA': 'ACEIORT'}" ] }, - "execution_count": 9, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -206,77 +344,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that 'AMALGAM' and 'GLAM' have the same letterset, as do 'CACCIATORE' and 'EROTICA'. \n", - "\n", - "The game score for a honeycomb is the sum of the word scores for all the words that the honeycomb can make. How do we know if a honeycomb can make a word? It can if (1) the word contains the honeycomb's center and (2) every letter in the word is in the honeycomb. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "def game_score(honeycomb, words) -> int:\n", - " \"\"\"The total score for this honeycomb.\"\"\"\n", - " return sum(word_score(w) for w in words if can_make(honeycomb, w))\n", - "\n", - "def can_make(honeycomb, word) -> bool:\n", - " \"\"\"Can the honeycomb make this word?\"\"\"\n", - " return honeycomb.center in word and all(L in honeycomb.letters for L in word)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "24" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "game_score(honeycomb, mini)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Step 3: Best Honeycomb\n", - "\n", - "This puzzle is different from many other word puzzles because it deals with *unordered sets* of letters, not *ordered permutations* of letters. That makes things easier. When I searched for an optimal 5×5 [Boggle](Boggle.ipynb) board, I couldn't exhaustively try all $26^{(5×5)} \\approx 10^{35}$ possibilites; I could only do hillclimbing to find a local maximum. But for Spelling Bee, it *is* feasible to try every possibility and get a guaranteed highest-scoring honeycomb. \n", - "\n", - "A key constraint of the game is that **there must be at least one pangram** in the set of words that a valid honeycomb can make. That means that every valid honeycomb must ***be*** a pangram letterset of one of the words in the word list. 
So my approach to find the best (highest scoring) honeycomb is:\n", - "\n", - " * Go through all the words and find all the valid honeycombs: the 7-letter pangram lettersets, with any of the 7 letters as center.\n", - " * Compute the game score for each valid honeycomb and return a best one.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "def best_honeycomb(words) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Return (score, honeycomb) for a honeycomb with highest score on these words.\"\"\"\n", - " honeycombs = valid_honeycombs(map(letterset, words))\n", - " return max((game_score(h, words), h) for h in honeycombs)\n", - "\n", - "def valid_honeycombs(lettersets) -> List[Honeycomb]:\n", - " \"\"\"All valid Honeycombs that can be made from these lettersets.\"\"\"\n", - " pangram_lettersets = {s for s in lettersets if len(s) == 7}\n", - " return [Honeycomb(letters, center) \n", - " for letters in pangram_lettersets \n", - " for center in letters]" + "Note that 'AMALGAM' and 'GLAM' have the same letterset, as do 'CACCIATORE' and 'EROTICA'." 
] }, { @@ -287,7 +355,20 @@ { "data": { "text/plain": [ - "(31, Honeycomb(letters='ACEIORT', center='T'))" + "[Honeycomb(letters='ACEIORT', center='A'),\n", + " Honeycomb(letters='ACEIORT', center='C'),\n", + " Honeycomb(letters='ACEIORT', center='E'),\n", + " Honeycomb(letters='ACEIORT', center='I'),\n", + " Honeycomb(letters='ACEIORT', center='O'),\n", + " Honeycomb(letters='ACEIORT', center='R'),\n", + " Honeycomb(letters='ACEIORT', center='T'),\n", + " Honeycomb(letters='AEGLMPX', center='A'),\n", + " Honeycomb(letters='AEGLMPX', center='E'),\n", + " Honeycomb(letters='AEGLMPX', center='G'),\n", + " Honeycomb(letters='AEGLMPX', center='L'),\n", + " Honeycomb(letters='AEGLMPX', center='M'),\n", + " Honeycomb(letters='AEGLMPX', center='P'),\n", + " Honeycomb(letters='AEGLMPX', center='X')]" ] }, "execution_count": 13, @@ -295,6 +376,26 @@ "output_type": "execute_result" } ], + "source": [ + "valid_honeycombs(mini)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Honeycomb(letters='ACEIORT', center='A')" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "best_honeycomb(mini)" ] @@ -312,14 +413,14 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " 172820 enable1.txt\r\n" + " 172820 enable1.txt\n" ] } ], @@ -330,7 +431,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "metadata": {}, "outputs": [ { @@ -339,7 +440,7 @@ "44585" ] }, - "execution_count": 15, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -351,7 +452,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -360,7 +461,7 @@ "14741" ] }, - "execution_count": 16, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } 
@@ -372,7 +473,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -381,49 +482,13 @@ "7986" ] }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "lettersets = {letterset(w) for w in pangrams}\n", - "len(lettersets)" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "55902" - ] - }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "len(valid_honeycombs(lettersets))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "So we have the following counts:\n", - "\n", - "- 172,820 words in the `enable1` word list\n", - "- 44,585 valid Spelling Bee words\n", - "- 14,741 pangram words \n", - "- 7,986 distinct pangram lettersets\n", - "- 55,902 (7 × 7,986) valid honeycombs\n", - "\n", - "How long will it take to run `best_honeycomb(enable1)`? Most of the computation time is in `game_score` (which has to look at all 44,585 words), so let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:" + "len({letterset(w) for w in pangrams}) # pangram lettersets" ] }, { @@ -431,18 +496,10 @@ "execution_count": 19, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 12.6 ms, sys: 260 µs, total: 12.9 ms\n", - "Wall time: 13 ms\n" - ] - }, { "data": { "text/plain": [ - "153" + "55902" ] }, "execution_count": 19, @@ -451,14 +508,22 @@ } ], "source": [ - "%time game_score(honeycomb, enable1)" + "len(valid_honeycombs(enable1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "About 13 milliseconds. How many minutes would that be for all 55,902 valid honeycombs?" 
+ "To summarize, there are:\n", + "\n", + "- 172,820 words in the `enable1` word list\n", + "- 44,585 valid Spelling Bee words\n", + "- 14,741 pangram words \n", + "- 7,986 distinct pangram lettersets\n", + "- 55,902 (7 × 7,986) valid honeycombs\n", + "\n", + "How long will it take to run `best_honeycomb(enable1)`? Most of the computation time is in `game_score` (which has to look at all 44,585 valid words), so let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:" ] }, { @@ -466,10 +531,18 @@ "execution_count": 20, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 11.4 ms, sys: 616 µs, total: 12 ms\n", + "Wall time: 11.9 ms\n" + ] + }, { "data": { "text/plain": [ - "12.1121" + "153" ] }, "execution_count": 20, @@ -478,46 +551,77 @@ } ], "source": [ - ".013 * 55902 / 60" + "%time game_score(hc, enable1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "About 12 minutes. I could run `best_honeycomb(enable1)` right now and take a coffee break until it completes, but I think that a puzzle like this deserves a more elegant solution. I'd like to get the run time under a minute (as is suggested in [Project Euler](https://projecteuler.net/)), and I have an idea how to do it.\n", - "\n", - "# Step 5: Making it Faster\n", - "\n", - "Here's my plan for a more efficient program:\n", - "\n", - "1. Keep the same strategy of trying every pangram letterset, but do some precomputation that will make `game_score` much faster.\n", - "1. The precomputation is: compute the `letterset` and `word_score` for each word, and make a table of `{letterset: total_points}` giving the total number of points that can be made with each letterset. I call this a **points table**.\n", - "3. These calculations are independent of the honeycomb, so they need to be done only once, not 55,902 times. \n", - "4. 
For each valid honeycomb, pass it and the points table to `game_score2` (we changed the name because the interface has changed). In `game_score2`, generate every valid **subset** of the letters in the honeycomb. A valid subset must include the center letter, and it may or may not include each of the other 6 letters, so there are exactly $2^6 = 64$ valid subsets. The function `letter_subsets(honeycomb)` computes these. \n", - "5. To compute `game_score`, just take the sum of the 64 subset entries in the points table.\n", - "\n", - "\n", - "That means that in `game_score` we no longer need to iterate over 44,585 words and check if each word is a subset of the honeycomb. Instead we iterate over the 64 subsets of the honeycomb and for each one check—in one table lookup—whether it is a word (or more than word) and how many total points those word(s) score. Since 64 < 44,585, that's a nice optimization!\n", - "\n", - "\n", - "Here's the code." + "About 11 milliseconds on my computer (this may vary). How many minutes would it be to run `game_score` for all 55,902 valid honeycombs?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10.248699999999998" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + ".011 * 55902 / 60" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "About 10 minutes. I could run `best_honeycomb(enable1)` right now and take a coffee break until it completes, but I think that a puzzle like this deserves a more elegant solution. I'd like to get the run time under a minute (as is suggested in [Project Euler](https://projecteuler.net/)), and I have an idea how to do it.\n", + "\n", + "# Step 5: Faster Algorithm: Points Table\n", + "\n", + "Here's my plan for a more efficient program:\n", + "\n", + "1. 
Keep the same strategy of trying every pangram letterset, but do some precomputation that will make `game_score` much faster.\n", + "1. The precomputation is: compute the `letterset` and `word_score` for each word, and make a table of `{letterset: total_points}` giving the total number of word score points for all the words that correspond to each letterset. I call this a **points table**.\n", + "3. These calculations are independent of the honeycomb, so they need to be done only once, not 55,902 times. \n", + "4. `game_score2` (the name is changed because the interface has changed) takes a honeycomb and a points table as input. The idea is that every word that the honeycomb can make must have a letterset that is the same as a valid **letter subset** of the honeycomb. A valid letter subset must include the center letter, and it may or may not include each of the other 6 letters, so there are exactly $2^6 = 64$ valid letter subsets. (The function `letter_subsets(honeycomb)` computes these.)\n", + "The result of `game_score2` is the sum of the honeycomb's 64 letter subset entries in the points table.\n", + "\n", + "\n", + "That means that in `game_score2` we no longer need to iterate over 44,585 words and check if each word is a subset of the honeycomb. Instead we iterate over the 64 subsets of the honeycomb and for each one check, in one table lookup, whether it is the letterset of a word (or of more than one word) and how many total points those word(s) score. 
Since 64 < 44,585, that's a nice optimization!\n", + "\n", + "\n", + "Here's the code:" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, "outputs": [], "source": [ - "def best_honeycomb(words) -> Tuple[int, Honeycomb]: \n", - " \"\"\"Return (score, honeycomb) for the honeycomb with highest score on these words.\"\"\"\n", - " pts_table = points_table(words)\n", - " return max((game_score2(h, pts_table), h)\n", - " for h in valid_honeycombs(pts_table))\n", + "PointsTable = Dict[Letterset, int]\n", "\n", - "def points_table(words) -> Dict[Letterset, int]:\n", - " \"\"\"Return a dict of {letterset: points} from words.\"\"\"\n", + "def best_honeycomb(words) -> Honeycomb: \n", + " \"\"\"Return a honeycomb with highest game score on these words.\"\"\"\n", + " points_table = tabulate_points(words)\n", + " honeycombs = (Honeycomb(letters, center) \n", + " for letters in points_table if len(letters) == 7 \n", + " for center in letters)\n", + " return max(honeycombs, key=lambda h: game_score2(h, points_table))\n", + "\n", + "def tabulate_points(words) -> PointsTable:\n", + " \"\"\"Return a Counter of {letterset: points} from words.\"\"\"\n", " table = Counter()\n", " for w in words:\n", " table[letterset(w)] += word_score(w)\n", @@ -525,41 +629,23 @@ "\n", "def letter_subsets(honeycomb) -> List[Letterset]:\n", " \"\"\"The 64 subsets of the letters in the honeycomb, always including the center letter.\"\"\"\n", - " return [''.join(subset) \n", + " return [letters \n", " for n in range(1, 8) \n", - " for subset in combinations(honeycomb.letters, n)\n", - " if honeycomb.center in subset]\n", + " for letters in map(''.join, combinations(honeycomb.letters, n))\n", + " if honeycomb.center in letters]\n", "\n", - "def game_score2(honeycomb, pts_table) -> int:\n", + "def game_score2(honeycomb, points_table) -> int:\n", " \"\"\"The total score for this honeycomb, given a points_table.\"\"\"\n", - " return sum(pts_table[s] for s in 
letter_subsets(honeycomb))" + " return sum(points_table[letterset] for letterset in letter_subsets(honeycomb))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's get a feel for how this works. First the `letter_subsets` (a 4-letter honeycomb makes $2^3 = 8$ subsets; 7-letter honeycombs make 64):" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['C', 'AC', 'BC', 'CD', 'ABC', 'ACD', 'BCD', 'ABCD']" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "letter_subsets(Honeycomb('ABCD', 'C')) " + "Let's get a feel for how this works. \n", + "\n", + "First `letter_subsets` (a 4-letter honeycomb makes $2^3 = 8$ subsets; 7-letter honeycombs make 64):" ] }, { @@ -570,7 +656,7 @@ { "data": { "text/plain": [ - "['GAME', 'AMALGAM', 'GLAM', 'MEGAPLEX', 'CACCIATORE', 'EROTICA']" + "['C', 'AC', 'BC', 'CD', 'ABC', 'ACD', 'BCD', 'ABCD']" ] }, "execution_count": 23, @@ -578,6 +664,26 @@ "output_type": "execute_result" } ], + "source": [ + "letter_subsets(Honeycomb('ABCD', 'C')) " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['GAME', 'AMALGAM', 'GLAM', 'MEGAPLEX', 'CACCIATORE', 'EROTICA']" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "mini # Remind me again what the mini word list is?" 
] @@ -586,12 +692,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now the `points_table`:" + "Now `tabulate_points`:" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -600,13 +706,13 @@ "Counter({'AEGM': 1, 'AGLM': 8, 'AEGLMPX': 15, 'ACEIORT': 31})" ] }, - "execution_count": 24, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "points_table(mini)" + "tabulate_points(mini)" ] }, { @@ -620,12 +726,12 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ - "assert game_score2(honeycomb, points_table(mini)) == 24\n", - "assert best_honeycomb(mini) == (31, Honeycomb('ACEIORT', 'T'))" + "assert game_score2(hc, tabulate_points(mini)) == 24\n", + "assert best_honeycomb(mini).letters == 'ACEIORT'" ] }, { @@ -644,24 +750,129 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 2.2 s, sys: 6.33 ms, total: 2.21 s\n", - "Wall time: 2.21 s\n" + "CPU times: user 2.04 s, sys: 3.14 ms, total: 2.05 s\n", + "Wall time: 2.05 s\n" ] }, { "data": { "text/plain": [ - "(3898, Honeycomb(letters='AEGINRT', center='R'))" + "(Honeycomb(letters='AEGINRT', center='R'), 3898)" ] }, - "execution_count": 26, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%time best = best_honeycomb(enable1)\n", + "\n", + "best, game_score(best, enable1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Wow! 3898 is a high score!** \n", + "\n", + "And it took only 2 seconds of computation to find the best honeycomb, not 10 minutes!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 7: Even Faster Algorithm: Branch and Bound\n", + "\n", + "A run time of 2 seconds is pretty good! 
But what if the word list were 100 times bigger? What if a honeycomb could have 14 letters, not just 7? We might still be looking for ideas to speed up the computation. I happen to have one.\n", + "\n", + "Consider the word 'EQUIVOKE'. It is a pangram, but what with the 'Q' and 'V' and 'K', it is not a high-scoring honeycomb:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "48" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "game_score(Honeycomb('EIKOQUV', 'E'), enable1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "None of the other six center letters does any better. It would be great if we could eliminate all seven of these honeycombs at once, rather than trying each one in turn. So my idea is to:\n", + "- Keep track of the best honeycomb and best score found so far.\n", + "- For each new pangram letterset, ask \"if we weren't required to use the center letter, would this letterset score higher than the best honeycomb so far?\" \n", + "- If yes, then try it with all seven centers; if not then discard it immediately.\n", + "- This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: if an **upper bound** of the new letterset's score can't beat the best honeycomb so far, then we prune a whole **branch** of the search tree consisting of the seven honeycombs that have that letterset.\n", + "\n", + "What would the score of a letterset be if we weren't required to use the center letter? It turns out I can make a dummy Honeycomb and specify the empty string for the center, `Honeycomb(letters, '')`, and call `game_score2` on that. 
This works because of a quirk of Python: we ask if `honeycomb.center in letters`; normally in Python the expression `x in y` means \"is `x` a member of the collection `y`\", but when `y` is a string it means \"is `x` a substring of `y`\", and the empty string is a substring of every string. (If I had represented a letterset as a Python `set`, this wouldn't work.)\n", + "\n", + "Thus, I can rewrite `best_honeycomb` as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "def best_honeycomb(words) -> Honeycomb: \n", + " \"\"\"Return a honeycomb with highest game score on these words.\"\"\"\n", + " points_table = tabulate_points(words)\n", + " best, best_score = None, 0\n", + " for letters in points_table:\n", + " if len(letters) == 7 and game_score2(Honeycomb(letters, ''), points_table) > best_score:\n", + " for center in letters:\n", + " honeycomb = Honeycomb(letters, center)\n", + " score = game_score2(honeycomb, points_table)\n", + " if score > best_score:\n", + " best, best_score = honeycomb, score\n", + " return best" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 459 ms, sys: 2.7 ms, total: 461 ms\n", + "Wall time: 461 ms\n" + ] + }, + { + "data": { + "text/plain": [ + "Honeycomb(letters='AEGINRT', center='R')" + ] + }, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } @@ -674,15 +885,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**Wow! 
3898 is a high score!** And it took only 2 seconds of computation to find it!\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Curiosity\n", + "Same honeycomb for the answer, but four times faster: less than half a second.\n", + "\n", + "# Step 8: Curiosity\n", "\n", "I'm curious about a bunch of things.\n", "\n", @@ -691,7 +896,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 31, "metadata": {}, "outputs": [ { @@ -700,7 +905,7 @@ "'ANTITOTALITARIAN'" ] }, - "execution_count": 27, + "execution_count": 31, "metadata": {}, "output_type": "execute_result" } @@ -718,36 +923,51 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['AARDWOLF',\n", + " 'ANCIENTER',\n", " 'BABBLEMENT',\n", + " 'BIVARIATE',\n", " 'CABEZON',\n", + " 'CHEERFUL',\n", " 'COLLOGUING',\n", + " 'CRANKLE',\n", " 'DEMERGERING',\n", + " 'DWELLING',\n", " 'ETYMOLOGY',\n", + " 'FLATTING',\n", " 'GARROTTING',\n", + " 'HANDIER',\n", " 'IDENTIFY',\n", + " 'INTERVIEWER',\n", " 'LARVICIDAL',\n", + " 'MANDRAGORA',\n", " 'MORTGAGEE',\n", + " 'NOTABLE',\n", " 'OVERHELD',\n", + " 'PERONEAL',\n", " 'PRAWNED',\n", + " 'QUILTER',\n", " 'REINITIATED',\n", + " 'TABLEFUL',\n", " 'TOWHEAD',\n", - " 'UTOPIAN']" + " 'UNCHURCHLY',\n", + " 'UTOPIAN',\n", + " 'WINDAGE']" ] }, - "execution_count": 28, + "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "pangrams[::1000] # Every thousandth one" + "pangrams[::500] # Every five-hundredth pangram" ] }, { @@ -759,27 +979,24 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[('has an S', 103913),\n", - " ('valid', 44585),\n", - " ('more than 7 distinct letters', 23400),\n", - " ('less than 4 letters', 922)]" + "[('has S', 103913), ('valid', 44585), ('> 7', 23400), ('< 4', 922)]" ] }, - "execution_count": 
29, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "Counter('has an S' if 'S' in w else \n", - " 'less than 4 letters' if len(w) < 4 else \n", - " 'more than 7 distinct letters' if len(set(w)) > 7 else \n", + "Counter('has S' if 'S' in w else \n", + " '< 4' if len(w) < 4 else \n", + " '> 7' if len(set(w)) > 7 else \n", " 'valid'\n", " for w in open('enable1.txt').read().upper().split()).most_common()" ] @@ -790,12 +1007,12 @@ "source": [ "There are more than twice as many words with an 'S' as there are valid words.\n", "\n", - "* About the `points_table`: How many different letter subsets are there? " + "* About the points table: How many different letter subsets are there? " ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 34, "metadata": {}, "outputs": [ { @@ -804,13 +1021,13 @@ "21661" ] }, - "execution_count": 30, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "pts = points_table(enable1)\n", + "pts = tabulate_points(enable1)\n", "len(pts)" ] }, @@ -825,7 +1042,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -843,7 +1060,7 @@ " ('ACDEIRT', 307)]" ] }, - "execution_count": 31, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } @@ -856,82 +1073,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The best honeycomb is also the highest scoring letter subset on its own (although it only gets 832 of the 3,898 total points from using all seven letters).\n", - "\n", - "* Which letter subsets score the least points?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('ANYZ', 1),\n", - " ('BEUZ', 1),\n", - " ('EINZ', 1),\n", - " ('EKRZ', 1),\n", - " ('ILZ', 1),\n", - " ('CIOZ', 1),\n", - " ('KNOZ', 1),\n", - " ('NOZ', 1),\n", - " ('IORZ', 1),\n", - " ('EMYZ', 1)]" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pts.most_common()[-10:]" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "824" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sum(v == 1 for v in pts.values()) # How many letter subsets score 1 point?" + "The best honeycomb is also the highest scoring letter subset on its own (although it only gets 832 of the 3,898 total points from using all seven letters)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "There are 824 letter subsets that only appear in one four-letter word, for one point." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Fancy Report\n", + "# Step 9: Fancy Report\n", "\n", - "I'd like to see the actual words that each honeycomb can make, in addition to the total score, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this turned out to be a lot more complicated than I expected. I guess it is difficult to create a practical extraction and reporting tool. I feel you, Larry Wall." + "I'd like to see the actual words that each honeycomb can make, in addition to the total score, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. 
I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this turned out to be a lot more complicated than I expected. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)." ] }, { "cell_type": "code", - "execution_count": 34, - "metadata": { - "scrolled": false - }, + "execution_count": 36, + "metadata": {}, "outputs": [], "source": [ "from textwrap import fill\n", @@ -941,7 +1098,7 @@ " for the best honeycomb if no honeycomb is given) over the given word list.\"\"\"\n", " bins = group_by(words, letterset)\n", " adj = (\"given\" if honeycomb else \"optimal\")\n", - " honeycomb = honeycomb or best_honeycomb(words)[1]\n", + " honeycomb = honeycomb or best_honeycomb(words)\n", " points = game_score(honeycomb, words)\n", " subsets = letter_subsets(honeycomb)\n", " nwords = sum(len(bins[s]) for s in subsets)\n", @@ -972,7 +1129,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 37, "metadata": {}, "outputs": [ { @@ -992,15 +1149,13 @@ } ], "source": [ - "report(mini, honeycomb)" + "report(mini, hc)" ] }, { "cell_type": "code", - "execution_count": 36, - "metadata": { - "scrolled": false - }, + "execution_count": 38, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -1174,14 +1329,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# S Words\n", + "# Step 10: S Words\n", "\n", "What if we allowed honeycombs and words to have an 'S' in them?" 
] }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 39, "metadata": {}, "outputs": [ { @@ -1190,7 +1345,7 @@ "(98141, 44585)" ] }, - "execution_count": 37, + "execution_count": 39, "metadata": {}, "output_type": "execute_result" } @@ -1211,10 +1366,8 @@ }, { "cell_type": "code", - "execution_count": 38, - "metadata": { - "scrolled": false - }, + "execution_count": 40, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -1503,13 +1656,26 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Yes it does!\n", + "Yes it does (roughly) double the score!\n", + "\n", + "# Summary\n", + "\n", + "Thanks to a series of ideas, we were able to achieve a substantial reduction in the number of honeycombs that need to be examined (a factor of 400), the run time needed for `game_score` (a factor of about 200), and the overall run time (a factor of about 70,000).\n", + "\n", + "- **Enumeration (10 hours (estimate) run time; 3,364,900 honeycombs)**
Try every possible honeycomb.\n", + "- **Pangram Lettersets (11 minutes (estimate) run time; 55,902 honeycombs)**
Try just the honeycombs that are pangram lettersets (with every center).\n", + "- **Points Table (2 seconds run time; 55,902 honeycombs)**
Precompute the score for each letterset, and sum the 64 letter subsets of each honeycomb.\n", + "- **Branch and Bound (1/2 second run time; 8,084 honeycombs)**
Try every center only for lettersets that score better than the best score so far.\n", + "\n", "\n", - "# Pictures\n", "\n", "Here are pictures for the highest-scoring honeycombs, with and without an S:\n", "\n", - "" + "\n", + "
\n", + " 537 words; 3,898 points         1,179 words; 8,681 points\n", + "
\n", + "
" ] } ], @@ -1533,5 +1699,5 @@ } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 }
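The branch-and-bound idea from Step 7 can be tried on its own, outside the notebook. What follows is a minimal sketch, not the notebook's code: it uses plain `(letters, center)` tuples instead of the `Honeycomb` class, and the helper names `score` and `letter_subsets`, the `tried` counter, and the six-word list are assumptions made for illustration. It leans on the same quirk described above: `'' in s` is true for every string `s`, so an empty center gives the upper bound for a letterset.

```python
from collections import Counter
from itertools import combinations

def letterset(word):
    """The distinct letters of a word, sorted, as a string."""
    return ''.join(sorted(set(word)))

def word_score(word):
    """1 point for a 4-letter word; otherwise 1 per letter, plus 7 for a pangram."""
    if len(word) == 4:
        return 1
    return len(word) + 7 * (len(set(word)) == 7)

def tabulate_points(words):
    """Map each letterset to the total points of the words that have it."""
    table = Counter()
    for w in words:
        table[letterset(w)] += word_score(w)
    return table

def letter_subsets(letters):
    """All non-empty subsets of a sorted letter string, as sorted strings."""
    return [''.join(s) for n in range(1, len(letters) + 1)
            for s in combinations(letters, n)]

def score(letters, center, table):
    """Points for subsets containing center; center='' accepts every subset."""
    return sum(table[s] for s in letter_subsets(letters) if center in s)

def best_honeycomb(words):
    """Return ((letters, center), score, number of honeycombs actually tried)."""
    table = tabulate_points(words)
    best, best_score, tried = None, 0, 0
    for letters in table:
        if len(letters) != 7:
            continue  # only pangram lettersets can be honeycombs
        if score(letters, '', table) <= best_score:
            continue  # upper bound cannot win: prune all seven centers
        for center in letters:
            tried += 1
            s = score(letters, center, table)
            if s > best_score:
                best, best_score = (letters, center), s
    return best, best_score, tried

# Illustrative word list (hypothetical input; assumed already filtered to
# valid words: no 'S', at least 4 letters, at most 7 distinct letters).
words = ['CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'AMALGAM', 'MEGAPLEX']
print(best_honeycomb(words))
```

With this word order the 'ACEIORT' letterset is seen first, so the 'AEGLMPX' letterset is discarded after a single upper-bound check rather than seven separate scorings; with a different order the pruning may not trigger, which is the usual caveat for branch and bound.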