Add files via upload

Peter Norvig 2021-02-14 17:38:04 -08:00 committed by GitHub
parent 4108cba6c8
commit 6ccd7558fb
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -14,7 +14,7 @@
"> \n", "> \n",
"> <img src=\"https://fivethirtyeight.com/wp-content/uploads/2020/01/Screen-Shot-2019-12-24-at-5.46.55-PM.png?w=1136\" width=\"150\">\n", "> <img src=\"https://fivethirtyeight.com/wp-content/uploads/2020/01/Screen-Shot-2019-12-24-at-5.46.55-PM.png?w=1136\" width=\"150\">\n",
"> \n", "> \n",
"> The goal is to identify as many words that meet the following criteria:\n", "> The goal is to identify as many words as possible that meet the following criteria:\n",
"> 1. The word must be at least four letters long.\n", "> 1. The word must be at least four letters long.\n",
"> 2. The word must include the central letter.\n", "> 2. The word must include the central letter.\n",
"> 3. The word cannot include any letter beyond the seven given letters.\n", "> 3. The word cannot include any letter beyond the seven given letters.\n",
@ -68,13 +68,13 @@
" return [w for w in text.upper().split() \n", " return [w for w in text.upper().split() \n",
" if len(w) >= 4 and 'S' not in w and len(set(w)) <= 7]\n", " if len(w) >= 4 and 'S' not in w and len(set(w)) <= 7]\n",
"\n", "\n",
"def is_pangram(word) -> bool: \n", "def pangram_bonus(word) -> int: \n",
" \"\"\"Does a word have 7 distinct letters (some maybe more than once)?\"\"\"\n", " \"\"\"Does a word get a bonus for having 7 distinct letters (some maybe more than once)?\"\"\"\n",
" return len(set(word)) == 7\n", " return 7 if len(set(word)) == 7 else 0\n",
"\n", "\n",
"def word_score(word) -> int: \n", "def word_score(word) -> int: \n",
" \"\"\"The points for this word, including bonus for pangram.\"\"\"\n", " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n",
" return 1 if (len(word) == 4) else len(word) + 7 * is_pangram(word)" " return 1 if len(word) == 4 else len(word) + pangram_bonus(word)"
] ]
}, },
{ {
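Review note on the hunk above: `is_pangram` (returning `bool`) becomes `pangram_bonus` (returning `int`), and the call sites later in the diff keep using it as a filter. That still works because `7` is truthy and `0` is falsy. The sketch below copies the new-side definitions and checks them on a few sample words; the `sample` list and the `scores` and `pangrams` names are hypothetical and chosen only for illustration.

```python
def pangram_bonus(word) -> int:
    """7 bonus points if the word has exactly 7 distinct letters, else 0."""
    return 7 if len(set(word)) == 7 else 0

def word_score(word) -> int:
    """1 point for a 4-letter word; otherwise 1 point per letter plus any pangram bonus."""
    # The conditional expression binds loosest, so the bonus applies only to the else branch.
    return 1 if len(word) == 4 else len(word) + pangram_bonus(word)

# Hypothetical sample words (not read from the notebook).
sample = ['GLAM', 'AMALGAM', 'EROTICA', 'CACCIATORE']

scores = {w: word_score(w) for w in sample}

# pangram_bonus doubles as a predicate: 7 is truthy, 0 is falsy.
pangrams = [w for w in sample if pangram_bonus(w)]
```

These values agree with the notebook text further down the diff, which credits 7 points to AMALGAM, 1 to GLAM, 17 to CACCIATORE and 14 to EROTICA. Note that `word_score` assumes the word has already passed the length filter in `word_list`.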
@ -129,7 +129,7 @@
} }
], ],
"source": [ "source": [
"{w for w in mini if is_pangram(w)}" "{w for w in mini if pangram_bonus(w)}"
] ]
}, },
{ {
@ -285,7 +285,7 @@
"\n", "\n",
"def valid_honeycombs(words) -> List[Honeycomb]:\n", "def valid_honeycombs(words) -> List[Honeycomb]:\n",
" \"\"\"Valid Honeycombs are the pangram lettersets, with any center.\"\"\"\n", " \"\"\"Valid Honeycombs are the pangram lettersets, with any center.\"\"\"\n",
" pangram_lettersets = {letterset(w) for w in words if is_pangram(w)}\n", " pangram_lettersets = {letterset(w) for w in words if pangram_bonus(w)}\n",
" return [Honeycomb(letters, center) \n", " return [Honeycomb(letters, center) \n",
" for letters in pangram_lettersets \n", " for letters in pangram_lettersets \n",
" for center in letters]" " for center in letters]"
@ -295,7 +295,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"I will represent a set of letters as a sorted string of distinct letters. Why not a Python `set` (or `frozenset` if we want it to be the key of a dict)? Because a string takes up less space in memory, and its printed representation is easier to read when debugging. Furthermore, with sets no more than 7 letters, the time to test for set membership will be small either way. Compare:\n", "I will represent a set of letters as a sorted string of distinct letters. Why not a Python `set` (or `frozenset` if we want it to be the key of a dict)? Because a string takes up less space in memory, and its printed representation is easier to read when debugging. Compare:\n",
"- `frozenset({'A', 'E', 'G', 'L', 'M', 'P', 'X'})`\n", "- `frozenset({'A', 'E', 'G', 'L', 'M', 'P', 'X'})`\n",
"- `'AEGLMPX'`\n", "- `'AEGLMPX'`\n",
"\n", "\n",
@ -355,20 +355,20 @@
{ {
"data": { "data": {
"text/plain": [ "text/plain": [
"[Honeycomb(letters='ACEIORT', center='A'),\n", "[Honeycomb(letters='AEGLMPX', center='A'),\n",
" Honeycomb(letters='ACEIORT', center='C'),\n",
" Honeycomb(letters='ACEIORT', center='E'),\n",
" Honeycomb(letters='ACEIORT', center='I'),\n",
" Honeycomb(letters='ACEIORT', center='O'),\n",
" Honeycomb(letters='ACEIORT', center='R'),\n",
" Honeycomb(letters='ACEIORT', center='T'),\n",
" Honeycomb(letters='AEGLMPX', center='A'),\n",
" Honeycomb(letters='AEGLMPX', center='E'),\n", " Honeycomb(letters='AEGLMPX', center='E'),\n",
" Honeycomb(letters='AEGLMPX', center='G'),\n", " Honeycomb(letters='AEGLMPX', center='G'),\n",
" Honeycomb(letters='AEGLMPX', center='L'),\n", " Honeycomb(letters='AEGLMPX', center='L'),\n",
" Honeycomb(letters='AEGLMPX', center='M'),\n", " Honeycomb(letters='AEGLMPX', center='M'),\n",
" Honeycomb(letters='AEGLMPX', center='P'),\n", " Honeycomb(letters='AEGLMPX', center='P'),\n",
" Honeycomb(letters='AEGLMPX', center='X')]" " Honeycomb(letters='AEGLMPX', center='X'),\n",
" Honeycomb(letters='ACEIORT', center='A'),\n",
" Honeycomb(letters='ACEIORT', center='C'),\n",
" Honeycomb(letters='ACEIORT', center='E'),\n",
" Honeycomb(letters='ACEIORT', center='I'),\n",
" Honeycomb(letters='ACEIORT', center='O'),\n",
" Honeycomb(letters='ACEIORT', center='R'),\n",
" Honeycomb(letters='ACEIORT', center='T')]"
] ]
}, },
"execution_count": 13, "execution_count": 13,
@ -420,7 +420,7 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
" 172820 enable1.txt\n" " 172820 enable1.txt\r\n"
] ]
} }
], ],
@ -467,7 +467,7 @@
} }
], ],
"source": [ "source": [
"pangrams = [w for w in enable1 if is_pangram(w)]\n", "pangrams = [w for w in enable1 if pangram_bonus(w)]\n",
"len(pangrams)" "len(pangrams)"
] ]
}, },
@ -535,8 +535,8 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"CPU times: user 11.4 ms, sys: 616 µs, total: 12 ms\n", "CPU times: user 9.11 ms, sys: 204 µs, total: 9.32 ms\n",
"Wall time: 11.9 ms\n" "Wall time: 9.12 ms\n"
] ]
}, },
{ {
@ -558,7 +558,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"About 11 milliseconds on my computer (this may vary). How many minutes would it be to run `game_score` for all 55,902 valid honeycombs?" "About 10 milliseconds on my computer (this may vary). How many minutes would it be to run `game_score` for all 55,902 valid honeycombs?"
] ]
}, },
{ {
@ -569,7 +569,7 @@
{ {
"data": { "data": {
"text/plain": [ "text/plain": [
"10.248699999999998" "9.317"
] ]
}, },
"execution_count": 21, "execution_count": 21,
@ -578,14 +578,14 @@
} }
], ],
"source": [ "source": [
".011 * 55902 / 60" ".01 * 55902 / 60"
] ]
}, },
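Review note: the estimate in this cell is simply per-call time multiplied by the number of honeycombs, converted to minutes. A small sketch of that arithmetic for both sides of the diff follows; the variable and function names are hypothetical, and the per-call times are the rounded figures quoted in the notebook text, not new measurements.

```python
VALID_HONEYCOMBS = 55_902          # count quoted in the notebook text

def estimated_minutes(seconds_per_call: float, n_honeycombs: int = VALID_HONEYCOMBS) -> float:
    """Total run time in minutes if every honeycomb costs seconds_per_call."""
    return seconds_per_call * n_honeycombs / 60

old_estimate = estimated_minutes(0.011)   # old side: "About 11 milliseconds"
new_estimate = estimated_minutes(0.01)    # new side: "About 10 milliseconds"
```

Both values land between 9 and 11 minutes, which is consistent with the "About 9 or 10 minutes" wording on the new side.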
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"About 10 minutes. I could run `best_honeycomb(enable1)` right now and take a coffee break until it completes, but I think that a puzzle like this deserves a more elegant solution. I'd like to get the run time under a minute (as is suggested in [Project Euler](https://projecteuler.net/)), and I have an idea how to do it.\n", "About 9 or 10 minutes. I could run `best_honeycomb(enable1)` right now and take a coffee break until it completes, but I think that a puzzle like this deserves a more elegant solution. I'd like to get the run time under a minute (as is suggested in [Project Euler](https://projecteuler.net/)), and I have an idea how to do it.\n",
"\n", "\n",
"# Step 5: Faster Algorithm: Points Table\n", "# Step 5: Faster Algorithm: Points Table\n",
"\n", "\n",
@ -635,7 +635,7 @@
" if honeycomb.center in letters]\n", " if honeycomb.center in letters]\n",
"\n", "\n",
"def game_score2(honeycomb, points_table) -> int:\n", "def game_score2(honeycomb, points_table) -> int:\n",
" \"\"\"The total score for this honeycomb, given a points_table.\"\"\"\n", " \"\"\"The total score for this honeycomb, using a points table.\"\"\"\n",
" return sum(points_table[letterset] for letterset in letter_subsets(honeycomb))" " return sum(points_table[letterset] for letterset in letter_subsets(honeycomb))"
] ]
}, },
@ -721,7 +721,7 @@
"source": [ "source": [
"The letterset `'AGLM'` gets 8 points, 7 for AMALGAM and 1 for GLAM. `'ACEIORT'` gets 31 points, 17 for CACCIATORE and 14 for EROTICA. The other lettersets represent one word each. \n", "The letterset `'AGLM'` gets 8 points, 7 for AMALGAM and 1 for GLAM. `'ACEIORT'` gets 31 points, 17 for CACCIATORE and 14 for EROTICA. The other lettersets represent one word each. \n",
"\n", "\n",
"Let's make sure we haven't broken the `game_score` and `best_honeycomb` functions:" "Let's make sure we haven't broken the `best_honeycomb` function:"
] ]
}, },
{ {
@ -730,7 +730,6 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"assert game_score2(hc, tabulate_points(mini)) == 24\n",
"assert best_honeycomb(mini).letters == 'ACEIORT'" "assert best_honeycomb(mini).letters == 'ACEIORT'"
] ]
}, },
@ -757,8 +756,8 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"CPU times: user 2.04 s, sys: 3.14 ms, total: 2.05 s\n", "CPU times: user 1.76 s, sys: 1.59 ms, total: 1.76 s\n",
"Wall time: 2.05 s\n" "Wall time: 1.76 s\n"
] ]
}, },
{ {
@ -784,7 +783,7 @@
"source": [ "source": [
"**Wow! 3898 is a high score!** \n", "**Wow! 3898 is a high score!** \n",
"\n", "\n",
"And it took only 2 seconds of computation to find the best honeycomb, not 10 minutes!" "And it took less than 2 seconds of computation to find the best honeycomb!"
] ]
}, },
{ {
@ -793,9 +792,9 @@
"source": [ "source": [
"# Step 7: Even Faster Algorithm: Branch and Bound\n", "# Step 7: Even Faster Algorithm: Branch and Bound\n",
"\n", "\n",
"A run time of 2 seconds is pretty good! But what if the word list were 100 times bigger? What if a honeycomb could have 14 letters, not just 7? We might still be looking for ideas to speed up the computation. I happen to have one.\n", "A run time of 2 seconds is pretty good! But what if the word list were 100 times bigger? What if a honeycomb had 12 letters around the outside, not just 6? We might still be looking for ideas to speed up the computation. I happen to have one.\n",
"\n", "\n",
"Consider the word 'EQUIVOKE'. It is a pangram, but what with the 'Q' and 'V' and 'K', it is not a high-scoring honeycomb:" "Consider the word 'EQUIVOKE'. It is a pangram, but what with the 'Q' and 'V' and 'K', it is not a high-scoring honeycomb, regardless of what center is used:"
] ]
}, },
{ {
@ -806,7 +805,7 @@
{ {
"data": { "data": {
"text/plain": [ "text/plain": [
"48" "{'E': 48, 'Q': 29, 'U': 29, 'I': 32, 'V': 35, 'O': 36, 'K': 34}"
] ]
}, },
"execution_count": 28, "execution_count": 28,
@ -815,20 +814,21 @@
} }
], ],
"source": [ "source": [
"game_score(Honeycomb('EIKOQUV', 'E'), enable1)" "{C: game_score(Honeycomb('EIKOQUV', C), enable1)\n",
" for C in 'EQUIVOK'}"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"None of the other six center letters does any better. It would be great if we could eliminate all seven of these honeycombs at once, rather than trying each one in turn. So my idea is to:\n", "It would be great if we could eliminate all seven of these honeycombs at once, rather than trying each one in turn. So my idea is to:\n",
"- Keep track of the best honeycomb and best score found so far.\n", "- Keep track of the best honeycomb and best score found so far.\n",
"- For each new pangram letterset, ask \"if we weren't required to use the center letter, would this letterset score higher than the best honeycomb so far?\" \n", "- For each new pangram letterset, ask \"if we weren't required to use the center letter, would this letterset score higher than the best honeycomb so far?\" \n",
"- If yes, then try it with all seven centers; if not then discard it immediately.\n", "- If yes, then try it with all seven centers; if not then discard it immediately.\n",
"- This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: if an **upper bound** of the new letterset's score can't beat the best honeycomb so far, then we prune a whole **branch** of the search tree consisting of the seven honeycombs that have that letterset.\n", "- This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: if an **upper bound** of the new letterset's score can't beat the best honeycomb so far, then we prune a whole **branch** of the search tree consisting of the seven honeycombs that have that letterset.\n",
"\n", "\n",
"What wpuld the score of a letterset be if we weren't required to use the center letter? It turns out I can make a dummy Honeycomb and specify the empty string for the center, `Honeycomb(letters, '')`, and call `game_score2` on that. This works because of a quirk of Python: we ask if `honeycomb.center in letters`; normally in Python the expression `x in y` means \"is `x` a member of the collection `y`\", but when `y` is a string it means \"is `x` a substring of `y`\", and the empty string is a substring of every string. (If I had represented a letterset as a Python `set`, this wouldn't work.)\n", "What would the score of a letterset be if we weren't required to use the center letter? It turns out I can make a dummy Honeycomb and specify the empty string for the center, `Honeycomb(letters, '')`, and call `game_score2` on that. This works because of a quirk of Python: we ask if `honeycomb.center in letters`; normally in Python the expression `x in y` means \"is `x` a member of the collection `y`\", but when `y` is a string it means \"is `x` a substring of `y`\", and the empty string is a substring of every string. (If I had represented a letterset as a Python `set`, this wouldn't work.)\n",
"\n", "\n",
"Thus, I can rewrite `best_honeycomb` as follows:" "Thus, I can rewrite `best_honeycomb` as follows:"
] ]
@ -843,10 +843,11 @@
" \"\"\"Return a honeycomb with highest game score on these words.\"\"\"\n", " \"\"\"Return a honeycomb with highest game score on these words.\"\"\"\n",
" points_table = tabulate_points(words)\n", " points_table = tabulate_points(words)\n",
" best, best_score = None, 0\n", " best, best_score = None, 0\n",
" for letters in points_table:\n", " pangrams = (s for s in points_table if len(s) == 7)\n",
" if len(letters) == 7 and game_score2(Honeycomb(letters, ''), points_table) > best_score:\n", " for p in pangrams:\n",
" for center in letters:\n", " if game_score2(Honeycomb(p, ''), points_table) > best_score:\n",
" honeycomb = Honeycomb(letters, center)\n", " for center in p:\n",
" honeycomb = Honeycomb(p, center)\n",
" score = game_score2(honeycomb, points_table)\n", " score = game_score2(honeycomb, points_table)\n",
" if score > best_score:\n", " if score > best_score:\n",
" best, best_score = honeycomb, score\n", " best, best_score = honeycomb, score\n",
@ -862,8 +863,8 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"CPU times: user 459 ms, sys: 2.7 ms, total: 461 ms\n", "CPU times: user 393 ms, sys: 943 µs, total: 394 ms\n",
"Wall time: 461 ms\n" "Wall time: 393 ms\n"
] ]
}, },
{ {
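Review note: the branch-and-bound rewrite of `best_honeycomb` in this commit can be exercised on a tiny word list. The sketch below is self-contained, so it makes several assumptions that are not visible in this diff: `Honeycomb` is modelled as a `NamedTuple`, the points table is a `collections.Counter`, the bodies of `letterset`, `tabulate_points` and `letter_subsets` are reconstructed from their docstrings and the one visible line of `letter_subsets`, the final `return best` is assumed, and the `mini` list is a hypothetical stand-in for the notebook's own.

```python
from collections import Counter
from itertools import combinations
from typing import List, NamedTuple

class Honeycomb(NamedTuple):       # assumed representation
    letters: str                   # sorted string of 7 distinct letters
    center: str                    # required letter, or '' for "no center required"

def word_score(word: str) -> int:
    return 1 if len(word) == 4 else len(word) + (7 if len(set(word)) == 7 else 0)

def letterset(word: str) -> str:
    """Sorted string of the distinct letters in a word."""
    return ''.join(sorted(set(word)))

def tabulate_points(words) -> Counter:
    """Map each letterset to the total points of the words that have it."""
    table = Counter()
    for w in words:
        table[letterset(w)] += word_score(w)
    return table

def letter_subsets(honeycomb) -> List[str]:
    """Non-empty subsets of the letters that contain the center."""
    return [letters
            for n in range(1, 8)
            for letters in map(''.join, combinations(honeycomb.letters, n))
            if honeycomb.center in letters]   # '' is a substring of every string

def game_score2(honeycomb, points_table) -> int:
    # A Counter returns 0 for missing keys without inserting them.
    return sum(points_table[s] for s in letter_subsets(honeycomb))

def best_honeycomb(words) -> Honeycomb:
    points_table = tabulate_points(words)
    best, best_score = None, 0
    pangrams = (s for s in points_table if len(s) == 7)
    for p in pangrams:
        # Upper bound: score with no center requirement. Prune if it cannot win.
        if game_score2(Honeycomb(p, ''), points_table) > best_score:
            for center in p:
                honeycomb = Honeycomb(p, center)
                score = game_score2(honeycomb, points_table)
                if score > best_score:
                    best, best_score = honeycomb, score
    return best

# Hypothetical mini word list (already upper case and filtered).
mini = ['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX']
table = tabulate_points(mini)
best = best_honeycomb(mini)
bound = game_score2(Honeycomb('AEGLMPX', ''), table)
```

On this list the `'AEGLMPX'` letterset has an upper bound of 24, below the 31 already found for `'ACEIORT'`, so its seven centers are never tried. The empty-center trick depends on lettersets being strings: `'' in 'ACEIORT'` is true, whereas `'' in set('ACEIORT')` is false.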
@ -891,7 +892,7 @@
"\n", "\n",
"I'm curious about a bunch of things.\n", "I'm curious about a bunch of things.\n",
"\n", "\n",
"* What's the highest-scoring individual word?" "### What's the highest-scoring individual word?"
] ]
}, },
{ {
@ -918,7 +919,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* What are some of the pangrams?" "### What are some of the pangrams?"
] ]
}, },
{ {
@ -974,7 +975,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* What's the breakdown of reasons why words are invalid?" "### What's the breakdown of reasons why words are invalid?"
] ]
}, },
{ {
@ -1007,7 +1008,7 @@
"source": [ "source": [
"There are more than twice as many words with an 'S' as there are valid words.\n", "There are more than twice as many words with an 'S' as there are valid words.\n",
"\n", "\n",
"* About the points table: How many different letter subsets are there? " "### About the points table: How many different letter subsets are there? "
] ]
}, },
{ {
@ -1037,7 +1038,7 @@
"source": [ "source": [
"That means there's about two valid words for each letterset.\n", "That means there's about two valid words for each letterset.\n",
"\n", "\n",
"* Which letter subsets score the most?" "### Which letter subsets score the most?"
] ]
}, },
{ {
@ -1663,9 +1664,9 @@
"Thanks to a series of ideas, we were able to achieve a substantial reduction in the number of honeycombs that need to be examined (a factor of 400), the run time needed for `game_score` (a factor of about 200), and the overall run time (a factor of about 70,000).\n", "Thanks to a series of ideas, we were able to achieve a substantial reduction in the number of honeycombs that need to be examined (a factor of 400), the run time needed for `game_score` (a factor of about 200), and the overall run time (a factor of about 70,000).\n",
"\n", "\n",
"- **Enumeration (10 hours (estimate) run time; 3,364,900 honeycombs)**<br>Try every possible honeycomb.\n", "- **Enumeration (10 hours (estimate) run time; 3,364,900 honeycombs)**<br>Try every possible honeycomb.\n",
"- **Pangram Lettersets (11 minutes (estimate) run time; 55,902 honeycombs)**<br>Try just the honeycombs that are pangram lettersets (with every center).\n", "- **Pangram Lettersets (10 minutes (estimate) run time; 55,902 honeycombs)**<br>Try just the honeycombs that are pangram lettersets (with every center).\n",
"- **Points Table (2 seconds run time; 55,902 honeycombs)**<br>Precompute the score for each letterset, and sum the 64 letter subsets of each honeycomb.\n", "- **Points Table (under 2 seconds run time; 55,902 honeycombs)**<br>Precompute the score for each letterset, and sum the 64 letter subsets of each honeycomb.\n",
"- **Branch and Bound (1/2 second run time; 8,084 honeycombs)**<br>Try every center only for lettersets that score better than the best score so far.\n", "- **Branch and Bound (under 1/2 second run time; 8,084 honeycombs)**<br>Try every center only for lettersets that score better than the best score so far.\n",
"\n", "\n",
"\n", "\n",
"\n", "\n",