Add files via upload

Peter Norvig 2020-01-10 12:28:13 -08:00 committed by GitHub
parent 56b1aab373
commit f63d7be48a

@@ -28,7 +28,9 @@
"\n",
"# My Approach\n",
"\n",
"Since the referenced [word list](https://norvig.com/ngrams/enable1.txt) came from *my* web site (it is a standard Scrabble word list that I host a copy of), I felt somewhat compelled to solve this. I had worked on word games before, like Scrabble and Boggle. This puzzle is different because it deals with *unordered sets* of letters, not *ordered permutations* of letters. That makes things much easier. When I searched for an optimal 5×5 Boggle board, I couldn't exhaustively try all $26^{(5×5)} \\approx 10^{35}$ possibilites; I could only do hillclimbing to find a local maximum. But for Spelling Bee, it is feasible to try every possibility and get a guaranteed highest-scoring honeycomb. Here's a sketch of my approach:\n",
"Since the referenced [word list](https://norvig.com/ngrams/enable1.txt) came from *my* web site (I didn't make up the list; it is a standard Scrabble word list that I happen to host a copy of), I felt somewhat compelled to solve this one. \n",
"\n",
"This puzzle is different from other word puzzles because it deals with *unordered sets* of letters, not *ordered permutations* of letters. That makes things easier. When I searched for an optimal 5×5 Boggle board, I couldn't exhaustively try all $26^{(5×5)} \\approx 10^{35}$ possibilites; I could only do hillclimbing to find a local maximum. But for Spelling Bee, it is feasible to try every possibility and get a guaranteed highest-scoring honeycomb. Here's a sketch of my approach:\n",
" \n",
"- Since order and repetition don't count, we can represent a word as a **set** of letters, which I will call a `letterset`. For simplicity I'll choose to implement that as a sorted string (not as a Python `set` or `frozenset`). For example:\n",
" letterset(\"GLAM\") == letterset(\"AMALGAM\") == \"AGLM\"\n",
@@ -37,7 +39,7 @@
"- Since the rules say every valid honeycomb must contain a pangram, it must be that case that every valid honeycomb *is* a pangram. That means:\n",
" * The number of valid honeycombs is 7 times the number of pangram lettersets (because any of the 7 letters could be the center).\n",
" * I will consider every valid honeycomb and compute the game score for each one.\n",
" * The one with the highest game score is guaranteed to be the best possible honeycomb.\n"
" * The one with the highest game score is guaranteed to be the optimal honeycomb.\n"
]
},
{
@@ -116,9 +118,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `I`, `me` and `gem` are too short, `games` has an `S` which is not allowed, and `amalgamation` has too many distinct letters. We're left with six valid words out of the original eleven.\n",
"\n",
"Here are examples of the functions in action:"
"Note that `I`, `me` and `gem` are too short, `games` has an `S` which is not allowed, and `amalgamation` has too many distinct letters (8). We're left with six valid words out of the original eleven. Here are examples of the functions in action:"
]
},
{
@@ -366,116 +366,19 @@
"source": [
"So: we start with 172,820 words in the enable1 word list, reduce that to 44,585 valid Spelling Bee words, and find that 14,741 of those words are pangrams. \n",
"\n",
"I'm curious: what's the highest-scoring individual word?"
"How long will it take to run `best_honeycomb(enable1)`? Let's estimate by checking how long it takes to compute the game score of a single honeycomb:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ANTITOTALITARIAN'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max(enable1, key=word_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And what are some of the pangrams?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['AARDWOLF',\n",
" 'BABBLEMENT',\n",
" 'CABEZON',\n",
" 'COLLOGUING',\n",
" 'DEMERGERING',\n",
" 'ETYMOLOGY',\n",
" 'GARROTTING',\n",
" 'IDENTIFY',\n",
" 'LARVICIDAL',\n",
" 'MORTGAGEE',\n",
" 'OVERHELD',\n",
" 'PRAWNED',\n",
" 'REINITIATED',\n",
" 'TOWHEAD',\n",
" 'UTOPIAN']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pangrams[::1000] # Every thousandth one"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And what's the breakdown of reasons why words are invalid?\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('S', 103913), ('valid', 44585), ('long', 23400), ('short', 922)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Counter(('S' if 'S' in w else 'short' if len(w) < 4 else 'long' if len(set(w)) > 7 else 'valid')\n",
" for w in open('enable1.txt').read().upper().split()).most_common()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are more than twice as many words with an 'S' than there are valid words.\n",
"But how long will it take to run the computation on the big `enable1` word list? Let's see how long it takes to compute the game score of a single honeycomb:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 11.9 ms, sys: 506 µs, total: 12.4 ms\n",
"CPU times: user 11.9 ms, sys: 391 µs, total: 12.3 ms\n",
"Wall time: 12 ms\n"
]
},
@@ -485,7 +388,7 @@
"153"
]
},
"execution_count": 17,
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
@@ -503,7 +406,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 15,
"metadata": {},
"outputs": [
{
@@ -512,7 +415,7 @@
"20.6374"
]
},
"execution_count": 18,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -533,7 +436,7 @@
"\n",
"1. Keep the same strategy of trying every pangram, but do some precomputation that will make `game_score` much faster.\n",
"1. The precomputation is: compute the `letterset` and `word_score` for each word, and make a table of `{letterset: points}` giving the total number of points that can be made with each letterset. I call this a `points_table`.\n",
"3. These calculations are independent of the honeycomb, so they only need to be done, not 14,741 × 7 times. \n",
"3. These calculations are independent of the honeycomb, so they need to be done only once, not 14,741 × 7 times. \n",
"4. Within `game_score`, generate every valid **subset** of the letters in the honeycomb. A valid subset must include the center letter, and it may or may not include each of the other 6 letters, so there are exactly $2^6 = 64$ subsets. The function `letter_subsets(honeycomb)` returns these.\n",
"5. To compute `game_score`, just take the sum of the 64 subset entries in the points table.\n",
"\n",
@@ -543,12 +446,12 @@
"Since 64 &lt; 44,585, that's a nice optimization!\n",
"\n",
"\n",
"Here's the code. Notice we've changed the interface to `game_score`; it now takes a points table, not a word list."
"Here's the code. Notice we've changed the interface to `game_score`; it now takes a points table, not a word list. So beware if you are jumping around in this notebook and re-executing previous cells."
]
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
@@ -589,7 +492,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -598,13 +501,14 @@
"['C', 'AC', 'BC', 'CD', 'ABC', 'ACD', 'BCD', 'ABCD']"
]
},
"execution_count": 20,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_subsets(('ABCD', 'C')) # A 4-letter honeycomb gives 2**3 = 8 subsets; 7-letter gives 64"
"# A 4-letter honeycomb makes 2**3 = 8 subsets; 7-letter honeycombs make 64\n",
"letter_subsets(('ABCD', 'C')) "
]
},
{
@@ -616,29 +520,41 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['AMALGAM', 'GAME', 'GLAM', 'MEGAPLEX', 'CACCIATORE', 'EROTICA']\n"
]
},
"data": {
"text/plain": [
"['AMALGAM', 'GAME', 'GLAM', 'MEGAPLEX', 'CACCIATORE', 'EROTICA']"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words # Remind me again what the words are?"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'AGLM': 8, 'AEGM': 1, 'AEGLMPX': 15, 'ACEIORT': 31})"
]
},
"execution_count": 21,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(words)\n",
"points_table(words)"
]
},
@@ -653,7 +569,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
@@ -677,15 +593,15 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.05 s, sys: 5.2 ms, total: 2.05 s\n",
"Wall time: 2.06 s\n"
"CPU times: user 1.99 s, sys: 4.75 ms, total: 2 s\n",
"Wall time: 2 s\n"
]
},
{
@@ -694,7 +610,7 @@
"[3898, ('AEGINRT', 'R')]"
]
},
"execution_count": 23,
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
@@ -708,15 +624,236 @@
"metadata": {},
"source": [
"**Wow! 3898 is a high score!** And it took only 2 seconds to find it!\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Curiosity\n",
"\n",
"# Fancy Report\n",
"I'm curious about a bunch of things.\n",
"\n",
"I'd like to see the actual words in addition to the total score, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this all turned out to be more complicated than I expected."
"What's the highest-scoring individual word?"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ANTITOTALITARIAN'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max(enable1, key=word_score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are some of the pangrams?"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['AARDWOLF',\n",
" 'BABBLEMENT',\n",
" 'CABEZON',\n",
" 'COLLOGUING',\n",
" 'DEMERGERING',\n",
" 'ETYMOLOGY',\n",
" 'GARROTTING',\n",
" 'IDENTIFY',\n",
" 'LARVICIDAL',\n",
" 'MORTGAGEE',\n",
" 'OVERHELD',\n",
" 'PRAWNED',\n",
" 'REINITIATED',\n",
" 'TOWHEAD',\n",
" 'UTOPIAN']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pangrams[::1000] # Every thousandth one"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the breakdown of reasons why words are invalid?\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('S', 103913), ('valid', 44585), ('>7', 23400), ('<4', 922)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Counter('S' if 'S' in w else '<4' if len(w) < 4 else '>7' if len(set(w)) > 7 else 'valid'\n",
" for w in open('enable1.txt').read().upper().split()).most_common()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are more than twice as many words with an 'S' as there are valid words.\n",
"\n",
"About the `points_table`: How many different letter subsets are there? Which ones score the most? The least?"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"21661"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pts = points_table(enable1)\n",
"len(pts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That means there's about two valid words for each letterset."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('AEGINRT', 832),\n",
" ('ADEGINR', 486),\n",
" ('ACILNOT', 470),\n",
" ('ACEINRT', 465),\n",
" ('CEINORT', 398),\n",
" ('AEGILNT', 392),\n",
" ('AGINORT', 380),\n",
" ('ADEINRT', 318),\n",
" ('CENORTU', 318),\n",
" ('ACDEIRT', 307),\n",
" ('AEGILNR', 304),\n",
" ('AEILNRT', 283),\n",
" ('AEGINR', 270),\n",
" ('ACINORT', 266),\n",
" ('ADENRTU', 265),\n",
" ('EGILNRT', 259),\n",
" ('AILNORT', 252),\n",
" ('DEGINR', 251),\n",
" ('AEIMNRT', 242),\n",
" ('ACELORT', 241)]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pts.most_common(20)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('IRY', 1),\n",
" ('AGOY', 1),\n",
" ('GHOY', 1),\n",
" ('GIOY', 1),\n",
" ('EKOY', 1),\n",
" ('ORUY', 1),\n",
" ('EOWY', 1),\n",
" ('ANUY', 1),\n",
" ('AGUY', 1),\n",
" ('ELUY', 1),\n",
" ('ANYZ', 1),\n",
" ('BEUZ', 1),\n",
" ('EINZ', 1),\n",
" ('EKRZ', 1),\n",
" ('ILZ', 1),\n",
" ('CIOZ', 1),\n",
" ('KNOZ', 1),\n",
" ('NOZ', 1),\n",
" ('IORZ', 1),\n",
" ('EMYZ', 1)]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pts.most_common()[-20:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fancy Report\n",
"\n",
"I'd like to see the actual words that each honeycomb can make, in addition to the total score, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this all turned out to be more complicated than I expected. I guess it is difficult to create a practical extraction and reporting tool. I feel you, Larry Wall."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": false
},
@@ -733,18 +870,22 @@
" subsets = letter_subsets(honeycomb)\n",
" bins = group_by(words, letterset)\n",
" score = sum(word_score(w) for w in words if letterset(w) in subsets)\n",
" N = sum(len(bins[s]) for s in subsets)\n",
" print(f'For this list of {len(words):,d} words:')\n",
" nwords = sum(len(bins[s]) for s in subsets)\n",
" print(f'For this list of {Ns(len(words), \"word\")}:')\n",
" print(f'The {optimal}honeycomb {honeycomb} forms '\n",
" f'{N:,d} words for {score:,d} points.')\n",
" print(f'Here are the words formed, with pangrams first:\\n')\n",
" f'{Ns(nwords, \"word\")} for {Ns(score, \"point\")}.')\n",
" print(f'Here are the words formed by each subset, with pangrams first:\\n')\n",
" for s in sorted(subsets, key=lambda s: (-len(s), s)):\n",
" if bins[s]:\n",
" pts = sum(word_score(w) for w in bins[s])\n",
" print(f'{s} forms {len(bins[s])} words for {pts:,d} points:')\n",
" print(f'{s} forms {Ns(len(bins[s]), \"word\")} for {Ns(pts, \"point\")}:')\n",
" words = [f'{w}({word_score(w)})' for w in sorted(bins[s])]\n",
" print(fill(' '.join(words), width=80,\n",
" initial_indent=' ', subsequent_indent=' '))\n",
" \n",
"def Ns(n, things):\n",
" \"\"\"Ns(3, 'bear') => '3 bears'; Ns(1, 'world') => '1 world'\"\"\" \n",
" return f\"{n:,d} {things}{'' if n == 1 else 's'}\"\n",
"\n",
"def group_by(items, key):\n",
" \"Group items into bins of a dict, each bin keyed by key(item).\"\n",
@@ -756,7 +897,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 29,
"metadata": {},
"outputs": [
{
@@ -765,11 +906,11 @@
"text": [
"For this list of 6 words:\n",
"The honeycomb ('AEGLMPX', 'G') forms 4 words for 24 points.\n",
"Here are the words formed, with pangrams first:\n",
"Here are the words formed by each subset, with pangrams first:\n",
"\n",
"AEGLMPX forms 1 words for 15 points:\n",
"AEGLMPX forms 1 word for 15 points:\n",
" MEGAPLEX(15)\n",
"AEGM forms 1 words for 1 points:\n",
"AEGM forms 1 word for 1 point:\n",
" GAME(1)\n",
"AGLM forms 2 words for 8 points:\n",
" AMALGAM(7) GLAM(1)\n"
@@ -782,7 +923,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 30,
"metadata": {
"scrolled": false
},
@@ -793,7 +934,7 @@
"text": [
"For this list of 44,585 words:\n",
"The optimal honeycomb ('AEGINRT', 'R') forms 537 words for 3,898 points.\n",
"Here are the words formed, with pangrams first:\n",
"Here are the words formed by each subset, with pangrams first:\n",
"\n",
"AEGINRT forms 50 words for 832 points:\n",
" AERATING(15) AGGREGATING(18) ARGENTINE(16) ARGENTITE(16) ENTERTAINING(19)\n",
@@ -858,9 +999,9 @@
" AGRARIAN(8) AIRING(6) ANGARIA(7) ARRAIGN(7) ARRAIGNING(10) ARRANGING(9)\n",
" GARAGING(8) GARNI(5) GARRING(7) GNARRING(8) GRAIN(5) GRAINING(8) INGRAIN(7)\n",
" INGRAINING(10) RAGGING(7) RAGING(6) RAINING(7) RANGING(7) RARING(6)\n",
"AGIRT forms 1 words for 5 points:\n",
"AGIRT forms 1 word for 5 points:\n",
" TRAGI(5)\n",
"AGNRT forms 1 words for 5 points:\n",
"AGNRT forms 1 word for 5 points:\n",
" GRANT(5)\n",
"AINRT forms 9 words for 64 points:\n",
" ANTIAIR(7) ANTIAR(6) ANTIARIN(8) INTRANT(7) IRRITANT(8) RIANT(5) TITRANT(7)\n",
@@ -943,7 +1084,7 @@
" EGER(1) EGGER(5) GREE(1) GREEGREE(8)\n",
"EIR forms 2 words for 11 points:\n",
" EERIE(5) EERIER(6)\n",
"ENR forms 1 words for 1 points:\n",
"ENR forms 1 word for 1 point:\n",
" ERNE(1)\n",
"ERT forms 7 words for 27 points:\n",
" RETE(1) TEETER(6) TERETE(6) TERRET(6) TETTER(6) TREE(1) TRET(1)\n",
@@ -967,7 +1108,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
@@ -979,7 +1120,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 32,
"metadata": {
"scrolled": false
},
@@ -990,7 +1131,7 @@
"text": [
"For this list of 98,141 words:\n",
"The optimal honeycomb ('AEINRST', 'E') forms 1,179 words for 8,681 points.\n",
"Here are the words formed, with pangrams first:\n",
"Here are the words formed by each subset, with pangrams first:\n",
"\n",
"AEINRST forms 86 words for 1,381 points:\n",
" ANESTRI(14) ANTISERA(15) ANTISTRESS(17) ANTSIER(14) ARENITES(15)\n",
@@ -1164,7 +1305,7 @@
" AERIE(5) AERIER(6) AIRER(5) AIRIER(6)\n",
"AEIS forms 2 words for 13 points:\n",
" EASIES(6) SASSIES(7)\n",
"AEIT forms 1 words for 6 points:\n",
"AEIT forms 1 word for 6 points:\n",
" TATTIE(6)\n",
"AENR forms 9 words for 40 points:\n",
" ANEAR(5) ARENA(5) EARN(1) EARNER(6) NEAR(1) NEARER(6) RANEE(5) REEARN(6)\n",
@@ -1234,15 +1375,15 @@
" ASEA(1) ASSES(5) ASSESS(6) ASSESSES(8) EASE(1) EASES(5) SASSES(6) SEAS(1)\n",
"AET forms 2 words for 2 points:\n",
" TATE(1) TEAT(1)\n",
"EIN forms 1 words for 1 points:\n",
"EIN forms 1 word for 1 point:\n",
" NINE(1)\n",
"EIR forms 2 words for 11 points:\n",
" EERIE(5) EERIER(6)\n",
"EIS forms 7 words for 35 points:\n",
" ISSEI(5) ISSEIS(6) SEIS(1) SEISE(5) SEISES(6) SISES(5) SISSIES(7)\n",
"EIT forms 1 words for 6 points:\n",
"EIT forms 1 word for 6 points:\n",
" TITTIE(6)\n",
"ENR forms 1 words for 1 points:\n",
"ENR forms 1 word for 1 point:\n",
" ERNE(1)\n",
"ENS forms 6 words for 20 points:\n",
" NESS(1) NESSES(6) SEEN(1) SENE(1) SENSE(5) SENSES(6)\n",
@@ -1257,7 +1398,7 @@
" SESTET(6) SESTETS(7) SETS(1) SETT(1) SETTEE(6) SETTEES(7) SETTS(5) STET(1)\n",
" STETS(5) TEES(1) TEST(1) TESTEE(6) TESTEES(7) TESTES(6) TESTS(5) TETS(1)\n",
" TSETSE(6) TSETSES(7)\n",
"EN forms 1 words for 1 points:\n",
"EN forms 1 word for 1 point:\n",
" NENE(1)\n",
"ES forms 3 words for 7 points:\n",
" ESES(1) ESSES(5) SEES(1)\n"
@@ -1272,9 +1413,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the highest-scoring honeycombs, with and without an S:\n",
"# Pictures\n",
"\n",
"<img src=\"http://norvig.com/honeycombs.png\" width=\"400\">"
"Here are pictures for the highest-scoring honeycombs, with and without an S:\n",
"\n",
"<img src=\"http://norvig.com/honeycombs.png\" width=\"350\">"
]
}
],