diff --git a/ipynb/SpellingBee.ipynb b/ipynb/SpellingBee.ipynb
index 7f42b25..bc7fd35 100644
--- a/ipynb/SpellingBee.ipynb
+++ b/ipynb/SpellingBee.ipynb
@@ -19,7 +19,7 @@
"> 2. The word must include the central letter.\n",
"> 3. The word cannot include any letter beyond the seven given letters.\n",
">\n",
- ">Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, etc. Words that use all seven letters in the honeycomb are known as **pangrams** and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 8 + 7 = 15 points.\n",
+ ">Note that letters can be repeated. For example, GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, etc. Words that use all seven letters in the honeycomb are known as **pangrams** and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 8 + 7 = 15 points.\n",
">\n",
"> ***Which seven-letter honeycomb results in the highest possible score?*** To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.\n",
">\n",
@@ -27,7 +27,7 @@
"\n",
"\n",
"\n",
- "Since the referenced [word list](https://norvig.com/ngrams/enable1.txt) came from [***my*** web site](https://norvig.com/ngrams), I felt somewhat compelled to solve this one. (Note it is a standard Scrabble® word list that I happen to host a copy of; I didn't curate it.) \n",
+ "Since the referenced word list came from [***my*** web site](https://norvig.com/ngrams), I felt compelled to solve this puzzle. (Note it is a standard public domain Scrabble® word list that I happen to host a copy of; I didn't curate it, Mendel Cooper and Alan Beale did.) \n",
"\n",
"I'll show you how I address the problem. First some imports, then we'll work through 10 steps."
]
@@ -38,24 +38,26 @@
"metadata": {},
"outputs": [],
"source": [
- "from collections import Counter, defaultdict\n",
+ "from collections import defaultdict\n",
"from dataclasses import dataclass\n",
"from itertools import combinations\n",
- "from typing import List, Set, Dict, Tuple"
+ "from typing import List, Set, Dict, Tuple, Iterable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# 1: Words, Word Scores, and Pangrams\n",
+ "# 1: Letters, Lettersets, Words, and Pangrams\n",
"\n",
- "Let's start by defining some basic terms:\n",
+ "Let's start by defining the most basic terms:\n",
"\n",
- "- **valid word**: a string of at least 4 letters ('A' to 'Z' but not 'S'), and not more than 7 distinct letters.\n",
- "- **word list**: a list of valid words.\n",
- "- **pangram**: a word with exactly 7 distinct letters.\n",
- "- **word score**: 1 for a four letter word, or the length of the word for longer words, plus 7 for a pangram."
+ "- **Letter**: the valid letters are uppercase 'A' to 'Z', but not 'S'.\n",
+ "- **Letterset**: the set of distinct letters in a word.\n",
+ "- **Word**: a string of letters.\n",
+ "- **valid word**: a word of at least 4 letters, all valid, and not more than 7 distinct letters.\n",
+ "- **pangram**: a valid word with exactly 7 distinct letters.\n",
+ "- **word list**: a list of valid words."
]
},
{
@@ -64,26 +66,31 @@
"metadata": {},
"outputs": [],
"source": [
- "def is_valid(word) -> bool:\n",
- " \"\"\"Is word 4 or more letters, no 'S', and no more than 7 distinct letters?\"\"\"\n",
- " return len(word) >= 4 and 'S' not in word and len(set(word)) <= 7\n",
+ "letters = set('ABCDEFGHIJKLMNOPQR' + 'TUVWXYZ')\n",
+ "Letter = str\n",
+ "Letterset = str\n",
+ "Word = str \n",
"\n",
- "def word_list(text) -> List[str]: \n",
- " \"\"\"All the valid words in text (uppercased).\"\"\"\n",
- " return [w for w in text.upper().split() if is_valid(w)]\n",
+ "def letterset(word) -> Letterset:\n",
+ " \"\"\"The set of distinct letters in a word.\"\"\"\n",
+ " return ''.join(sorted(set(word)))\n",
+ "\n",
+ "def is_valid(word) -> bool:\n",
+ " \"\"\"Is word 4 or more valid letters and no more than 7 distinct letters?\"\"\"\n",
+ " return len(word) >= 4 and letters.issuperset(word) and len(set(word)) <= 7 \n",
"\n",
"def is_pangram(word) -> bool: return len(set(word)) == 7\n",
"\n",
- "def word_score(word) -> int: \n",
- " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n",
- " return 1 if len(word) == 4 else len(word) + 7 * is_pangram(word)"
+ "def word_list(text: str) -> List[Word]: \n",
+ " \"\"\"All the valid words in a text (uppercased).\"\"\"\n",
+ " return [w for w in text.upper().split() if is_valid(w)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "I'll make a mini word list to experiment with: "
+ "Here's a mini word list to experiment with:"
]
},
{
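As a quick sanity check of the definitions in this hunk, here is a minimal, self-contained sketch that re-states `letterset`, `is_valid`, and `is_pangram` and runs them on a small sample. The sample text and the name `VALID_LETTERS` are assumptions made for illustration; the notebook's own `mini` cell is not shown in this diff.

```python
# Illustrative sketch only: re-states the helpers from this hunk so it runs on its own.
VALID_LETTERS = set('ABCDEFGHIJKLMNOPQR' + 'TUVWXYZ')  # 'A'..'Z' without 'S'

def letterset(word: str) -> str:
    """The distinct letters of a word, as a sorted string."""
    return ''.join(sorted(set(word)))

def is_valid(word: str) -> bool:
    """At least 4 letters, all of them valid, and at most 7 distinct letters."""
    return len(word) >= 4 and VALID_LETTERS.issuperset(word) and len(set(word)) <= 7

def is_pangram(word: str) -> bool:
    return len(set(word)) == 7

# Hypothetical sample text (ten candidate words), not taken from the notebook.
sample = 'game amalgam glam megaplex cacciatore erotica em gem gems amalgamation'
valid = [w for w in sample.upper().split() if is_valid(w)]
pangrams = {w for w in valid if is_pangram(w)}

print(valid)      # the words that survive the three validity checks
print(pangrams)   # the valid words with exactly 7 distinct letters
```

Representing a letterset as a sorted string means two words with the same letters map to the same dict key, which the points table later in the notebook relies on.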
@@ -111,7 +118,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Note that `em` and `gem` are too short, `gems` has an `s` which is not allowed, and `amalgamation` has too many distinct letters (8). We're left with six valid words out of the ten candidate words. Here are examples of the other two functions in action:"
+ "Note that `em` and `gem` are too short, `gems` has an `s`, and `amalgamation` has 8 distinct letters. We're left with six valid words out of the ten candidate words. The pangrams are:"
]
},
{
@@ -131,7 +138,19 @@
}
],
"source": [
- "{w for w in mini if is_pangram(w)}"
+ "set(filter(is_pangram, mini))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Why did I choose to represent a `Letterset` as a sorted string (and not a `set`)? Because:\n",
+ "- A `set` can't be the key of a dict.\n",
+ "- A `frozenset` can be a key, and would be a reasonable choice for `Letterset`, but it:\n",
+ " - Takes up more memory than a `str`.\n",
+ " - Is verbose and hard to read when debugging: `frozenset({'A', 'G', 'L', 'M'})`\n",
+ "- A `str` of distinct letters in sorted order fixes all these issues."
]
},
{
@@ -142,12 +161,7 @@
{
"data": {
"text/plain": [
- "{'AMALGAM': 7,\n",
- " 'CACCIATORE': 17,\n",
- " 'EROTICA': 14,\n",
- " 'GAME': 1,\n",
- " 'GLAM': 1,\n",
- " 'MEGAPLEX': 15}"
+ "'AGLM'"
]
},
"execution_count": 5,
@@ -156,16 +170,20 @@
}
],
"source": [
- "{w: word_score(w) for w in mini}"
+ "assert letterset('AMALGAM') == letterset('GLAM')\n",
+ "\n",
+ "letterset('AMALGAM')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# 2: Honeycombs and Lettersets\n",
+ "# 2: Honeycombs\n",
"\n",
- "A honeycomb lattice consists of (1) a set of seven distinct letters and (2) the one distinguished center letter:\n"
+ "A honeycomb lattice consists of:\n",
+ "- A set of seven distinct letters\n",
+ "- The one distinguished center letter"
]
},
{
@@ -174,17 +192,11 @@
"metadata": {},
"outputs": [],
"source": [
- "Letter = Letterset = str # Types\n",
- "\n",
"@dataclass(frozen=True, order=True)\n",
"class Honeycomb:\n",
" \"\"\"A Honeycomb lattice, with 7 letters, 1 of which is the center.\"\"\"\n",
- " letters: Letterset # 7 letters\n",
- " center: Letter # 1 letter\n",
- " \n",
- "def letterset(word) -> Letterset:\n",
- " \"\"\"The set of letters in a word, represented as a sorted str.\"\"\"\n",
- " return ''.join(sorted(set(word)))"
+ " letters: Letterset \n",
+ " center: Letter "
]
},
{
@@ -212,11 +224,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The type `Letter` is a `str` of 1 letter and `Letterset` is an unordered collection of letters, which I will represent as a sorted `str`. Why not a Python `set` or `frozenset`? Because a `str` takes up less space in memory, and its printed representation is easier to read when debugging. Compare:\n",
- "- `frozenset({'A', 'E', 'G', 'L', 'M', 'P', 'X'})`\n",
- "- `'AEGLMPX'`\n",
+ "# 3: Scoring\n",
"\n",
- "Why sorted? So that equal lettersets are equal:"
+ "- The **word score** is 1 point for a 4-letter word, or the word length for longer words, plus 7 bonus points for a pangram.\n",
+ "- The **game score** for a honeycomb is the sum of the word scores for the words that the honeycomb can make. \n",
+ "- A honeycomb **can make** a word if:\n",
+ " - the word contains the honeycomb's center, and\n",
+ " - every letter in the word is in the honeycomb. "
]
},
{
@@ -225,36 +239,49 @@
"metadata": {},
"outputs": [],
"source": [
- "assert letterset('EROTICA') == letterset('CACCIATORE') == 'ACEIORT'"
+ "def word_score(word) -> int: \n",
+ " \"\"\"The points for this word, including bonus for pangram.\"\"\"\n",
+ " return 1 if len(word) == 4 else (len(word) + 7 * is_pangram(word))\n",
+ "\n",
+ "def game_score(honeycomb, wordlist) -> int:\n",
+ " \"\"\"The total score for this honeycomb.\"\"\"\n",
+ " return sum(word_score(w) for w in wordlist if can_make(honeycomb, w))\n",
+ "\n",
+ "def can_make(honeycomb, word) -> bool:\n",
+ " \"\"\"Can the honeycomb make this word?\"\"\"\n",
+ " return honeycomb.center in word and all(L in honeycomb.letters for L in word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# 3: Game Score\n",
- "\n",
- "The **game score** for a honeycomb is the sum of the word scores for all the words that the honeycomb **can make**. \n",
- "\n",
- "A honeycomb can make a word if\n",
- "(1) the word contains the honeycomb's center, and\n",
- "(2) every letter in the word is in the honeycomb. "
+ "The word scores, game score (on `hc`), and makeable words for `mini` are as follows:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'AMALGAM': 7,\n",
+ " 'CACCIATORE': 17,\n",
+ " 'EROTICA': 14,\n",
+ " 'GAME': 1,\n",
+ " 'GLAM': 1,\n",
+ " 'MEGAPLEX': 15}"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "def game_score(honeycomb, wordlist) -> int:\n",
- " \"\"\"The total score for this honeycomb.\"\"\"\n",
- " return sum(word_score(w) \n",
- " for w in wordlist if can_make(honeycomb, w))\n",
- "\n",
- "def can_make(honeycomb, word) -> bool:\n",
- " \"\"\"Can the honeycomb make this word?\"\"\"\n",
- " return honeycomb.center in word and all(L in honeycomb.letters for L in word)"
+ "{w: word_score(w) for w in mini}"
]
},
{
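The scoring rules above can be checked by hand on the mini word list. The sketch below is an assumption-laden stand-in, not the notebook's code: it uses a `namedtuple` in place of the `Honeycomb` dataclass, writes `can_make` with a set comparison (which should be equivalent to the `all(...)` form in the diff), and assumes `hc` is the MEGAPLEX honeycomb with center `'G'`, which the diff does not show.

```python
from collections import namedtuple

# Stand-in for the notebook's frozen dataclass (assumption for self-containment).
Honeycomb = namedtuple('Honeycomb', 'letters center')

def is_pangram(word) -> bool:
    return len(set(word)) == 7

def word_score(word) -> int:
    """1 point for a 4-letter word, else the length, plus 7 for a pangram."""
    return 1 if len(word) == 4 else len(word) + 7 * is_pangram(word)

def can_make(honeycomb, word) -> bool:
    """Word contains the center, and uses only the honeycomb's letters."""
    return honeycomb.center in word and set(word) <= set(honeycomb.letters)

def game_score(honeycomb, wordlist) -> int:
    return sum(word_score(w) for w in wordlist if can_make(honeycomb, w))

mini = ['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX']
hc = Honeycomb('AEGLMPX', 'G')  # assumed definition of `hc`

makeable = [w for w in mini if can_make(hc, w)]
total = game_score(hc, mini)    # 7 + 1 + 1 + 15
print(makeable, total)
```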
@@ -265,7 +292,7 @@
{
"data": {
"text/plain": [
- "{'AMALGAM': 7, 'GAME': 1, 'GLAM': 1, 'MEGAPLEX': 15}"
+ "24"
]
},
"execution_count": 10,
@@ -274,28 +301,39 @@
}
],
"source": [
- "{w: word_score(w) for w in mini if can_make(hc, w)}"
+ "game_score(hc, mini) # 7 + 1 + 1 + 15"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'AMALGAM', 'GAME', 'GLAM', 'MEGAPLEX'}"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "assert game_score(hc, mini) == 24 == sum(_.values())"
+ "{w for w in mini if can_make(hc, w)}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# 4: Top Honeycomb\n",
+ "# 4: Top Honeycomb on Mini Word List\n",
"\n",
- "The strategy for finding the top (highest-scoring) honeycomb is:\n",
- " - Compile a list of valid candidate honeycombs.\n",
+ "A simple strategy for finding the top (highest-game-score) honeycomb is:\n",
+ " - Compile a list of all valid candidate honeycombs.\n",
" - For each honeycomb, compute the game score.\n",
- " - Return a (score, honeycomb) tuple with the highest score."
+ " - Return a (score, honeycomb) tuple with the maximum score."
]
},
{
@@ -304,19 +342,17 @@
"metadata": {},
"outputs": [],
"source": [
- "def top_honeycomb(words) -> Tuple[int, Honeycomb]: \n",
- " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n",
- " return max((game_score(h, words), h) \n",
- " for h in candidate_honeycombs(words))"
+ "def top_honeycomb(wordlist) -> Tuple[int, Honeycomb]: \n",
+ " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n",
+ " return max((game_score(h, wordlist), h) \n",
+ " for h in candidate_honeycombs(wordlist))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "What are the possible candidate honeycombs? We can put any letter (except 'S') in the center, then any 6 remaining letters around the outside; this gives 25 × (24 choose 6) = 3,364,900 possible honeycombs. It would take hours to apply `game_score` to all of these.\n",
- "\n",
- "Fortunately, we can use the constraint that a valid honeycomb **must make at least one pangram**. So the letters of any valid honeycomb must ***be*** the letterset of some pangram (and the center can be any of those letters):"
+ "What are the possible candidate honeycombs? We could try all letters in all slots, but that's a lot of honeycombs. Fortunately, we can use the constraint that a valid honeycomb **must make at least one pangram**. So the letters of any valid honeycomb must ***be*** the letterset of some pangram (and the center can be any of those letters):"
]
},
{
@@ -325,15 +361,15 @@
"metadata": {},
"outputs": [],
"source": [
- "def candidate_honeycombs(words) -> List[Honeycomb]:\n",
+ "def candidate_honeycombs(wordlist) -> List[Honeycomb]:\n",
" \"\"\"Valid honeycombs have pangram letters, with any center.\"\"\"\n",
" return [Honeycomb(letters, center) \n",
- " for letters in pangram_lettersets(words)\n",
+ " for letters in pangram_lettersets(wordlist)\n",
" for center in letters]\n",
"\n",
- "def pangram_lettersets(words) -> Set[Letterset]:\n",
- " \"\"\"All lettersets from the pangram words.\"\"\"\n",
- " return {letterset(w) for w in words if is_pangram(w)}"
+ "def pangram_lettersets(wordlist) -> Set[Letterset]:\n",
+ " \"\"\"All lettersets from the pangram words in wordlist.\"\"\"\n",
+ " return {letterset(w) for w in wordlist if is_pangram(w)}"
]
},
{
@@ -364,20 +400,20 @@
{
"data": {
"text/plain": [
- "[Honeycomb(letters='AEGLMPX', center='A'),\n",
- " Honeycomb(letters='AEGLMPX', center='E'),\n",
- " Honeycomb(letters='AEGLMPX', center='G'),\n",
- " Honeycomb(letters='AEGLMPX', center='L'),\n",
- " Honeycomb(letters='AEGLMPX', center='M'),\n",
- " Honeycomb(letters='AEGLMPX', center='P'),\n",
- " Honeycomb(letters='AEGLMPX', center='X'),\n",
- " Honeycomb(letters='ACEIORT', center='A'),\n",
+ "[Honeycomb(letters='ACEIORT', center='A'),\n",
" Honeycomb(letters='ACEIORT', center='C'),\n",
" Honeycomb(letters='ACEIORT', center='E'),\n",
" Honeycomb(letters='ACEIORT', center='I'),\n",
" Honeycomb(letters='ACEIORT', center='O'),\n",
" Honeycomb(letters='ACEIORT', center='R'),\n",
- " Honeycomb(letters='ACEIORT', center='T')]"
+ " Honeycomb(letters='ACEIORT', center='T'),\n",
+ " Honeycomb(letters='AEGLMPX', center='A'),\n",
+ " Honeycomb(letters='AEGLMPX', center='E'),\n",
+ " Honeycomb(letters='AEGLMPX', center='G'),\n",
+ " Honeycomb(letters='AEGLMPX', center='L'),\n",
+ " Honeycomb(letters='AEGLMPX', center='M'),\n",
+ " Honeycomb(letters='AEGLMPX', center='P'),\n",
+ " Honeycomb(letters='AEGLMPX', center='X')]"
]
},
"execution_count": 15,
@@ -393,7 +429,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now we're ready to find the highest-scoring honeycomb with the mini word list:"
+ "Now we're ready to find the highest-scoring honeycomb with the `mini` word list:"
]
},
{
@@ -420,11 +456,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "**It works.** But that's just the mini word list. \n",
+ "**The program works.** But that's just the mini word list. \n",
"\n",
- "# 5: The enable1 Word List\n",
+ "# 5: The Full Word List\n",
"\n",
- "Here's the real word list, `enable1.txt`, and some counts derived from it:"
+ "Here's the full-scale word list, `enable1.txt`:"
]
},
{
@@ -445,30 +481,24 @@
"aalii\n",
"aaliis\n",
"aals\n",
- "aardvark\n"
+ "aardvark\n",
+ " 172820 enable1.txt\n"
]
}
],
"source": [
- "! [ -e enable1.txt ] || curl -O http://norvig.com/ngrams/enable1.txt\n",
- "! head enable1.txt"
+ "! [ -e enable1.txt ] || curl -O http://norvig.com/ngrams/enable1.txt\n",
+ "! head enable1.txt\n",
+ "! wc -w enable1.txt"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- " 172820 enable1.txt\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
- "! wc -w enable1.txt"
+ "enable1 = word_list(open('enable1.txt').read())"
]
},
{
@@ -477,108 +507,46 @@
"metadata": {},
"outputs": [
{
- "data": {
- "text/plain": [
- "44585"
- ]
- },
- "execution_count": 19,
- "metadata": {},
- "output_type": "execute_result"
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Some counts for 'enable1.txt':\n",
+ " 172,820 total words\n",
+ " 44,585 valid Spelling Bee words\n",
+ " 14,741 pangram words\n",
+ " 7,986 distinct pangram lettersets\n",
+ " 55,902 candidate pangram-containing honeycombs\n",
+ "3,364,900 or 25 × (24 choose 6) possible honeycombs (98% invalid, non-pangram)\n"
+ ]
}
],
"source": [
- "enable1 = word_list(open('enable1.txt').read())\n",
- "\n",
- "len(enable1)"
+ "print(f\"\"\"Some counts for 'enable1.txt':\n",
+ "{172820:9,d} total words\n",
+ "{len(enable1):9,d} valid Spelling Bee words\n",
+ "{sum(map(is_pangram, enable1)):9,d} pangram words\n",
+ "{len(pangram_lettersets(enable1)):9,d} distinct pangram lettersets\n",
+ "{len(candidate_honeycombs(enable1)):9,d} candidate pangram-containing honeycombs\n",
+ "{25*24*23*22*21*20*19//720:9,d} or 25 × (24 choose 6) possible honeycombs (98% invalid, non-pangram)\"\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "How long will it take to run `top_honeycomb(enable1)`? Most of the computation time is in `game_score`, which is called once for each of the 55,902 candidate honeycombs, and each call has to look at all 44,585 valid words. So let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "14741"
- ]
- },
- "execution_count": 20,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len([w for w in enable1 if is_pangram(w)])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "7986"
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len(pangram_lettersets(enable1))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "55902"
- ]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len(candidate_honeycombs(enable1))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To summarize, there are:\n",
- "\n",
- "- 172,820 words in the `enable1` word list\n",
- "- 44,585 valid Spelling Bee words\n",
- "- 14,741 pangram words \n",
- "- 7,986 distinct pangram lettersets\n",
- "- 55,902 (7 × 7,986) candidate pangram-containing honeycombs\n",
- "- out of 3,364,900 theoretically possible honeycombs\n",
- "\n",
- "How long will it take to run `top_honeycomb(enable1)`? Most of the computation time is in `game_score` (each call has to look at all 44,585 valid words), so let's estimate the total time by first checking how long it takes to compute the game score of a single honeycomb:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "8.48 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+ "8.47 ms ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
@@ -590,47 +558,51 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Roughly 9 milliseconds on my computer (this may vary). How many seconds to run `game_score` for all 55,902 valid honeycombs?"
+ "Roughly 8 or 9 milliseconds for one honeycomb. For all 55,902 valid honeycombs (in minutes):"
]
},
{
"cell_type": "code",
- "execution_count": 24,
+ "execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "503.118"
+ "8.385299999999999"
]
},
- "execution_count": 24,
+ "execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "55902 * 9/1000"
+ ".009 * 55902 / 60"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "About 500 seconds, or 8 minutes. I could run `top_honeycomb(enable1)`, take a coffee break, come back, and declare victory. \n",
+ "About 8 or 9 minutes. I could run `top_honeycomb(enable1)`, get a coffee, come back, and declare victory. \n",
"\n",
- "But I think that a puzzle like this deserves a more elegant solution. I'd like to get the run time under a minute (as is suggested in [Project Euler](https://projecteuler.net/)), and I have an idea how to do it. \n",
+ "But I think that a puzzle like this deserves a more elegant solution. And I have an idea. \n",
"\n",
"# 6: Faster Algorithm: Points Table\n",
"\n",
- "Here's my plan:\n",
+ "Here's my idea:\n",
"\n",
- "1. Keep the same strategy of trying every pangram letterset, but do some precomputation that will make `game_score` much faster.\n",
- "1. The precomputation is: compute the `letterset` and `word_score` for each word in the word list, and make a table of `{letterset: total_points}` giving the total number of word score points for all the words that correspond to each letterset. I call this a **points table**.\n",
- "3. These calculations are independent of the honeycomb, so they need be done only once, not 55,902 times. \n",
- "4. Every word that a honeycomb can make is formed from a **letter subset** of the honeycomb's 7 letters. A valid letter subset must include the center letter, and may include any non-empty subset of the other 6 letters, so there are $2^6 - 1 = 63$ valid letter subsets. \n",
- "4. `game_score2` considers each of the 63 letter subsets of a honeycomb, and sums the point table entry for each one. \n",
- "5. Thus, `game_score2` iterates over just 63 letter subsets; a big optimization over `game_score`, which iterated over 44,585 words.\n",
+ "1. Try every pangram letterset, but do some precomputation to make `game_score` much faster:\n",
+ " - Compute the `letterset` and `word_score` for each word in the word list.\n",
+ " - Make a table of `{letterset: total_points}` giving the total points of all words with a given letterset. \n",
+ " - I call this a **points table**.\n",
+ " - These calculations are independent of the honeycomb, so are done once, not 55,902 times. \n",
+ "2. `game_score2` considers every letter subset of a honeycomb, and sums the point table entries. \n",
+ " - Every word that a honeycomb can make is formed from a **letter subset** of the honeycomb's 7 letters. \n",
+ " - A letter subset must include the center letter, and may include any non-empty subset of the other 6 letters.\n",
+ " - So there are $2^6 - 1 = 63$ valid letter subsets. \n",
+ " - Thus, `game_score2` iterates over just 63 letter subsets; far fewer than the 44,585 valid words.\n",
"\n",
"\n",
"Here's the code:"
@@ -638,22 +610,22 @@
},
{
"cell_type": "code",
- "execution_count": 25,
+ "execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
- "PointsTable = Dict[Letterset, int] # How many points does a letterset score?\n",
+ "PointsTable = Dict[Letterset, int] # How many total points does a letterset score?\n",
"\n",
- "def top_honeycomb2(words) -> Tuple[int, Honeycomb]: \n",
- " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n",
- " points_table = tabulate_points(words)\n",
+ "def top_honeycomb2(wordlist) -> Tuple[int, Honeycomb]: \n",
+ " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n",
+ " points_table = tabulate_points(wordlist)\n",
" return max((game_score2(h, points_table), h) \n",
- " for h in candidate_honeycombs(words))\n",
+ " for h in candidate_honeycombs(wordlist))\n",
"\n",
- "def tabulate_points(words) -> PointsTable:\n",
- " \"\"\"Return a Counter of {letterset: points} from words.\"\"\"\n",
- " table = Counter()\n",
- " for w in words:\n",
+ "def tabulate_points(wordlist) -> PointsTable:\n",
+ " \"\"\"A table of {letterset: points} from words.\"\"\"\n",
+ " table = defaultdict(int)\n",
+ " for w in wordlist:\n",
" table[letterset(w)] += word_score(w)\n",
" return table\n",
"\n",
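The rest of this cell (`game_score2` and `letter_subsets`) is unchanged and therefore not shown in the diff. As a rough illustration of the points-table idea, here is a self-contained sketch under stated assumptions: the two-argument `letter_subsets(letters, center)` and three-argument `game_score2(letters, center, table)` signatures are hypothetical and differ from the notebook's, which take a `Honeycomb`.

```python
from collections import defaultdict
from itertools import combinations

def letterset(word) -> str:
    return ''.join(sorted(set(word)))

def word_score(word) -> int:
    return 1 if len(word) == 4 else len(word) + 7 * (len(set(word)) == 7)

def tabulate_points(wordlist) -> dict:
    """A table of {letterset: total points of all words with that letterset}."""
    table = defaultdict(int)
    for w in wordlist:
        table[letterset(w)] += word_score(w)
    return table

def letter_subsets(letters, center) -> list:
    """Hypothetical helper: every subset of `letters` that contains `center`
    plus at least one other letter, each as a sorted string."""
    others = [c for c in letters if c != center]
    return [letterset(combo + (center,))
            for n in range(1, len(others) + 1)
            for combo in combinations(others, n)]

def game_score2(letters, center, table) -> int:
    """Sum the table entries of the letter subsets (63 for a 7-letter honeycomb).
    Using .get avoids inserting missing keys into the defaultdict."""
    return sum(table.get(s, 0) for s in letter_subsets(letters, center))

mini = ['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX']
table = tabulate_points(mini)
print(dict(table))
print(len(letter_subsets('AEGLMPX', 'G')), game_score2('AEGLMPX', 'G', table))
```

The cost of one score drops from a pass over every valid word to 63 dictionary lookups, which is where the speedup reported later in the notebook comes from.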
@@ -673,14 +645,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let's get a feel for how this works. \n",
- "\n",
- "First consider `letter_subsets`. A 4-letter honeycomb makes $2^3 - 1= 7$ subsets; 7-letter honeycombs make $2^6 - 1= 63$ subsets:"
+ "Let's get a feel for how this works. First, a 4-letter honeycomb has 7 letter subsets and a 7-letter honeycomb has 63:"
]
},
{
"cell_type": "code",
- "execution_count": 26,
+ "execution_count": 23,
"metadata": {},
"outputs": [
{
@@ -689,7 +659,7 @@
"['AG', 'LG', 'MG', 'ALG', 'AMG', 'LMG', 'ALMG']"
]
},
- "execution_count": 26,
+ "execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
@@ -698,53 +668,67 @@
"letter_subsets(Honeycomb('ALMG', 'G')) "
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now let's remind ourselves what `mini` is, and compute `tabulate_points(mini)`:"
- ]
- },
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": 24,
"metadata": {},
"outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "mini = ['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX']\n"
- ]
- },
{
"data": {
"text/plain": [
- "Counter({'AGLM': 8, 'ACEIORT': 31, 'AEGM': 1, 'AEGLMPX': 15})"
+ "63"
]
},
- "execution_count": 27,
+ "execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "print('mini =', mini)\n",
- "tabulate_points(mini)"
+ "len(letter_subsets(Honeycomb(letterset('MEGAPLEX'), 'G')))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "The letterset `'AGLM'` gets 8 points, 7 for AMALGAM and 1 for GLAM. `'ACEIORT'` gets 31 points, 17 for CACCIATORE and 14 for EROTICA. The other lettersets represent one word each. \n",
- "\n",
- "Let's make sure we haven't broken the `top_honeycomb` function:"
+ "Now let's look at the `mini` word list and the points table for it:"
]
},
{
"cell_type": "code",
- "execution_count": 28,
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(['AMALGAM', 'CACCIATORE', 'EROTICA', 'GAME', 'GLAM', 'MEGAPLEX'],\n",
+ " defaultdict(int, {'AGLM': 8, 'ACEIORT': 31, 'AEGM': 1, 'AEGLMPX': 15}))"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "points_table = tabulate_points(mini)\n",
+ "mini, points_table"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The letterset `'AGLM'` gets 8 points (7 for AMALGAM and 1 for GLAM). `'ACEIORT'` gets 31 points (17 for CACCIATORE and 14 for EROTICA). `'AEGM'` gets 1 for GAME and `'AEGLMPX'` gets 15 for MEGAPLEX. The other 59 lettersets have no words, no points.\n",
+ "\n",
+ "Let's make sure the new `top_honeycomb2` function works as well as the old one:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
@@ -762,20 +746,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Finally, the solution to the puzzle on the real word list:"
+ "We can now solve the puzzle on the real word list:"
]
},
{
"cell_type": "code",
- "execution_count": 29,
+ "execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 1.65 s, sys: 3.78 ms, total: 1.65 s\n",
- "Wall time: 1.65 s\n"
+ "CPU times: user 1.43 s, sys: 4.11 ms, total: 1.44 s\n",
+ "Wall time: 1.44 s\n"
]
},
{
@@ -784,7 +768,7 @@
"(3898, Honeycomb(letters='AEGINRT', center='R'))"
]
},
- "execution_count": 29,
+ "execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
@@ -799,19 +783,19 @@
"source": [
"**Wow! 3898 is a high score!** And the whole computation took **less than 2 seconds**!\n",
"\n",
- "We can see that `game_score2` is about 300 times faster than `game_score`:"
+ "We can see that `game_score2` is about 400 times faster than `game_score`:"
]
},
{
"cell_type": "code",
- "execution_count": 30,
+ "execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "26.4 µs ± 90.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
+ "21.3 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
@@ -827,14 +811,14 @@
"source": [
"# 8: Even Faster Algorithm: Branch and Bound\n",
"\n",
- "A run time of less than 2 seconds is pretty good! But I'm not ready to stop now.\n",
+ "A run time of less than 2 seconds is pretty good! But I think I can do even better.\n",
"\n",
- "Consider the word **JUKEBOX**. It is a pangram, but what with the **J**, **K**, and **X**, it is a low-scoring honeycomb, regardless of what center is used:"
+ "Consider **JUKEBOX**. It is a pangram, but with **J**, **K**, and **X**, it scores poorly, regardless of the center:"
]
},
{
"cell_type": "code",
- "execution_count": 31,
+ "execution_count": 29,
"metadata": {},
"outputs": [
{
@@ -849,7 +833,7 @@
" Honeycomb(letters='BEJKOUX', center='X'): 15}"
]
},
- "execution_count": 31,
+ "execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
@@ -869,45 +853,45 @@
"- For each pangram letterset, ask \"if we weren't required to use the center letter, what would this letterset score?\"\n",
"- Check if that score (which is an upper bound of the score using any one center letter) is higher than the top score so far.\n",
"- If yes, then try it with all seven centers; if not then discard it without trying any centers.\n",
- "- This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: prune a whole **branch** (of 7 honeycombs) if an upper **bound** can't beat the top score.\n",
+ " - This is called a [**branch and bound**](https://en.wikipedia.org/wiki/Branch_and_bound) algorithm: prune a whole **branch** (of 7 honeycombs) if an upper **bound** can't beat the top score found so far.\n",
"\n",
- "To compute the score of a honeycomb with no center, it turns out I can just call `game_score2` on `Honeycomb(letters, '')`. This works because of a quirk of Python: `game_score2` checks if `honeycomb.center in letters`; normally in Python the expression `x in y` means \"is `x` a member of the collection `y`\", but when `y` is a string it means \"is `x` a substring of `y`\", and the empty string is a substring of every string. (If I had represented a letterset as a Python `set`, this wouldn't work.)\n",
+ "*Note*: To represent a honeycomb with no center, I can just use `Honeycomb(letters, '')`. This works because of a quirk of Python: `game_score2` checks if `honeycomb.center in letters`; normally in Python the expression `e in s` means \"*is* `e` *an element of the collection* `s`\", but when `s` is a string it means \"*is* `e` *a substring of* `s`\", and the empty string is a substring of every string. (If I had represented a `Letterset` as a Python `set`, this wouldn't work.)\n",
"\n",
- "Thus, I can rewrite `top_honeycomb` as follows:"
+ "Thus, I can rewrite `top_honeycomb` in this more efficient form:"
]
},
{
"cell_type": "code",
- "execution_count": 32,
+ "execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"def top_honeycomb3(words) -> Tuple[int, Honeycomb]: \n",
- " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n",
+ " \"\"\"Find a (score, honeycomb) tuple with a highest-scoring honeycomb.\"\"\"\n",
" points_table = tabulate_points(words)\n",
- " top_score, top = -1, None\n",
- " pangrams = (s for s in points_table if len(s) == 7)\n",
+ " top_score, top_honeycomb = -1, None\n",
+ " pangrams = [s for s in points_table if len(s) == 7]\n",
" for p in pangrams:\n",
" if game_score2(Honeycomb(p, ''), points_table) > top_score:\n",
" for center in p:\n",
" honeycomb = Honeycomb(p, center)\n",
" score = game_score2(honeycomb, points_table)\n",
" if score > top_score:\n",
- " top_score, top = score, honeycomb\n",
- " return top_score, top"
+ " top_score, top_honeycomb = score, honeycomb\n",
+ " return top_score, top_honeycomb"
]
},
{
"cell_type": "code",
- "execution_count": 33,
+ "execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "CPU times: user 360 ms, sys: 2 ms, total: 362 ms\n",
- "Wall time: 361 ms\n"
+ "CPU times: user 309 ms, sys: 1.67 ms, total: 311 ms\n",
+ "Wall time: 310 ms\n"
]
},
{
@@ -916,7 +900,7 @@
"(3898, Honeycomb(letters='AEGINRT', center='R'))"
]
},
- "execution_count": 33,
+ "execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
@@ -929,126 +913,48 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Awesome! We get the same answer, and the computation is about 5 times faster; about **1/3 second**.\n",
+ "Awesome! We get the correct answer, and it runs about four times faster than `top_honeycomb2`.\n",
"\n",
- "For how many pangram lettersets did we have to check all 7 centers? We can find out by copy-pasting `top_honeycomb3` and annotating it to keep a `COUNT` of the number of pangrams that are checked, and to print a line of output when a honeycomb (either with or without a center letter) outscores the top score."
+ "How many honeycombs does `top_honeycomb3` examine? We can wrap `Honeycomb` in `functools.lru_cache` and let the cache statistics keep track; each cache miss is one distinct honeycomb constructed:"
]
},
{
"cell_type": "code",
- "execution_count": 34,
+ "execution_count": 32,
"metadata": {},
"outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "ADFLORW scores 333 with center ∅ (pangram 1 (1/7986))\n",
- " scores 265 with center A \n",
- "ABDENOR scores 1856 with center ∅ (pangram 2 (2/7986))\n",
- " scores 1148 with center A \n",
- " scores 1476 with center D \n",
- " scores 1578 with center E \n",
- "ABEIRTV scores 1585 with center ∅ (pangram 3 (4/7986))\n",
- "ABDEORT scores 2434 with center ∅ (pangram 4 (28/7986))\n",
- " scores 1679 with center A \n",
- " scores 2134 with center E \n",
- "ABCDERT scores 2254 with center ∅ (pangram 5 (35/7986))\n",
- "ACELNRT scores 2529 with center ∅ (pangram 6 (46/7986))\n",
- " scores 2158 with center A \n",
- " scores 2225 with center E \n",
- "ACDELRT scores 2746 with center ∅ (pangram 7 (47/7986))\n",
- " scores 2273 with center A \n",
- " scores 2608 with center E \n",
- "ACENORT scores 2799 with center ∅ (pangram 8 (50/7986))\n",
- "ACEIPRT scores 2653 with center ∅ (pangram 9 (57/7986))\n",
- "ACDEIRT scores 3407 with center ∅ (pangram 10 (71/7986))\n",
- " scores 3023 with center E \n",
- "ACEINRT scores 3575 with center ∅ (pangram 11 (77/7986))\n",
- "ADEOPRT scores 3031 with center ∅ (pangram 12 (157/7986))\n",
- "AEGINRT scores 4688 with center ∅ (pangram 13 (178/7986))\n",
- " scores 3372 with center A \n",
- " scores 3769 with center E \n",
- " scores 3782 with center N \n",
- " scores 3898 with center R \n",
- "ADEINRT scores 4020 with center ∅ (pangram 14 (419/7986))\n"
- ]
- },
{
"data": {
"text/plain": [
- "(3898, Honeycomb(letters='AEGINRT', center='R'))"
+ "CacheInfo(hits=0, misses=8084, maxsize=None, currsize=8084)"
]
},
- "execution_count": 34,
+ "execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "def top_honeycomb3_annotated(words) -> Tuple[int, Honeycomb]: \n",
- " \"\"\"Return a (score, honeycomb) tuple with a highest-scoring honeycomb. Print stuff.\"\"\"\n",
- " points_table = tabulate_points(words)\n",
- " top_score, top = -1, None\n",
- " pangrams = [s for s in points_table if len(s) == 7]\n",
- " COUNT = 0\n",
- " for i, p in enumerate(pangrams, 1):\n",
- " if game_score2(Honeycomb(p, ''), points_table) > top_score:\n",
- " COUNT +=1; \n",
- " print(f'{p} scores {game_score2(Honeycomb(p, \"\"), points_table):4}',\n",
- " f'with center ∅ (pangram {COUNT:2} ({i}/{len(pangrams)}))')\n",
- " for center in p:\n",
- " honeycomb = Honeycomb(p, center)\n",
- " score = game_score2(honeycomb, points_table)\n",
- " if score > top_score:\n",
- " top_score, top = score, honeycomb\n",
- " print(f'{\" \":8}scores {top_score:4} with center {top.center} ')\n",
- " return top_score, top\n",
- "\n",
- "top_honeycomb3_annotated(enable1)"
+ "import functools\n",
+ "Honeycomb = functools.lru_cache(None)(Honeycomb)\n",
+ "top_honeycomb3(enable1)\n",
+ "Honeycomb.cache_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Only 14 pangram lettersets had to have all 7 centers checked. We were lucky that the 4th pangram out of 7986, ABDEORT, happened to be aa good one, scoring 2134 points (with center E), setting a high score so that most of the remaining pangrams only needed to be checked with an empty center. The total number of calls to `game_score2` is:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "8084"
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len(pangram_lettersets(enable1)) + 14 * 7"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "8,084 is a big improvement over 55,902.\n",
+ "This says that `top_honeycomb3` examined 8,084 honeycombs, an almost 7-fold improvement over the 55,902 examined by `top_honeycomb2`.\n",
"\n",
"# 9: Fancy Report\n",
"\n",
- "I'd like to see the actual words that each honeycomb can make, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module) but this turned out to be a lot more complicated than I expected. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)."
+ "I'd like to see the actual words that each honeycomb can make, and I'm curious about how the words are divided up by letterset. Here's a function to provide such a report. I remembered that there is a `fill` function in Python (it is in the `textwrap` module), but this report turned out to be a lot more complicated than I anticipated. I guess it is difficult to create a practical extraction and reporting tool. I feel you, [Larry Wall](http://www.wall.org/~larry/)."
]
},
{
"cell_type": "code",
- "execution_count": 36,
+ "execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
@@ -1067,7 +973,7 @@
" subsets = letter_subsets(honeycomb)\n",
" nwords = sum(len(bins[s]) for s in subsets)\n",
" print(f'{adj}{honeycomb} scores {Ns(score, \"point\")} on {Ns(nwords, \"word\")}',\n",
- " f'from a {len(words)} word list:\\n')\n",
+ " f'from a {len(words):,d} word list:\\n')\n",
" for s in sorted(subsets, key=lambda s: (-len(s), s)):\n",
" if bins[s]:\n",
" pts = sum(word_score(w) for w in bins[s])\n",
@@ -1080,7 +986,7 @@
"def Ns(n, noun):\n",
" \"\"\"A string with `n` followed by the plural or singular of noun:\n",
" Ns(3, 'bear') => '3 bears'; Ns(1, 'world') => '1 world'\"\"\" \n",
- " return f\"{n:d} {noun}{' ' if n == 1 else 's'}\"\n",
+ " return f\"{n:,d} {noun}{' ' if n == 1 else 's'}\"\n",
"\n",
"def group_by(items, key):\n",
" \"Group items into bins of a dict, each bin keyed by key(item).\"\n",
@@ -1090,9 +996,16 @@
" return bins"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here are reports for the mini and full word lists:"
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 37,
+ "execution_count": 34,
"metadata": {},
"outputs": [
{
@@ -1108,19 +1021,19 @@
}
],
"source": [
- "report(Honeycomb('AEGLMPX', 'G'), mini)"
+ "report(hc, mini)"
]
},
{
"cell_type": "code",
- "execution_count": 38,
+ "execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Top Honeycomb(letters='AEGINRT', center='R') scores 3898 points on 537 words from a 44585 word list:\n",
+ "Top Honeycomb(letters='AEGINRT', center='R') scores 3,898 points on 537 words from a 44,585 word list:\n",
"\n",
"AEGINRT 832 points 50 pangrams AERATING(15) AGGREGATING(18) ARGENTINE(16) ARGENTITE(16) ENTERTAINING(19)\n",
" ENTRAINING(17) ENTREATING(17) GARNIERITE(17) GARTERING(16) GENERATING(17) GNATTIER(15) GRANITE(14)\n",
@@ -1233,16 +1146,16 @@
},
{
"cell_type": "code",
- "execution_count": 39,
+ "execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Top Honeycomb(letters='AEINRST', center='E') scores 8681 points on 1179 words from a 98141 word list:\n",
+ "Top Honeycomb(letters='AEINRST', center='E') scores 8,681 points on 1,179 words from a 98,141 word list:\n",
"\n",
- "AEINRST 1381 points 86 pangrams ANESTRI(14) ANTISERA(15) ANTISTRESS(17) ANTSIER(14) ARENITES(15) ARSENITE(15)\n",
+ "AEINRST 1,381 points 86 pangrams ANESTRI(14) ANTISERA(15) ANTISTRESS(17) ANTSIER(14) ARENITES(15) ARSENITE(15)\n",
" ARSENITES(16) ARTINESS(15) ARTINESSES(17) ATTAINERS(16) ENTERTAINERS(19) ENTERTAINS(17) ENTRAINERS(17)\n",
" ENTRAINS(15) ENTREATIES(17) ERRANTRIES(17) INERTIAS(15) INSTANTER(16) INTENERATES(18) INTERSTATE(17)\n",
" INTERSTATES(18) INTERSTRAIN(18) INTERSTRAINS(19) INTRASTATE(17) INTREATS(15) IRATENESS(16)\n",
@@ -1420,8 +1333,8 @@
}
],
"source": [
- "enable1s = [w for w in open('enable1.txt').read().upper().split()\n",
- " if len(w) >= 4 and len(set(w)) <= 7]\n",
+ "letters.add('S') # Make 'S' a legal letter\n",
+ "enable1s = word_list(open('enable1.txt').read())\n",
"\n",
"report(words=enable1s)"
]
@@ -1432,14 +1345,14 @@
"source": [
"Allowing 'S' words more than doubles the score!\n",
"\n",
- "Here are pictures for the highest-scoring honeycombs, with and without an S:\n",
+ "Here are the highest-scoring honeycombs (with and without an S) with their stats and mnemonics:\n",
"\n",
"
\n",
"