From 5b9b2c2a1174e25396176446f036d0268996cb64 Mon Sep 17 00:00:00 2001 From: Peter Norvig Date: Wed, 25 Mar 2026 17:16:09 -0700 Subject: [PATCH] Add files via upload --- ipynb/xkcd1313.ipynb | 647 +++++++++++++++++++++++-------------------- 1 file changed, 350 insertions(+), 297 deletions(-) diff --git a/ipynb/xkcd1313.ipynb b/ipynb/xkcd1313.ipynb index 85eadab..732d5fb 100644 --- a/ipynb/xkcd1313.ipynb +++ b/ipynb/xkcd1313.ipynb @@ -4,7 +4,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -13,17 +12,19 @@ "source": [ "# xkcd 1313: Regex Golf\n", "\n", - "

Peter Norvig
January 2014
revised November 2015

\n", + "

Peter Norvig
January 2014
new movies, November 2015
Python 3, March 2026

\n", "\n", - "I ♡ [xkcd](http://xkcd.com)! It reliably provides top-rate [insights](http://xkcd.com/285/), [humor](http://xkcd.com/612/), or [both](http://xkcd.com/627/). I was thrilled when I got to [introduce Randall Munroe](http://www.youtube.com/watch?v=zJOS0sV2a24) for a talk in 2007. But in [xkcd #1313](http://xkcd.com/1313), \n", + "I ❤️ [**xkcd**](http://xkcd.com)! It reliably provides top-rate [insights](http://xkcd.com/285/), [humor](http://xkcd.com/612/), or [both](http://xkcd.com/627/). I was thrilled when I got to [introduce Randall Munroe](http://www.youtube.com/watch?v=zJOS0sV2a24) for a talk in 2007. \n", + "\n", + "But in [xkcd #1313](http://xkcd.com/1313), \n", "\n", "\n", "\n", - "I found that the hover text, \"/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents\", contains a confusing contradiction. I'm old enough to remember that Jimmy Carter won one term and lost a second. No regular expression could both match and not match \"Carter\". \n", + "I found that the hover text, \"/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents\", contains a confusing contradiction. I'm old enough to remember that Jimmy Carter won one term and lost a second. No regular expression could both match and not match \"Carter\"!\n", "\n", "But this got me thinking: can I come up with an algorithm to match or beat Randall's regex golf scores? 
The game is on.\n", "\n", - "# Presidents\n", + "# Part 1: Presidents\n", " \n", "I started by finding a [listing](http://www.anesi.com/presname.htm) of presidential elections, giving me these winners and losers:" ] @@ -33,8 +34,6 @@ "execution_count": 1, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -42,11 +41,10 @@ }, "outputs": [], "source": [ - "from __future__ import division, print_function\n", "import re\n", "import itertools\n", "\n", - "def words(text): return set(text.split())\n", + "def words(text: str) -> set[str]: return set(text.split())\n", "\n", "winners = words('''washington adams jefferson jefferson madison madison monroe \n", " monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan \n", @@ -67,7 +65,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -82,8 +79,6 @@ "execution_count": 2, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -120,7 +115,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -135,8 +129,6 @@ "execution_count": 3, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -151,7 +143,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -169,8 +160,6 @@ "execution_count": 4, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -178,23 +167,16 @@ }, "outputs": [], "source": [ - "def mistakes(regex, winners, losers):\n", - " \"The set of mistakes made by this regex in classifying winners and losers.\"\n", - " return ({\"Should 
have matched: \" + W \n", - " for W in winners if not re.search(regex, W)} |\n", - " {\"Should not have matched: \" + L \n", - " for L in losers if re.search(regex, L)})\n", - "\n", - "def verify(regex, winners, losers): \n", - " assert not mistakes(regex, winners, losers)\n", - " return True" + "def mistakes(regex, winners, losers) -> set[str]:\n", + " \"\"\"The set of mistakes made by this regex in classifying winners and not losers.\"\"\"\n", + " return ({\"Should have matched: \" + W for W in winners if not re.search(regex, W)}\n", + " | {\"Should not have matched: \" + L for L in losers if re.search(regex, L)})" ] }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -209,8 +191,6 @@ "execution_count": 5, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -238,16 +218,15 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ - "The xkcd regex incorrectly matches `\"fremont\"`, representing John C. Frémont, the Republican candidate who lost to James Buchanan in 1856. Could Randall Munroe have made an error? Is someone **[wrong](http://xkcd.com/386/)** on the Internet? Investigating the [1856 election](http://en.wikipedia.org/wiki/United_States_presidential_election,_1856), I see that Randall must have had Millard Fillmore, the third-party candidate, as the opponent. Fillmore is more famous, having served as the 13th president (although he never won an election; he became president when Taylor died in office). But Fillmore only got 8 electoral votes in 1856 while Fremont got 114, so I will stick with Fremont in *my* list of losers. \n", + "The xkcd regex incorrectly matches `\"fremont\"`, representing John C. Frémont, the Republican candidate who lost to James Buchanan in 1856. 
Could Randall Munroe have made an error? Is someone **[wrong](http://xkcd.com/386/)** on the Internet? Investigating the [1856 election](http://en.wikipedia.org/wiki/United_States_presidential_election,_1856), I see that Randall must have used Millard Fillmore, the third-party candidate, as the opponent. Fillmore is more famous, having served as the 13th president (although he never won an election; he became president when Taylor died in office). But Fillmore only got 8 electoral votes in 1856 while Fremont got 114, so I will stick with Fremont in *my* list of losers. \n", "\n", - "We can verify that Randall got it right under the interpretation that Fillmore, not Fremont, was the loser: " + "Under the interpretation that Fillmore, not Fremont, was the loser, Randall's regex makes no mistakes:" ] }, { @@ -255,8 +234,6 @@ "execution_count": 6, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -277,14 +254,13 @@ "source": [ "alternative_losers = {'fillmore'} | losers - {'fremont'}\n", "\n", - "verify(xkcd, winners, alternative_losers)" + "not mistakes(xkcd, winners, alternative_losers)" ] }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -303,6 +279,7 @@ "- Set cover is an NP-hard problem, so I feel justified in using an approximation approach that finds a small but not necessarily smallest solution.\n", "- For many NP-hard problems a good approximation can be had with a [greedy algorithm](http://en.wikipedia.org/wiki/Greedy_algorithm): Pick the \"best\" part first (the one that covers the most winners with the fewest characters), and repeat, choosing the \"best\" each time until there are no more winners to cover.\n", "- To guarantee that we will find a solution, make sure that each winner has at least one part that matches it.\n", + "- The quality (shortness) of our solution depends on 
generating good parts and on searching over them well.\n", "\n", "There are three ways this strategy can fail to find the shortest possible regex:\n", "\n", @@ -310,60 +287,6 @@ "- The shortest regex might be a disjunction formed with different parts. For example, `\"[rn]t\"` is not in our pool of parts.\n", "- The greedy algorithm isn't guaranteed to find the shortest solution. We might have all the right parts, but pick the wrong ones.\n", "\n", - "The algorithm is below. Our pool of parts is a set of strings created with `regex_parts(winners, losers)`. We accumulate parts into the list `solution`, which starts empty. On each iteration choose the best part: the one with a maximum score. (I decided by default to score 4 points for each winner matched, minus one point for each character in the part.) We then add the best part to `solution`, and remove from winners all the strings that are matched by `best`. Finally, we update the pool, keeping only those parts that still match one or more of the remaining winners. When there are no more winners left, OR together all the solution parts to give the final regular expression string." 
- ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "button": false, - "collapsed": false, - "deletable": true, - "new_sheet": false, - "run_control": { - "read_only": false - } - }, - "outputs": [], - "source": [ - "def findregex(winners, losers, k=4):\n", - " \"Find a regex that matches all winners but no losers (sets of strings).\"\n", - " # Make a pool of regex parts, then pick from them to cover winners.\n", - " # On each iteration, add the 'best' part to 'solution',\n", - " # remove winners covered by best, and keep in 'pool' only parts\n", - " # that still match some winner.\n", - " pool = regex_parts(winners, losers)\n", - " solution = []\n", - " def score(part): return k * len(matches(part, winners)) - len(part)\n", - " while winners:\n", - " best = max(pool, key=score)\n", - " solution.append(best)\n", - " winners = winners - matches(best, winners)\n", - " pool = {r for r in pool if matches(r, winners)}\n", - " return OR(solution)\n", - "\n", - "def matches(regex, strings):\n", - " \"Return a set of all the strings that are matched by regex.\"\n", - " return {s for s in strings if re.search(regex, s)}\n", - "\n", - "OR = '|'.join # Join a sequence of strings with '|' between them" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "button": false, - "deletable": true, - "new_sheet": false, - "run_control": { - "read_only": false - } - }, - "source": [ - "# Glossary\n", - "\n", - "\n", "Just to be clear, I define the terms I will be using:\n", "\n", "- *winners:* A set of strings; our solution is required to match each of them.\n", @@ -374,31 +297,16 @@ "- *solution:* A regular expression that matches all winners but no losers.\n", "- *whole:* A part that matches a whole word (and nothing else): `'^word$'`\n", "\n", - " \n", - "# Regex Parts\n", + "# Main Algorithm\n", "\n", - "\n", - "\n", - "Now we need to define what the `regex_parts` are. 
Here's what I came up with:\n", - "\n", - "- For each winner, include a regex that matches the entire string exactly. I call this regex a *whole*.\n", - "
Example: for `'word'`, include `'^word$'`\n", - "- For each whole, generate *subparts* consisting of 1 to 4 consecutive characters.\n", - "
Example: `subparts('^it$')` == `{'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}`\n", - "- For each subpart, generate all ways to replace any of the letters with a dot (the \"match any\" character).\n", - "
Example: `dotify('it')` == `{'it', 'i.', '.t', '..'}`\n", - "- Keep only the dotified subparts that do not match any of the losers.\n", - "\n", - "Note that I only used a few of the regular expression mechanisms: `'.'`, `'^'`, and `'$'`. I didn't try to use character classes (`[a-z]`), nor any of the repetition operators, nor other advanced mechanisms. Why? I thought that the advanced features usually take too many characters. For example, I don't allow the part `'[rn]t'`, but I can achieve the same effect with the same number of characters by combining two parts: `'rt|nt'`. I could add more complicated mechanisms later, but for now, [YAGNI](https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it). Here is the code:" + "The algorithm is below. Our pool of parts is a set of strings created with `regex_parts(winners, losers)`. We accumulate parts into the list `solution`, which starts empty. On each iteration choose the best part: the one with a maximum score. (I decided by default to score 4 points for each winner matched, minus one point for each character in the part.) We then add the best part to `solution`, and remove from winners all the strings that are matched by `best`. Finally, we update the pool, keeping only those parts that still match one or more of the remaining winners. When there are no more winners left, OR together all the solution parts to give the final regular expression string." 
] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -406,38 +314,101 @@ }, "outputs": [], "source": [ - "def regex_parts(winners, losers):\n", - " \"Return parts that match at least one winner, but no loser.\"\n", - " wholes = {'^' + w + '$' for w in winners}\n", - " parts = {d for w in wholes for p in subparts(w) for d in dotify(p)}\n", - " return wholes | {p for p in parts if not matches(p, losers)}\n", + "def findregex(winners, losers, k=4) -> str:\n", + " \"\"\"Find a regex that matches all winners but no losers (sets of strings).\"\"\"\n", + " # Make a pool of regex parts, then pick from them to cover winners.\n", + " # On each iteration, add the 'best' part to 'solution',\n", + " # remove winners covered by best, and keep in 'pool' only parts\n", + " # that still match some winner.\n", + " pool = regex_parts(winners, losers)\n", + " solution = []\n", + " def score(part: str) -> int: return k * len(matches(part, winners)) - len(part)\n", + " while winners:\n", + " best = max(pool, key=score)\n", + " solution.append(best)\n", + " winners = winners - matches(best, winners)\n", + " pool = {r for r in pool if matches(r, winners) and r != best}\n", + " return OR(solution)\n", "\n", - "def subparts(word, N=4):\n", - " \"Return a set of subparts of word: consecutive characters up to length N (default 4).\"\n", - " return set(word[i:i+n+1] for i in range(len(word)) for n in range(N)) \n", - " \n", - "def dotify(part):\n", - " \"Return all ways to replace a subset of chars in part with '.'.\"\n", - " choices = map(replacements, part)\n", - " return {cat(chars) for chars in itertools.product(*choices)}\n", + "def matches(regex, strings: set[str]) -> set[str]:\n", + " \"\"\"Return a set of all the strings that are matched by regex.\"\"\"\n", + " return {s for s in strings if re.search(regex, s)}\n", "\n", - "def 
replacements(c): return c if c in '^$' else c + '.'\n", - "\n", - "cat = ''.join" + "OR = '|'.join # Join a sequence of strings with '|' between them" ] }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ - "Our program is complete! We can run `findregex`, verify the solution, and compare the length of our solution to Randall's:" + "# Regex Parts\n", + "\n", + "Now we need to define what the `regex_parts` are. Here's what I came up with:\n", + "\n", + "- For each winner, include a regex that matches the entire string exactly. I call this regex a *whole*.\n", + " - Example: for `'word'`, include `'^word$'`\n", + " - This way, we know there is at least one part that will match each winner.\n", + "- For each whole, generate *subparts* consisting of 1 to 4 consecutive characters.\n", + " - Example: `subparts('^it$')` == `{'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}`\n", + "- For each subpart, generate all ways to replace any of the letters with a dot (the \"match any\" character).\n", + " - Example: `dotify('it')` == `{'it', 'i.', '.t', '..'}`\n", + "- Keep only the dotified subparts that do not match any of the losers.\n", + "\n", + "Note that I only used a few of the regular expression mechanisms: `'.'`, `'^'`, and `'$'` (all joined with `|`). I didn't try to use character classes (`[a-z]`), nor any of the repetition operators, nor other advanced mechanisms. Why? I thought that the advanced features usually take too many characters. For example, I don't allow the part `'[rn]t'`, but I can achieve the same effect with the same number of characters by combining two parts: `'rt|nt'`. I could add more complicated mechanisms later, but for now, [YAGNI](https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it). 
Here is the code:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "button": false, + "new_sheet": false, + "run_control": { + "read_only": false + } + }, + "outputs": [], + "source": [ + "def regex_parts(winners, losers) -> set[str]:\n", + " \"\"\"Return parts that match at least one winner, but no loser.\"\"\"\n", + " wholes = {'^' + w + '$' for w in winners}\n", + " sub_parts = union(map(subparts, wholes))\n", + " parts = union(map(dotify, sub_parts))\n", + " return wholes | {p for p in parts if not matches(p, losers)}\n", + "\n", + "def subparts(word: str, N=4) -> set[str]:\n", + " \"\"\"Return a set of subparts of word: consecutive characters up to length N (default 4).\"\"\"\n", + " return {word[i:i+n+1] for i in range(len(word)) for n in range(N)}\n", + " \n", + "def dotify(part: str) -> set[str]:\n", + " \"\"\"Return all ways to replace a subset of chars in `part` with '.'.\"\"\"\n", + " choices = [(c if c in '^$' else c + '.') for c in part]\n", + " return {cat(chars) for chars in itertools.product(*choices)}\n", + "\n", + "def union(collections) -> set: \n", + " \"\"\"The union of some sets.\"\"\"\n", + " return set().union(*collections)\n", + "\n", + "cat = ''.join # cat(strs: Collection[str]) -> str: concatenate strings " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "button": false, + "new_sheet": false, + "run_control": { + "read_only": false + } + }, + "source": [ + "Our program is complete! To better view the output I define `report` to either take a solution as input or call `findregex` to create a solution, and then to verify the solution, and print some stats: the number of characters in the solution, the number of parts, the *competitive ratio* (the ratio between the lengths of a trivial solution and the actual solution), and the number of winners and losers. 
Let's try it:" ] }, { @@ -445,30 +416,22 @@ "execution_count": 9, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, - "outputs": [ - { - "data": { - "text/plain": [ - "(53, 'a.a|i..n|j|oo|a.t|i..o|a..i|bu|n.e|ay.|r.e$|po|ma|nd$')" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "solution = findregex(winners, losers)\n", - "verify(solution, winners, losers)\n", - "\n", - "len(solution), solution" + "def report(winners, losers, solution=None):\n", + " \"\"\"Find a regex to match winners but not losers. Print summary.\"\"\"\n", + " solution = solution or findregex(winners, losers)\n", + " assert not mistakes(solution, winners, losers)\n", + " N = len(solution)\n", + " T = len('^(' + OR(winners) + ')$')\n", + " print(f'Characters: {N}, Parts: {solution.count(\"|\") + 1}, Competitive ratio: '\n", + " f'{T / N:.1f}, Winners: {len(winners)}, Losers: {len(losers)}')\n", + " return solution" ] }, { @@ -476,18 +439,23 @@ "execution_count": 10, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 54, Parts: 15, Competitive ratio: 5.1, Winners: 35, Losers: 34\n" + ] + }, { "data": { "text/plain": [ - "(63, 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls')" + "'a.a|a..i|j|li|a.t|i..n|oo|bu|ay.|n.e|r.e$|ru|ls|po|v.l'" ] }, "execution_count": 10, @@ -496,34 +464,97 @@ } ], "source": [ - "len(xkcd), xkcd" + "report(winners, losers)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 54, Parts: 15, Competitive ratio: 5.1, Winners: 35, Losers: 34\n" + ] + }, + { + "data": { + "text/plain": [ + 
"'a.a|a..i|j|li|a.t|i..n|oo|bu|ay.|n.e|r.e$|ru|ls|po|v.l'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report(winners, alternative_losers)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We get the same regex, regardless of which set of losers we use. How does that compare to Randall's solution?" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 63, Parts: 13, Competitive ratio: 4.4, Winners: 35, Losers: 34\n" + ] + }, + { + "data": { + "text/plain": [ + "'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report(winners, alternative_losers, solution=xkcd)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our regex is shorter than Randall's by 10 characters." 
] }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ - "Our regex is 15% shorter than Randall's—success!\n", - "\n", - "# Tests\n", + "# Test Suite\n", "\n", "Here's a test suite to give us more confidence in (and familiarity with) our functions:" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 13, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -536,7 +567,7 @@ "'tests pass'" ] }, - "execution_count": 11, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -565,6 +596,7 @@ " \n", " assert OR(['a', 'b', 'c']) == 'a|b|c'\n", " assert OR(['a']) == 'a'\n", + " assert OR([]) == ''\n", " \n", " assert words('this is a test this is') == {'this', 'is', 'a', 'test'}\n", " \n", @@ -586,83 +618,25 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ - "# Regex Golf with Arbitrary Lists\n", + "\n", "\n", - "Let's move on to arbitrary lists. I define `report`, to call `findregex`, verify the solution, and print the number of characters in the solution, the number of parts, the *competitive ratio* (the ratio between the lengths of a trivial solution and the actual solution), and the number of winners and losers." + "# Part 2: A Program that Plays Regex Golf with Arbitrary Lists\n", + "\n", + "\n", + "Let's move on to arbitrary lists. 
Two arbitrary lists are the top 10 [boys and girls names](http://www.ssa.gov/oact/babynames/) for 2012:" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 14, "metadata": { "button": false, - "collapsed": false, - "deletable": true, - "new_sheet": false, - "run_control": { - "read_only": false - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Characters: 53, Parts: 14, Competitive ratio: 5.2, Winners: 35, Losers: 34\n" - ] - }, - { - "data": { - "text/plain": [ - "'a.a|i..n|j|oo|a.t|i..o|a..i|bu|n.e|ay.|r.e$|po|ma|nd$'" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "def report(winners, losers):\n", - " \"Find a regex to match A but not B, and vice-versa. Print summary.\"\n", - " solution = findregex(winners, losers)\n", - " verify(solution, winners, losers)\n", - " trivial = '^(' + OR(winners) + ')$'\n", - " print('Characters: {}, Parts: {}, Competitive ratio: {:.1f}, Winners: {}, Losers: {}'.format(\n", - " len(solution), solution.count('|') + 1, len(trivial) / len(solution) , len(winners), len(losers)))\n", - " return solution\n", - "\n", - "report(winners, losers)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "button": false, - "deletable": true, - "new_sheet": false, - "run_control": { - "read_only": false - } - }, - "source": [ - "The top 10 [boys and girls names](http://www.ssa.gov/oact/babynames/) for 2012:" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -679,10 +653,10 @@ { "data": { "text/plain": [ - "'e.$|a.$|a.o'" + "'a.$|e.$|a.o'" ] }, - "execution_count": 13, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } @@ -698,7 +672,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { 
"read_only": false @@ -710,56 +683,76 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 15, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 10, Parts: 2, Competitive ratio: 7.0, Winners: 10, Losers: 10\n" + ] + }, { "data": { "text/plain": [ - "True" + "'[ae].(o|$)'" ] }, - "execution_count": 14, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "verify('[ae].(o|$)', boys, girls)" + "report(boys, girls, solution='[ae].(o|$)')" ] }, { "cell_type": "markdown", - "metadata": { - "button": false, - "deletable": true, - "new_sheet": false, - "run_control": { - "read_only": false - } - }, + "metadata": {}, "source": [ - "\n", - "\n", - "We have now fulfilled panel two of the strip. Let's try another example, separating \n", - "the top ten best-selling drugs from the top 10 cities to visit:" + "Of course we can always go the other way:" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 10, Parts: 3, Competitive ratio: 7.1, Winners: 10, Losers: 10\n" + ] + }, + { + "data": { + "text/plain": [ + "'a$|^..i|is'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report(girls, boys)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -776,10 +769,10 @@ { "data": { "text/plain": [ - "'o.$|x|ir|q|f|po'" + "'o.$|x|ir|q|f|og'" ] }, - "execution_count": 15, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } @@ -791,11 +784,74 @@ "report(drugs, cities)" ] }, + { + "cell_type": 
"code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 11, Parts: 4, Competitive ratio: 7.6, Winners: 10, Losers: 10\n" + ] + }, + { + "data": { + "text/plain": [ + "'ri|an|ca|de'" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report(cities, drugs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's try [separating the sheeps from the goats](https://www.biblegateway.com/passage/?search=Matthew%2025%3A31-46&version=NIV):" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 16, Parts: 5, Competitive ratio: 5.2, Winners: 10, Losers: 10\n" + ] + }, + { + "data": { + "text/plain": [ + "'^..r|c.|es|.m|k$'" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sheep = words('merino leicester lincoln dorset dorper hampshire suffolk barbados jacob friesian')\n", + "goats = words('alpine nubian toffenburg oberhasli sable spanish boer kalahari kiko myotonic')\n", + "\n", + "report(sheep, goats)" + ] + }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -804,16 +860,16 @@ "source": [ "\n", "\n", - "We can answer the challenge from panel one of the strip:" + "# Star (Wars|Trek)\n", + "\n", + "The challenge from the first panel of the strip can be answered with this code:" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 20, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -830,10 +886,10 @@ { "data": { "text/plain": [ - "' T|E.P|OP'" + "' T|E.P|PE'" ] }, - "execution_count": 16, + "execution_count": 20, "metadata": {}, 
"output_type": "execute_result" } @@ -855,7 +911,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -867,44 +922,48 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 21, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 10, Parts: 3, Competitive ratio: 11.7, Winners: 6, Losers: 9\n" + ] + }, { "data": { "text/plain": [ - "True" + "'M | [TN]|B'" ] }, - "execution_count": 17, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "verify('M | [TN]|B', starwars, startrek)" + "report(starwars, startrek, solution='M | [TN]|B')" ] }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ - "**Update (Nov 2015): There are two new movies in the works!**\n", + "## Update (Nov 2015): There are two new movies in the works!\n", "\n", "
\n", "\n", @@ -917,24 +976,29 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 22, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Characters: 10, Parts: 3, Competitive ratio: 13.5, Winners: 7, Losers: 10\n" + ] + }, { "data": { "text/plain": [ - "' T|CE| ..P'" + "' T|KE|E..H'" ] }, - "execution_count": 18, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } @@ -943,14 +1007,13 @@ "starwars.add('THE FORCE AWAKENS')\n", "startrek.add('BEYOND')\n", "\n", - "findregex(starwars, startrek)" + "report(starwars, startrek)" ] }, { "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -964,11 +1027,9 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 23, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -985,10 +1046,10 @@ { "data": { "text/plain": [ - "'foo'" + "'f.o'" ] }, - "execution_count": 19, + "execution_count": 23, "metadata": {}, "output_type": "execute_result" } @@ -1009,25 +1070,22 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ - "The answer varies with different runs; sometimes it is `'foo'` and sometimes `'f.o'`. Both have 3 characters, but `'f.o'` is smaller in terms of the total amount of ink/pixels needed to render it. (How can the answer vary, when there are no calls to any `random` function? Because when `max` iterates over a set and several elements have the same best score, it is *unspecified* which one will be selected.)\n", + "The answer varies with different runs; sometimes it is `'foo'` and sometimes `'f.o'`. 
(Both have 3 characters, but `'f.o'` is smaller in terms of the total amount of ink/pixels needed to render it.) You may ask how the answer can vary when the program has no calls to any `random` function. The answer is that the program uses `max` to iterate over a set, and the order of elements in a set is *unspecified* by Python; for a set of strings it depends on hash values that Python randomizes per process by default, so it can vary from run to run. \n", "\n", - "Of course, we can run any of these examples in the other direction:" + "This is an example that is designed to be much harder in the other direction:" ] }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 24, "metadata": { "button": false, - "collapsed": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -1044,10 +1102,10 @@ { "data": { "text/plain": [ - "'r..$|k|.m|...n|ld|la|dg|or'" + "'r..$|k|.m|...n|la|ld|es|dg'" ] }, - "execution_count": 20, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -1060,7 +1118,6 @@ "cell_type": "markdown", "metadata": { "button": false, - "deletable": true, "new_sheet": false, "run_control": { "read_only": false @@ -1074,9 +1131,9 @@ "- Stop here and declare victory! *Yay!*\n", "- Try to make the program faster and capable of finding shorter regexes. \n", "\n", - "My first inclination was \"stop here\", and that's what this notebook will shortly do. But several correspondents offered very interesting suggestions, so I returned to the problem in [a second notebook](http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313-part2.ipynb?create=1). \n", + "My first inclination was \"stop here\", and that's what this notebook will shortly do. But several correspondents offered very interesting suggestions, so I returned to the problem in **[a second notebook](http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313-part2.ipynb?create=1)**. 
\n", "\n", - "I was asked whether Randall was **[wrong](http://xkcd.com/386/)** to come up with \"only\" a 10-character Star Wars regex, whereas I showed there is a 9-character version. I would say that, given his role as a cartoonist, author, public speaker, educator, and entertainer, he has [chosen ... wisely](https://www.youtube.com/watch?v=puo1Enh9h5k&feature=youtu.be&t=294). He wrote a program that was good enough to allow him to make a great webcomic. A 9-character regex would not improve the comic. Randall stated that he used a genetic algorithm to find his regexes, and it has been said that genetic algorithms are often the second (or was it the third?) best method to solve any problem, and that's all he needed. But if you consider that in addition to all those roles, Randall is also still a practicing computer scientist, you could say\n", + "I was asked whether Randall was **[wrong](http://xkcd.com/386/)** to come up with \"only\" a 10-character Star Wars regex, whereas I showed there is a 9-character version. I would say that, given his role as a cartoonist, author, public speaker, educator, and entertainer, he has [chosen ... wisely](https://www.youtube.com/watch?v=-_IlNbsILLE). He wrote a program that was good enough to allow him to make a great webcomic. A 9-character regex would not improve the comic. Randall stated that he used a genetic algorithm to find his regexes, and it has been said that genetic algorithms are often the second (or was it the third?) best method to solve any problem, and that's all he needed. But if you consider that in addition to all those roles, Randall is also still a practicing computer scientist, you could say\n", "[he chose ... poorly](https://www.youtube.com/watch?v=Ubw5N8iVDHI). Genetic algorithms are good when you want to combine the structure of two solutions to yield a better solution, so they would work well if the best regexes had a complicated tree structure. But they don't! 
The best solutions are disjunctions of small parts. So the genetic algorithm is trying to combine the first half of one disjunction with the second half of another—but that isn't useful, because the components of a disjunction are unordered; imposing an ordering on them doesn't help.\n", "\n", "# Summary\n", @@ -1097,19 +1154,15 @@ "# Thanks!\n", "\n", "\n", - "Thanks especially to [Randall Munroe](http://xkcd.com/) for inspiring me to do this, to [regex.alf.nu](http://regex.alf.nu) for inspiring Randall, to Sean Lip for correcting \"Wilkie\" to \"Willkie,\" and to [Davide Canton](https://plus.sandbox.google.com/108324296451294887432/posts), [Thomas Breuel](https://plus.google.com/118190679520611168174/posts), and [Stefan Pochmann](http://www.stefan-pochmann.info/spocc/) for providing suggestions to improve my code.\n", - "\n", - "\n", - "\n", - "
\n" + "Thanks especially to [Randall Munroe](http://xkcd.com/) for inspiring me to do this, to [regex.alf.nu](http://regex.alf.nu) for inspiring Randall, to Sean Lip for correcting \"Wilkie\" to \"Willkie,\" and to [Davide Canton](https://plus.sandbox.google.com/108324296451294887432/posts), [Thomas Breuel](https://plus.google.com/118190679520611168174/posts), and [Stefan Pochmann](http://www.stefan-pochmann.info/spocc/) for providing suggestions to improve my code; see my **[second notebook](http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313-part2.ipynb?create=1)**.\n" ] - + } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python [conda env:base] *", "language": "python", - "name": "python3" + "name": "conda-base-py" }, "language_info": { "codemirror_mode": { @@ -1121,9 +1174,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.5.0" + "version": "3.13.9" } }, "nbformat": 4, - "nbformat_minor": 0 + "nbformat_minor": 4 }