\n",
"\n",
- "# Dice Baseball\n",
+ "# Baseball Simulation\n",
"\n",
- "The [538 Riddler for March 22, 2019](https://fivethirtyeight.com/features/can-you-turn-americas-pastime-into-a-game-of-yahtzee/) asks us to simulate baseball using probabilities from a 19th century dice game called *Our National Ball Game*:\n",
- "\n",
- " 1,1: double 2,2: strike 3,3: out at 1st 4,4: fly out\n",
- " 1,2: single 2,3: strike 3,4: out at 1st 4,5: fly out\n",
- " 1,3: single 2,4: strike 3,5: out at 1st 4,6: fly out\n",
- " 1,4: single 2,5: strike 3,6: out at 1st 5,5: double play\n",
- " 1,5: base on error 2,6: foul out 5,6: triple\n",
- " 1,6: base on balls 6,6: home run\n",
+ "The [538 Riddler for March 22, 2019](https://fivethirtyeight.com/features/can-you-turn-americas-pastime-into-a-game-of-yahtzee/) asks us to simulate baseball using probabilities from a 19th-century dice game called *Our National Ball Game*. The Riddler's description of the rules said *you can assume some standard baseball things* but left many details unspecified, so I [looked up](http://baseballgames.dreamhosters.com/BbDiceHome.htm) the original rules of *Our National Ball Game*, shown below. It turns out that they contradict some of the *standard baseball things* assumed by 538, so I'll go with the original rules as stated below.\n",
"\n",
"\n",
- "The rules left some things unspecified; the following are my current choices (in an early version I made different choices that resulted in slightly more runs):\n",
+ "|RULES FOR PLAYING \"OUR NATIONAL BALL GAME\"|DICE ROLL OUTCOMES|\n",
+ "|-----|-----|\n",
+ "|  |  |\n",
"\n",
- "* On a* b*-base hit, runners advance* b* bases, except that a runner on second scores on a 1-base hit.\n",
- "* On an \"out at first\", all runners advance one base.\n",
- "* A double play only applies if there is a runner on first; in that case other runners advance.\n",
- "* On a fly out, a runner on third scores; other runners do not advance.\n",
- "* On an error all runners advance one base. \n",
- "* On a base on balls, only forced runners advance.\n",
+ "# Design Choices\n",
"\n",
- "I also made some choices about the implementation:\n",
"\n",
- "- Exactly one outcome happens to each batter. We call that an *event*.\n",
- "- I'll represent events with the following one letter codes:\n",
- " - `K`, `O`, `o`, `f`, `D`: strikeout, foul out, out at first, fly out, double play\n",
- " - `1`, `2`, `3`, `4`: single, double, triple, home run\n",
- " - `E`, `B`: error, base on balls\n",
- "- Note the \"strike\" dice roll is not an event; it is only part of an event. From the probability of a \"strike\" dice roll, I compute the probability of three strikes in a row, and call that a strikeout event. Sice there are 7 dice rolls giving \"strike\", the probability of a strike is 7/36, and the probability of a strikeout is (7/36)**3.\n",
- "- Note that a die roll such as `1,1` is a 1/36 event, whereas `1,2` is a 2/36 event, because it also represents (2, 1).\n",
- "- I'll keep track of runners with a list of occupied bases; `runners = [1, 2]` means runners on first and second.\n",
- "- A runner who advances to base 4 or higher has scored a run (unless there are already 3 outs).\n",
- "- The function `inning` simulates a half inning and returns the number of runs scored.\n",
- "- I want to be able to test `inning` by feeding it specific events, and I also want to generate random innings. So I'll make the interface be that I pass in an *iterable* of events. The function `event_stream` generates an endless stream of randomly sampled events.\n",
- "- Note that it is consider good Pythonic style to automatically convert Booleans to integers, so for a runner on second (`r = 2`) when the event is a single (`e = '1'`), the expression `r + int(e) + (r == 2)` evaluates to `2 + 1 + 1` or `4`, meaning the runner on second scores.\n",
- "- I'll play 1 million innings and store the resulting scores in `innings`.\n",
- "- To simulate a game I just sample 9 elements of `innings` and sum them.\n",
+ "- Exactly one thing happens to each batter. I'll call that an **event**.\n",
+ "- To clarify: the dice roll `1,1` has probability 1/36, whereas `1,2` has probability 2/36, because it also represents `2,1`.\n",
+ "- The \"One Strike\" dice roll is not an event; it is only *part* of an event. From the probability of a \"One Strike\" dice roll, 7/36, I compute the probability of three strikes in a row, `(7/36)**3 ≈ 0.00735`, and call that a strikeout event.\n",
+ "- I'll represent events with the following 11 one-letter **event codes**:\n",
+ " - `1`, `2`, `3`, `4`: one-, two-, three-, and four-base (home run) hits. Runners advance the same number of bases.\n",
+ " - `B`: base on balls. Runners advance only if forced.\n",
+ " - `D`: double play. Batter and runner nearest home are out; others advance one base.\n",
+ " - `E`: error. Batter reaches first and all runners advance one base.\n",
+ " - `F`, `K`, `O`: fly out, strikeout, foul out. Batter is out, runners do not advance.\n",
+ " - `S`: called \"out at first\" in rules, but actually a sacrifice. Batter is out, runners advance one base.\n",
"\n",
- "# The Code"
+ "\n",
+ "# Implementation"
]
},
{
@@ -53,9 +39,11 @@
"metadata": {},
"outputs": [],
"source": [
- "%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
- "import random"
+ "import random\n",
+ "from statistics import mean, stdev\n",
+ "from collections import Counter\n",
+ "from itertools import islice"
]
},
{
@@ -64,85 +52,82 @@
"metadata": {},
"outputs": [],
"source": [
- "def event_stream(events='2111111EEBBOOooooooofffffD334', strike=7/36):\n",
- " \"An iterator of random events. Defaults from `Our National Ball Game`.\"\n",
- " while True:\n",
- " yield 'K' if (random.random() < strike ** 3) else random.choice(events)\n",
- " \n",
- "def inning(events=event_stream(), verbose=False) -> int:\n",
- " \"Simulate a half inning based on events, and return number of runs scored.\"\n",
- " outs = runs = 0 # Inning starts with no outs and no runs,\n",
- " runners = [] # ... and with nobody on base\n",
- " for e in events:\n",
- " if verbose: print(f'{outs} outs, {runs} runs, event: {e}, runners: {runners}')\n",
- " # What happens to the batter?\n",
- " if e in 'KOofD': outs += 1 # Batter is out\n",
- " elif e in '1234EB': runners.append(0) # Batter becomes a runner\n",
- " # What happens to the runners?\n",
- " if e == 'D' and 1 in runners: # double play: runner on 1st out, others advance\n",
- " outs += 1\n",
- " runners = [r + 1 for r in runners if r != 1]\n",
- " elif e in 'oE': # out at first or error: runners advance\n",
- " runners = [r + 1 for r in runners]\n",
- " elif e == 'f' and 3 in runners and outs < 3: # fly out: runner on 3rd scores\n",
- " runners.remove(3)\n",
- " runs += 1\n",
- " elif e in '1234': # single, double, triple, homer\n",
- " runners = [r + int(e) + (r == 2) for r in runners]\n",
- " elif e == 'B': # base on balls: forced runners advance \n",
- " runners = [r + forced(runners, r) for r in runners]\n",
- " # See if inning is over, and if not, whether anyone scored\n",
- " if outs >= 3:\n",
- " return runs\n",
- " runs += sum(r >= 4 for r in runners)\n",
- " runners = [r for r in runners if r < 4]\n",
- " \n",
- "def forced(runners, r) -> bool: return all(b in runners for b in range(r))"
+ "event_codes = {\n",
+ " '1': 'single', '2': 'double', '3': 'triple', '4': 'home run',\n",
+ " 'B': 'base on balls', 'D': 'double play', 'E': 'error',\n",
+ " 'F': 'fly out', 'K': 'strikeout', 'O': 'foul out', 'S': 'out at first'}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Testing\n",
+ "I'll define the function `inning` to simulate a half inning and return the number of runs scored. Design choices for `inning`:\n",
"\n",
- "Let's peek at some random innings:"
+ "- I'll keep track of runners with a set of occupied bases; `runners = {1, 3}` means runners on first and third.\n",
+ "- I'll keep track of the number of `runs` and `outs` in an inning, and return the number of `runs` when there are three `outs`.\n",
+ "- Each event follows four steps. If `runners = {1, 3}` and the event is `'2'` (a double), then the steps are:\n",
+ " - The batter steps up to the plate. The plate is represented as base `0`, so now `runners = {0, 1, 3}`.\n",
+ " - Check if the event causes runner(s) to be out, and if the inning is over. In this case, no.\n",
+ " - Advance each runner according to `advance(r, e)`. In this case, `runners = {2, 3, 5}`.\n",
+ " - Remove the runners who have `scored` and increment `runs` accordingly. In this case, runner `5` has scored, so we increment `runs` by 1 and end up with `runners = {2, 3}`.\n",
+ "- I want `inning` to be easily **testable**: I want to say `assert 2 == inning('1KO4F')`.\n",
+ "- I also want `inning` to be capable of simulating many independent random innings. So the interface is to accept an *iterable* of event codes. That could be a string, or a generator, as provided by `event_stream()`.\n",
+ "- I want `inning` to be **loggable**: calling `inning(events, verbose=True)` should produce printed output for each event.\n",
+ "- `advance(r, e)` says that a runner advances `e` bases on an `e` base hit; one base on an error, sacrifice, or double play; and one base on a base on balls only if forced.\n",
+ "- A runner on base `r` is `forced` if all the lower-numbered bases have runners.\n",
+ "- `ONBG` is defined as a generator of random events with the probabilities from \"Our National Ball Game\"."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "0 outs, 0 runs, event: E, runners: []\n",
- "0 outs, 0 runs, event: 4, runners: [1]\n",
- "0 outs, 2 runs, event: E, runners: []\n",
- "0 outs, 2 runs, event: 1, runners: [1]\n",
- "0 outs, 2 runs, event: f, runners: [2, 1]\n",
- "1 outs, 2 runs, event: B, runners: [2, 1]\n",
- "1 outs, 2 runs, event: 1, runners: [3, 2, 1]\n",
- "1 outs, 4 runs, event: E, runners: [2, 1]\n",
- "1 outs, 4 runs, event: o, runners: [3, 2, 1]\n",
- "2 outs, 5 runs, event: o, runners: [3, 2]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "5"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "outputs": [],
"source": [
- "inning(verbose=True)"
+ "def inning(events, verbose=True) -> int:\n",
+ "    \"\"\"Simulate a half inning based on events, and return number of runs scored.\"\"\"\n",
+ "    outs = runs = 0  # Inning starts with no outs and no runs,\n",
+ "    runners = set()  # and with nobody on base\n",
+ "    def out(r) -> int: runners.remove(r); return 1\n",
+ "    def forced(r) -> bool: return all(b in runners for b in range(r))\n",
+ "    def advance(r, e) -> int: \n",
+ "        return int(e if e in '1234' else (e in 'ESD' or (e == 'B' and forced(r))))\n",
+ "    for e in events:\n",
+ "        if verbose: show(outs, runs, runners, e)\n",
+ "        runners.add(batter)                # Batter steps up to the plate\n",
+ "        if e == 'D' and len(runners) > 1:  # Double play: batter and lead runner out\n",
+ "            outs += out(batter) + out(max(runners))\n",
+ "        elif e in 'DSKOF':                 # Batter is out\n",
+ "            outs += out(batter) \n",
+ "        if outs >= 3:                      # If inning is over: return runs scored\n",
+ "            return runs \n",
+ "        runners = {r + advance(r, e) for r in runners}  # Runners advance\n",
+ "        runs += len(runners & scored)      # Tally runs\n",
+ "        runners = runners - scored         # Remove runners who scored\n",
+ "    \n",
+ "def event_stream(events, strikes=0):\n",
+ "    \"\"\"A generator of random baseball events.\"\"\"\n",
+ "    while True:\n",
+ "        yield 'K' if (random.random() < strikes ** 3) else random.choice(events)\n",
+ "\n",
+ "def show(outs, runs, runners, event):\n",
+ "    \"\"\"Print a representation of the current state of play.\"\"\"\n",
+ "    bases = ''.join(b if int(b) in runners else '-' for b in '321')\n",
+ "    print(f'{outs} outs {runs} runs {bases} {event} ({event_codes[event]})')\n",
+ "    \n",
+ "ONBG = event_stream('2111111EEBBOOSSSSSSSFFFFFD334', 7/36) # Our National Ball Game\n",
+ "batter = 0 # The batter is not yet at first base\n",
+ "scored = {4, 5, 6, 7} # Runners in these positions have scored"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Examples and Tests\n",
+ "\n",
+ "Let's peek at some random innings:"
]
},
{
@@ -154,20 +139,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "0 outs, 0 runs, event: 1, runners: []\n",
- "0 outs, 0 runs, event: B, runners: [1]\n",
- "0 outs, 0 runs, event: O, runners: [2, 1]\n",
- "1 outs, 0 runs, event: 1, runners: [2, 1]\n",
- "1 outs, 1 runs, event: 3, runners: [2, 1]\n",
- "1 outs, 3 runs, event: 1, runners: [3]\n",
- "1 outs, 4 runs, event: f, runners: [1]\n",
- "2 outs, 4 runs, event: o, runners: [1]\n"
+ "0 outs 0 runs --- 3 (triple)\n",
+ "0 outs 0 runs 3-- F (fly out)\n",
+ "1 outs 0 runs 3-- S (out at first)\n",
+ "2 outs 1 runs --- F (fly out)\n"
]
},
{
"data": {
"text/plain": [
- "4"
+ "1"
]
},
"execution_count": 4,
@@ -176,14 +157,7 @@
}
],
"source": [
- "inning(verbose=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And we can feed in any events we want to test the code:"
+ "inning(ONBG)"
]
},
{
@@ -195,22 +169,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "0 outs, 0 runs, event: 2, runners: []\n",
- "0 outs, 0 runs, event: E, runners: [2]\n",
- "0 outs, 0 runs, event: B, runners: [3, 1]\n",
- "0 outs, 0 runs, event: B, runners: [3, 2, 1]\n",
- "0 outs, 1 runs, event: 1, runners: [3, 2, 1]\n",
- "0 outs, 3 runs, event: D, runners: [2, 1]\n",
- "2 outs, 3 runs, event: B, runners: [3]\n",
- "2 outs, 3 runs, event: 1, runners: [3, 1]\n",
- "2 outs, 4 runs, event: 2, runners: [2, 1]\n",
- "2 outs, 5 runs, event: f, runners: [3, 2]\n"
+ "0 outs 0 runs --- F (fly out)\n",
+ "1 outs 0 runs --- S (out at first)\n",
+ "2 outs 0 runs --- S (out at first)\n"
]
},
{
"data": {
"text/plain": [
- "5"
+ "0"
]
},
"execution_count": 5,
@@ -219,36 +186,40 @@
}
],
"source": [
- "inning('2EBB1DB12f', verbose=True)"
+ "inning(ONBG)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "That looks good.\n",
- "\n",
- "# Simulating\n",
- "\n",
- "Now, simulate a million innings, and then sample from them to simulate a million nine-inning games (for one team):"
+ "Let's also test some historic innings. I'll take some of the Red Sox innings from their 2004 playoff series against the Yankees."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- K (strikeout)\n",
+ "1 outs 0 runs --- 2 (double)\n",
+ "1 outs 0 runs -2- O (foul out)\n",
+ "2 outs 0 runs -2- 1 (single)\n",
+ "2 outs 0 runs 3-1 2 (double)\n",
+ "2 outs 1 runs 32- 1 (single)\n",
+ "2 outs 2 runs 3-1 4 (home run)\n",
+ "2 outs 5 runs --- K (strikeout)\n"
+ ]
+ }
+ ],
"source": [
- "N = 1000000\n",
- "innings = [inning() for _ in range(N)]\n",
- "games = [sum(random.sample(innings, 9)) for _ in range(N)]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's see histograms:"
+ "# 7th inning in game 1: 5 runs (Homer by Varitek)\n",
+ "# (But not a perfect reproduction, because our simulation doesn't have passed balls.)\n",
+ "assert 5 == inning('K2O1214K')"
]
},
{
@@ -257,35 +228,90 @@
"metadata": {},
"outputs": [
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- S (out at first)\n",
+ "1 outs 0 runs --- S (out at first)\n",
+ "2 outs 0 runs --- 2 (double)\n",
+ "2 outs 0 runs -2- 1 (single)\n",
+ "2 outs 0 runs 3-1 1 (single)\n",
+ "2 outs 1 runs -21 4 (home run)\n",
+ "2 outs 4 runs --- F (fly out)\n"
+ ]
}
],
"source": [
- "def hist(nums, title): \n",
- " \"Plot a histogram.\"\n",
- " plt.hist(nums, ec='black', bins=max(nums)-min(nums)+1, align='left')\n",
- " plt.title(f'{title} Mean: {sum(nums)/len(nums):.3f}, Min: {min(nums)}, Max: {max(nums)}')\n",
- " \n",
- "hist(innings, 'Runs per inning:')"
+ "# 4th inning in game 6: 4 runs (Homer by Bellhorn)\n",
+ "assert 4 == inning('SS2114F')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- S (out at first)\n",
+ "1 outs 0 runs --- 1 (single)\n",
+ "1 outs 0 runs --1 B (base on balls)\n",
+ "1 outs 0 runs -21 B (base on balls)\n",
+ "1 outs 0 runs 321 4 (home run)\n",
+ "1 outs 4 runs --- B (base on balls)\n",
+ "1 outs 4 runs --1 F (fly out)\n",
+ "2 outs 4 runs --1 S (out at first)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# 2nd inning in game 7: 4 runs (Grand Slam by Damon)\n",
+ "assert 4 == inning('S1BB4BFS')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "That looks good to me.\n",
+ "\n",
+ "# Simulation\n",
+ "\n",
+ "Now, simulate a hundred thousand innings, and then sample from them to simulate a hundred thousand nine-inning games (for one team), and show histograms of the results, labelled with statistics:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def simulate(N=100000, inning=inning, events=ONBG) -> None:\n",
+ "    innings = [inning(events=events, verbose=False) for _ in range(N)]\n",
+ "    games = [sum(random.sample(innings, 9)) for _ in range(N)]\n",
+ "    hist(innings, 'Runs/inning (for one team)')\n",
+ "    hist(games, 'Runs/game (for one team)')\n",
+ "    \n",
+ "def hist(nums, title): \n",
+ "    \"\"\"Plot a histogram and show some statistics.\"\"\"\n",
+ "    plt.hist(nums, ec='black', bins=max(nums)-min(nums), align='left')\n",
+ "    plt.xlabel(title)\n",
+ "    plt.title(f'μ: {mean(nums):.2f}, σ: {stdev(nums):.2f}, max: {max(nums)}')\n",
+ "    plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "scrolled": false
+ },
"outputs": [
{
"data": {
- "image/png": "\n",
+ "image/png": "\n",
"text/plain": [
""
]
@@ -294,10 +320,506 @@
"needs_background": "light"
},
"output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 2.52 s, sys: 16.6 ms, total: 2.54 s\n",
+ "Wall time: 2.58 s\n"
+ ]
}
],
"source": [
- "hist(games, 'Runs per game:')"
+ "%time simulate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So, about 13 runs per game (per team). This shows that the dice game is not very realistic with respect to current-day baseball. It is true that games were higher-scoring 130 years ago, and perhaps a dice game is more fun when there is a lot of action.\n",
+ "\n",
+ "# Real Major League Baseball Stats\n",
+ "\n",
+ "Could I make the game reflect baseball as it is played today? To do so I would need:\n",
+ "1. A source of major league baseball (MLB) statistics.\n",
+ "2. A way to convert those statistics into the format expected by the function `inning`.\n",
+ "3. Possibly some modifications to `inning`, depending on how the conversion goes.\n",
+ "\n",
+ "[Baseball-reference.com](https://www.baseball-reference.com) has lots of stats, in particular \n",
+ "[MLB annual batting stats](https://www.baseball-reference.com/leagues/MLB/bat.shtml) and\n",
+ "[fielding stats](https://www.baseball-reference.com/leagues/MLB/field.shtml); I'll use the stats for the complete 2019 season. The batting stats have most of what we need, and the fielding stats give us double plays and errors.\n",
+ "\n",
+ "I start by defining two utility functions that can be useful for any tabular data: `cell_value`, which converts a table cell entry into an `int`, `float`, or `str` as appropriate; and `header_row_dict`, which creates a dict of `{column_name: value}` entries. The function `mlb_convert` then converts this format (a dict keyed by `H/2B/3B/HR` etc.) into the event code format (a string of `'1234...'`). As part of the conversion I'll add hit-by-pitch (`HBP`) into the \"base on balls\" category, and I'll record all otherwise unaccounted-for outs under the \"fly out\" (`F`) category (runners do not advance). With this understood, we won't need to change the function `inning` at all. (It is true that `mlb_convert` returns a very long string, equal in length to the number of plate appearances over the whole MLB season. But that takes up less space than storing one photo, so I'm not going to worry about it.)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def cell_value(entry, types=(int, float, str)):\n",
+ "    \"\"\"Convert a cell entry into the first type that doesn't raise an error.\"\"\"\n",
+ "    for typ in types:\n",
+ "        try:\n",
+ "            return typ(entry)\n",
+ "        except ValueError:\n",
+ "            pass\n",
+ "    \n",
+ "def header_row_dict(header, row, sep=None, value=cell_value) -> dict:\n",
+ "    \"\"\"Parse a header and table row into a dict of `{column_name: value(cell)}`.\"\"\"\n",
+ "    return dict(zip(header.split(sep), map(value, row.split(sep))))\n",
+ "\n",
+ "def mlb_convert(stats: dict) -> str:\n",
+ "    \"\"\"Given baseball stats return a string '11...FFF'.\"\"\"\n",
+ "    events = Counter({\n",
+ "        '1': stats['H'] - stats['2B'] - stats['3B'] - stats['HR'],\n",
+ "        '2': stats['2B'], '3': stats['3B'], '4': stats['HR'],\n",
+ "        'E': stats['E'], 'B': stats['BB'] + stats['HBP'],\n",
+ "        'K': stats['SO'], 'D': stats['DP'], 'S': stats['SH'] + stats['SF']})\n",
+ "    events['F'] = stats['PA'] - sum(events.values()) # All unaccounted-for outs\n",
+ "    return ''.join(events.elements()) # A str of events"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below I copy-and-paste the data I need from baseball-reference.com to create the dict `mlb_stats`; convert it to the string `mlb_string`; and use that to create the event generator `mlb_stream`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mlb_stats = header_row_dict(\n",
+ " \"Year Tms #Bat BatAge R/G G PA AB R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS TB GDP HBP SH SF IBB E DP\",\n",
+ " \"\"\"2019 30 1284 27.9 4.84 4828 185377 165622 23346 41794 8485 783 6735 22358 2261 827 15806 42546 \n",
+ " .252 .323 .435 .758 72050 3441 1968 774 1146 752 2882 3981\"\"\")\n",
+ "mlb_string = mlb_convert(mlb_stats)\n",
+ "mlb_stream = event_stream(mlb_string)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can take a look:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'Year': 2019,\n",
+ " 'Tms': 30,\n",
+ " '#Bat': 1284,\n",
+ " 'BatAge': 27.9,\n",
+ " 'R/G': 4.84,\n",
+ " 'G': 4828,\n",
+ " 'PA': 185377,\n",
+ " 'AB': 165622,\n",
+ " 'R': 23346,\n",
+ " 'H': 41794,\n",
+ " '2B': 8485,\n",
+ " '3B': 783,\n",
+ " 'HR': 6735,\n",
+ " 'RBI': 22358,\n",
+ " 'SB': 2261,\n",
+ " 'CS': 827,\n",
+ " 'BB': 15806,\n",
+ " 'SO': 42546,\n",
+ " 'BA': 0.252,\n",
+ " 'OBP': 0.323,\n",
+ " 'SLG': 0.435,\n",
+ " 'OPS': 0.758,\n",
+ " 'TB': 72050,\n",
+ " 'GDP': 3441,\n",
+ " 'HBP': 1968,\n",
+ " 'SH': 774,\n",
+ " 'SF': 1146,\n",
+ " 'IBB': 752,\n",
+ " 'E': 2882,\n",
+ " 'DP': 3981}"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mlb_stats"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'111111111111111111111111112222222223444444EEEBBBBBBBBBBBBBBBBBBKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKDDDDSSFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mlb_string[::1000] # Just look at every 1000th character"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- 2 (double)\n",
+ "0 outs 0 runs -2- F (fly out)\n",
+ "1 outs 0 runs -2- F (fly out)\n",
+ "2 outs 0 runs -2- F (fly out)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "inning(mlb_stream)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "I can simulate:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 2.39 s, sys: 21.8 ms, total: 2.41 s\n",
+ "Wall time: 2.46 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%time simulate(events=mlb_stream)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "That looks a *lot* more like real baseball. But MLB averaged 4.84 runs per team per game in 2019, and this is significantly lower. I think we can make some minor changes to the function `inning`—some \"standard baseball things\"—to make the simulation more realistic. I'm thinking of two changes:\n",
+ "- The most common double play eliminates the batter and the runner on first, not the runner closest to home. \n",
+ "- On a single, a runner on second often scores. \n",
+ "\n",
+ "I'll make those two things the case for all double plays and singles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def inning2(events, verbose=False) -> int:\n",
+ "    \"\"\"Simulate a half inning based on events, and return number of runs scored.\"\"\"\n",
+ "    outs = runs = 0  # Inning starts with no outs and no runs,\n",
+ "    runners = set()  # and with nobody on base\n",
+ "    def out(r) -> int: runners.remove(r); return 1\n",
+ "    def forced(r) -> bool: return all(b in runners for b in range(r))\n",
+ "    def advance(r, e) -> int: \n",
+ "        return ((2 if r == 2 else int(e)) if e in '1234' else \n",
+ "                (e in 'ESD' or (e == 'B' and forced(r)))) \n",
+ "    for e in events:\n",
+ "        if verbose: show(outs, runs, runners, e)\n",
+ "        runners.add(batter)            # Batter steps up to the plate\n",
+ "        if e == 'D' and 1 in runners:  # Double play: batter and runner on first out\n",
+ "            outs += out(batter) + out(1)\n",
+ "        elif e in 'DSKOF':             # Batter is out\n",
+ "            outs += out(batter) \n",
+ "        if outs >= 3:                  # If inning is over: return runs scored\n",
+ "            return runs \n",
+ "        runners = {r + advance(r, e) for r in runners}  # Runners advance\n",
+ "        runs += len(runners & scored)  # Tally runs\n",
+ "        runners = runners - scored     # Remove runners who scored"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We show the difference with two examples. First, a triple/walk/double-play sequence scores a run under `inning2` but not `inning`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- 3 (triple)\n",
+ "0 outs 0 runs 3-- B (base on balls)\n",
+ "0 outs 0 runs 3-1 D (double play)\n",
+ "2 outs 1 runs --- K (strikeout)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "inning2('3BDK', True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- 3 (triple)\n",
+ "0 outs 0 runs 3-- B (base on balls)\n",
+ "0 outs 0 runs 3-1 D (double play)\n",
+ "2 outs 0 runs -2- K (strikeout)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "inning('3BDK', True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Second, a double/single sequence scores a run under `inning2` but not `inning`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- 2 (double)\n",
+ "0 outs 0 runs -2- 1 (single)\n",
+ "0 outs 1 runs --1 F (fly out)\n",
+ "1 outs 1 runs --1 F (fly out)\n",
+ "2 outs 1 runs --1 F (fly out)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "inning2('21FFF', True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 outs 0 runs --- 2 (double)\n",
+ "0 outs 0 runs -2- 1 (single)\n",
+ "0 outs 0 runs 3-1 F (fly out)\n",
+ "1 outs 0 runs 3-1 F (fly out)\n",
+ "2 outs 0 runs 3-1 F (fly out)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "inning('21FFF', True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can simulate again and note any differences:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 2.37 s, sys: 17.1 ms, total: 2.39 s\n",
+ "Wall time: 2.41 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%time simulate(events=mlb_stream, inning=inning2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There is a slight increase in the number of runs.\n",
+ "\n",
+ "# Opportunities for Improvement\n",
+ "\n",
+ "There are many problems with the code as it is. For example:\n",
+ "\n",
+ "- It assumes all teams are equal. They're not.\n",
+ "- It assumes all pitchers and defense are equal. They're not.\n",
+ "- It assumes all batters are equal. They're not. (I think this is the main place where we get a shortfall in runs: real lineups cluster their best hitters together, and they are more apt to produce runs than a lineup of all median players.)\n",
+ "- It assumes all hits are the same (runners always advance the same number of bases). They're not.\n",
+ "- There's only one type of double play (batter and runner on first out) and no triple play.\n",
+ "- It ignores stolen bases, pickoffs, passed balls, wild pitches, runners taking extra bases, and runners being out on attempted steals, extra bases, or sacrifices.\n",
+ "- There is no strategy (offense and defense behave the same, regardless of the situation).\n",
+ "- It assumes both teams bat for 9 innings. But if the home team is ahead at the bottom of the 9th, they do not bat, and if the score is tied: extra innings.\n",
+ "- With two outs, or with no runners on base, there can be no sacrifice or double play; those types of events would just be regular outs. The stats say that a double play should occur in 3981 out of 185377 plate appearances, or about 2% of the time. In our simulation the `D` event code would come up that often, but perhaps only half the time would there be a runner on base and fewer than two outs, so we would only actually get a double play maybe 1% of the time.\n",
+ "\n",
+ "\n",
+ "What can you do to make the simulation better?"
]
}
],
@@ -317,7 +839,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.2"
+ "version": "3.7.6"
}
},
"nbformat": 4,