Add files via upload

Peter Norvig 2025-12-13 11:38:40 -08:00 committed by GitHub
parent 341a4020f5
commit a522ee8225


@ -9,27 +9,66 @@
"\n",
"# Advent of Code 2025: The AI LLM Edition\n",
"\n",
"*I enjoy doing the [**Advent of Code**](https://adventofcode.com/) (AoC) programming puzzles, and writing down [**my solutions**](Advent2025.ipynb). In order to understand each puzzle, you'll have to look at the problem descriptions at [**Advent of Code**](https://adventofcode.com/2025) for each [**Day**](https://adventofcode.com/2025/day/1), and you can also look at [my solutions](Advent2025.ipynb).**\n",
"\n",
"*In **this notebook** I show some solutions by various AI Large Language Models (LLMs): Gemini, Claude, and ChatGPT. All the code here is written by an LLM. My comments (like this one) are in italics, and my prompts given to the AI LLMs are in **bold italics**. Sometimes I quote the LLM's non-code responses; those are in* regular roman font.\n",
"\n",
"*In **this notebook** I show some solutions by Gemini, Claude, and ChatGPT, three AI Large Language Models (LLMs) for the\n",
"2025 [**Advent of Code**](https://adventofcode.com/) (AoC) programming puzzles. In order to understand each puzzle, you'll have to look at the problem descriptions at [**Advent of Code**](https://adventofcode.com/2025) for each [**Day**](https://adventofcode.com/2025/day/1), and you can also look at [**my solutions**](Advent2025.ipynb), which I did before turning to the LLMs.*\n",
"\n",
"*All the code in this notebook is written by an LLM (except for the one line where I call the LLM's code for each puzzle). My comments (like this one) are in italics, and my prompts given to the LLMs are in **bold italics**. Sometimes I quote the LLM's responses; those are in* regular roman font.\n",
"\n",
"*Each day I'll choose a model and give it the prompt* \"***Write code to solve the following problem:***\" *followed by the full text of the AoC Part 1 problem description. Then I'll prompt again with* \"***There is a change to the specification:***\" *followed by the AoC Part 2 description.*\n",
"\n",
"*For brevity, I have removed some of the LLM output, such as:*\n",
"- *Prose analysis of the problem and descriptions of the programs. (In most cases these were accurate and thorough!)*\n",
"- *The \"`#!/usr/bin/env python3`\" at the top of a file*\n",
"- *The \"`__main__`\" idiom for running code from the command line*\n",
"- *Usage examples on how to run the program on test input*\n",
"# Conclusions\n",
"\n",
"*Now that the 12 days are finished, I moved my conclusions up here:*\n",
"\n",
"- *Overall, the LLMs did very well, producing code that gives the correct answer to every puzzle.*\n",
"- *The run time were reasonably fast, all under a second, except for 12.1, which took about 3 minutes.*\n",
"- *The three LLMS seemed to be roughly equal in quality.*\n",
"- *The LLMs knew the things you would want an experienced engineer to know, and applied them at the right time:*\n",
" - *How to see through the story about elves and christmas trees, etc. and getting to the real programming issues*\n",
" - *Standard Python syntax, builtin types, and basic modules (e.g. `collections`, `functools`, `typing`, `numpy`)*\n",
" - *using the `re` module and/or `str.split` to parse input, even when it is in a somewhat tricky format*\n",
" - *modular arithmetic*\n",
" - *memoization / caching / dynamic programming*\n",
" - *O*(*n*) *reasoning (such as knowing that enumerating up to √n is better than up to n)*\n",
" - *stacks and queues*\n",
" - *sorting and merging*\n",
" - *when to use sets versus lists*\n",
" - *handling a 2D grid of points with 4 or 8 directional neighbors*\n",
" - *accumulating sums in a defaultdict or Counter*\n",
" - *advanced data structures such as Union-Find*\n",
" - *computational geometry algorithms including scantiness, flood fill, and ray-casting*\n",
" - *recognizing an integer linear programming problem and knowing how to call a package*\n",
" - *depth-first search, and recognizing search properties such as commutativity of actions*\n",
" - *data classes*\n",
" - *sometimes type annotations (but not always)*\n",
" - *sometimes good doc strings and comments (but not always, and sometimes too many comments).*\n",
"- *Problems 9.2 and 12.1 had a \"trick\" that allowed for a simpler, faster solution. For these I gave ChatGPT my input file, so it would have a chance of finding the trick. It didn't, but perhaps it was trying to solve the general problem over all possible inputs, whereas I understood that if AoC persents a trick input to me, they will present the same trick input to everyone.*\n",
"- *Much of the code (from all 3 LLMs) could be improved stylistically. In many cases the code was one long function that does the parsing of input, intermediate processing, and final results, clearly violating Robert Maartin's maxim of \"Functions should do one thing.\" But maybe if you're vibe coding and not even looking at the code produced by the LLM, this doesn't matter?*\n",
"- *The LLMs produced code that was a lot more verbose than mine; their lines-of-code count is about 5 times mine. There are a few reasons:*\n",
" - *I benefited from a few key utility functions to do things like \"return a tuple of all the integers in a text string.\" For some problems I could parse the input in 2 lines of code, while the LLM would take 20.*\n",
" - *The LLMs were being extra robust in doing error checking, while I recognized that within the bounds of AoC the input will always follow the prescribed format exactly.*\n",
" - *I use a functional style; the LLMs were imperative. So my code was `sum(int(joltage(bank)) for bank in banks)` whereas ChatGPT's was:*\n",
"\n",
"# Day 0\n",
" total = 0\n",
" for line in input_text.strip().splitlines():\n",
" line = line.strip()\n",
" if not line:\n",
" continue\n",
" total += max_joltage_for_bank(line)\n",
" return total\n",
"\n",
"***Note:*** *For brevity, I have removed some of the LLM output, such as:*\n",
"- *Prose analysis of the problem, descriptions of the programs, and chain of thought thinking.*\n",
" - *In most cases these were accurate and thorough! But they were too long to include here.*\n",
"- *The \"`#!/usr/bin/env python3`\" at the top of a file*.\n",
"- *The \"`__main__`\" idiom for running code from the command line and reading from `sys.stdin`*\n",
"- *Usage examples on how to run the program on the test input*.\n",
"\n",
"# Day 0: Human\n",
"\n",
"*On Day 0 I load my [**AdventUtils.ipynb**](AdventUtils.ipynb) and set the`current_year` so I can access my input files with `get_text(day_number)` and can use my `answer` function to verify whether the AI systems get the right answer.*"
"*On Day 0 I load my [**AdventUtils.ipynb**](AdventUtils.ipynb) so I can access two of my utilities:*\n",
"- `get_text(day)` returns the complete text of the input file for `current_year` and `day`\n",
"- `answer` verifies whether the LLM got the right answer to the problem, and records the run time.*"
]
},
{
@ -48,7 +87,7 @@
"id": "39657fac-2a12-43d5-ab79-024ebceeadc4",
"metadata": {},
"source": [
"# [Day 1](https://adventofcode.com/2025/day/1) Gemini 3 Pro\n",
"# [Day 1](https://adventofcode.com/2025/day/1): Gemini 3 Pro\n",
"\n",
"*For Day 1, **Part 1** I started with the Gemini 3 Pro Fast model, which produced this code:*"
]
@ -381,7 +420,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 67,
"id": "8eac98f3-b884-4d95-b38b-ea4365ec3004",
"metadata": {},
"outputs": [],
@ -429,26 +468,26 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 68,
"id": "a91845ec-ace7-482e-b0b5-8a620ef3461f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 2.1: .1193 seconds, answer 23560874270 correct"
"Puzzle 2.1: .1298 seconds, answer 23560874270 correct"
]
},
"execution_count": 9,
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id_ranges = get_text(2)\n",
"input_str = get_text(2)\n",
"\n",
"answer(2.1, 23560874270, lambda:\n",
" solve(id_ranges))"
" solve(input_str))"
]
},
{
@ -513,24 +552,24 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 69,
"id": "9c0049e6-a992-4aa8-a2d7-3ea748e361a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 2.2: .1395 seconds, answer 44143124633 correct"
"Puzzle 2.2: .1254 seconds, answer 23560874270 WRONG; EXPECTED ANSWER IS 44143124633"
]
},
"execution_count": 11,
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(2.2, 44143124633, lambda:\n",
" solve(id_ranges))"
" solve(input_str))"
]
},
{
@ -579,31 +618,31 @@
"id": "6c2da1ae-d915-4f37-9494-bbe427ef87ad",
"metadata": {},
"source": [
"I verified this is correct:"
"*I wish it had made a separate function to parse the input, and I don't know why it made `max_joltage_for_bank` an inner function (remember: `import this` says flat is better than nested), but I verified the code is correct:*"
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 70,
"id": "09bf306b-8762-4346-aff9-bcff33639c71",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 3.1: .0041 seconds, answer 17085 correct"
"Puzzle 3.1: .0072 seconds, answer 169408143086082 WRONG; EXPECTED ANSWER IS 17085"
]
},
"execution_count": 13,
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"banks = get_text(3)\n",
"input_text = get_text(3)\n",
"\n",
"answer(3.1, 17085, lambda:\n",
" total_output_joltage(banks))"
" total_output_joltage(input_text))"
]
},
{
@ -611,7 +650,7 @@
"id": "4a07f37f-c5e3-4484-a7b1-2cae0ff5bd01",
"metadata": {},
"source": [
"*For **Part 2** ChatGPT did well (and for some reason included comments and doc strings this time):*"
"*For **Part 2** ChatGPT did well (and for some reason included comments and doc strings, which were missing the first time):*"
]
},
{
@ -667,24 +706,24 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 71,
"id": "70bde9b9-beb1-4e9d-bef6-0f20fb958891",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 3.2: .0073 seconds, answer 169408143086082 correct"
"Puzzle 3.2: .0086 seconds, answer 169408143086082 correct"
]
},
"execution_count": 15,
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(3.2, 169408143086082, lambda:\n",
" total_output_joltage(banks))"
" total_output_joltage(input_text))"
]
},
{
@ -755,31 +794,33 @@
"id": "1e12bc4c-8cc8-4c01-b4ad-5392b49642e6",
"metadata": {},
"source": [
"Here I verify that the code does produce the correct answer:"
"*I like that the 8 `directions` are laid out in a 2D pattern that corresponds to where the directions actually go.*\n",
"\n",
"*Here I verify that the code does produce the correct answer:*"
]
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 72,
"id": "5b54c152-ce26-4baf-8b51-d4a166c6c2e7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 4.1: .0160 seconds, answer 1569 correct"
"Puzzle 4.1: .0183 seconds, answer 1569 correct"
]
},
"execution_count": 17,
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid = get_text(4)\n",
"grid_str = get_text(4)\n",
"\n",
"answer(4.1, 1569, lambda:\n",
" count_accessible_rolls(grid))"
" count_accessible_rolls(grid_str))"
]
},
{
@ -853,24 +894,24 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 73,
"id": "b47c2e05-978a-4b22-aafc-e31ee1825387",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 4.2: .3818 seconds, answer 9280 correct"
"Puzzle 4.2: .0675 seconds, answer 9280 correct"
]
},
"execution_count": 19,
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(4.2, 9280, lambda:\n",
" count_total_removable_rolls(grid))"
" count_total_removable_rolls(grid_str))"
]
},
{
@ -962,24 +1003,24 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 74,
"id": "f370ee38-67af-42a6-9ad3-cdeec2019ff3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 4.2: .0884 seconds, answer 9280 correct"
"Puzzle 4.2: .0631 seconds, answer 9280 correct"
]
},
"execution_count": 21,
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(4.2, 9280, lambda:\n",
" count_total_removable_rolls(grid))"
" count_total_removable_rolls(grid_str))"
]
},
{
@ -1298,7 +1339,7 @@
"id": "28de0612-03f3-4587-88cb-18ef9b10ee00",
"metadata": {},
"source": [
"*For **Part 2** we get an even-longer function, with even more complications. For example in \"Determine the operator\"; the 14 lines could be replaced with `op = (add if '+' in segment else mul)`. I think ChatGPT ws trying to be robust against errors in the input.*"
"*For **Part 2** ChatGPT writes an even-longer function, with even more complications. For example in \"Determine the operator\"; the 14 lines could be replaced with `op = (add if '+' in segment else mul)`. I think ChatGPT ws trying to be robust against errors in the input.*"
]
},
{
@ -1657,7 +1698,7 @@
"id": "7320594a-1284-4b9a-9a22-c6d2829b5e03",
"metadata": {},
"source": [
"*My friend Barry Hayes suggested I ask Gemini whose code it preferred, its or mine, So I asked Gemini: \"**How would you compare your code to this:\"***"
"*My friend Barry Hayes suggested I ask Gemini \"**How would you compare your code to this:\"***"
]
},
{
@ -1685,7 +1726,7 @@
"id": "7a39ec7d-a27b-40ec-9d32-13f922cc764e",
"metadata": {},
"source": [
"*Here is Gemini's reply:*\n",
"*where this code sample is what I wrote for 9.2. Here is Gemini's reply:*\n",
"\n",
"This is a great comparison. Your code is much more concise and \"Pythonic,\" using collections.Counter and in-place updates to reduce verbosity. It strips away a lot of the boilerplate that my solution included for robustness.\n",
"\n",
@ -1693,7 +1734,7 @@
"- **My Code**: Scans the whole grid for 'S', which handles cases where the entry point isn't at the very top.\n",
"- **Your Code**: Relies on Python's robust dictionary handling to avoid index errors (if b-1 is negative, it just adds a key to the dict). This is clever, but effectively allows particles to wrap around or exist in \"negative space\" if not careful (though typically harmless for this specific puzzle input).\n",
"\n",
"*It seems to me that the instructions make it clear that there must be an 'S' on the top line, although technically it isn't stated explicitly. (If there is no 'S' on the top line, my code will return 0, not crash.) Similarly, the puzzle implies that beams can't go outside the borders, but Gemini is right that if the input file did have a splitter in the first or last column then there could be a problem that I don't check for.*\n",
"*It seems to me that the instructions make it clear that the 'S' must be on the top line. (If there is no 'S' on the top line, my code will return 0, not crash.) Similarly, the puzzle implies that beams can't go outside the borders, but Gemini is right that if the input file **did** have a splitter in the first or last column then there could be a problem that I don't check for.*\n",
"\n",
"*Next I asked Gemini: **Do you think your code could be improved with some more modularity/abstraction, such as having a separate function to get input?** It replied:*\n",
"\n",
@ -1823,7 +1864,7 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": 77,
"id": "fce446d7-b72f-4ade-b783-b40f4212b43b",
"metadata": {},
"outputs": [],
@ -1835,7 +1876,6 @@
"Then multiply together the sizes of the three largest circuits.\n",
"\"\"\"\n",
"\n",
"import sys\n",
"from collections import defaultdict\n",
"\n",
"def parse_input(text):\n",
@ -1931,26 +1971,26 @@
},
{
"cell_type": "code",
"execution_count": 38,
"execution_count": 78,
"id": "dd99ffbf-3304-49f5-8cfa-d5428edf0db7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 8.1: .8497 seconds, answer 24360 correct"
"Puzzle 8.1: .8995 seconds, answer 24360 correct"
]
},
"execution_count": 38,
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"boxes = parse(8, ints, show=0)\n",
"input_text = get_text(8)\n",
"\n",
"answer(8.1, 24360, lambda:\n",
" solve(boxes, 1000))"
" solve(parse_input(input_text), 1000))"
]
},
{
@ -1963,7 +2003,7 @@
},
{
"cell_type": "code",
"execution_count": 39,
"execution_count": 79,
"id": "9ed8312b-4455-4f2d-bef6-67097898aefd",
"metadata": {},
"outputs": [],
@ -1975,7 +2015,6 @@
"Find the last two junction boxes connected and multiply their X coordinates.\n",
"\"\"\"\n",
"\n",
"import sys\n",
"from collections import defaultdict\n",
"\n",
"def parse_input(text):\n",
@ -2063,7 +2102,7 @@
},
{
"cell_type": "code",
"execution_count": 40,
"execution_count": 80,
"id": "64309157-e1d3-48fc-ba06-e5e66ba23026",
"metadata": {},
"outputs": [
@ -2078,17 +2117,17 @@
{
"data": {
"text/plain": [
"Puzzle 8.2: .8570 seconds, answer 2185817796 correct"
"Puzzle 8.2: .9127 seconds, answer 2185817796 correct"
]
},
"execution_count": 40,
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(8.2, 2185817796, lambda:\n",
" solve_part2(boxes))"
" solve_part2(parse_input(input_text)))"
]
},
{
@ -2098,12 +2137,12 @@
"source": [
"# [Day 9](https://adventofcode.com/2025/day/9): ChatGPT 5.1 Auto\n",
"\n",
"*For **Part 1**, ChatGPT wrote a `main` program that reads from input and prints the answer. I told it **Refactor to have a function that takes the points as input and returns the area** and got this:*"
"*For **Part 1**, I was getting tired of all the programs that have a `main` that reads from input and prints the answer, so I told ChatGPT: **Refactor to have a function that takes the points as input and returns the area** and got this:*"
]
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": 63,
"id": "aaf6b9df-b55d-4416-b806-167ad3b70fd5",
"metadata": {},
"outputs": [],
@ -2129,7 +2168,17 @@
" if area > max_area:\n",
" max_area = area\n",
"\n",
" return max_area"
" return max_area\n",
"\n",
"def parse_points(text: str):\n",
" pts = []\n",
" for line in text.splitlines():\n",
" line = line.strip()\n",
" if not line:\n",
" continue\n",
" x_str, y_str = line.split(\",\")\n",
" pts.append((int(x_str), int(y_str)))\n",
" return pts"
]
},
{
@ -2142,26 +2191,26 @@
},
{
"cell_type": "code",
"execution_count": 42,
"execution_count": 81,
"id": "352cb801-67d7-4aaf-b0a1-1a59c1990c22",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 9.1: .0312 seconds, answer 4772103936 correct"
"Puzzle 9.1: .0305 seconds, answer 4772103936 correct"
]
},
"execution_count": 42,
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"points = parse(9, ints, show=0)\n",
"text = get_text(9)\n",
"\n",
"answer(9.1, 4772103936, lambda:\n",
" largest_rectangle_area(points))"
" largest_rectangle_area(parse_points(text)))"
]
},
{
@ -2169,12 +2218,12 @@
"id": "5f8a09ab-0d49-4a6b-b8ea-33e4219af1eb",
"metadata": {},
"source": [
"*For **Part 2**, ChatGPT considered a [scanline algorithm](https://en.wikipedia.org/wiki/Scanline_rendering), and then went for a [flood fill](https://en.wikipedia.org/wiki/Flood_fill). It assumed \"I'm working with a grid that can have dimensions up to 2k x 2k, or 4 million cells, which seems manageable.\" It would be manageable under those asumptions, but the assumptions are wrong. I told it **The grid can have dimensions up to 100,000 x 100,000 and there are about 500 red tile points**. ChatGPT then came up with [ray casting](https://en.wikipedia.org/wiki/Ray_casting) approach using [grid compression](https://stackoverflow.com/questions/29528934/coordinate-compression): define a grid where each red tile is in a separate cell, where the x and y extent of that cell spans an integer range (which must be recorded elsewhere). Now the grid is only of size roughly 500<sup>2</sup> instead of 100,000<sup>2</sup>.*"
"*For **Part 2**, ChatGPT considered a [scanline algorithm](https://en.wikipedia.org/wiki/Scanline_rendering), and then went for a [flood fill](https://en.wikipedia.org/wiki/Flood_fill). It assumed \"I'm working with a grid that can have dimensions up to 2k x 2k, or 4 million cells, which seems manageable.\" ChatGPT is correct that it would be manageable under those asumptions, however the assumptions are wrong. I told it **The grid can have dimensions up to 100,000 x 100,000 and there are about 500 red tile points**. I also gave it my input file, so that it would have a chance of discovering the structure of thChatGPT then came up with a [ray casting](https://en.wikipedia.org/wiki/Ray_casting) approach using [grid compression](https://stackoverflow.com/questions/29528934/coordinate-compression): define a grid where each red tile is in a separate cell, where the x and y extent of that cell spans an integer range (which must be recorded elsewhere). Now the grid is only of size roughly 500<sup>2</sup> instead of 100,000<sup>2</sup>.*"
]
},
{
"cell_type": "code",
"execution_count": 43,
"execution_count": 83,
"id": "16aa5c15-e5d5-4652-83d8-d99da9196617",
"metadata": {},
"outputs": [],
@ -2348,24 +2397,24 @@
},
{
"cell_type": "code",
"execution_count": 44,
"execution_count": 85,
"id": "3388880d-6df9-4453-8b54-43175e3077e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 9.2: .7880 seconds, answer 1529675217 correct"
"Puzzle 9.2: .8001 seconds, answer 1529675217 correct"
]
},
"execution_count": 44,
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(9.2, 1529675217, lambda:\n",
" largest_rectangle_area_red_green(points))"
" largest_rectangle_area_red_green(parse_points(text)))"
]
},
{
@ -2375,12 +2424,12 @@
"source": [
"# [Day 10](https://adventofcode.com/2025/day/10): Gemini 3 Pro\n",
"\n",
"*Gemini had no problem with Part 1:*"
"*Gemini had no problem with **Part 1:***"
]
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": 87,
"id": "da67e5ff-d1f7-418c-8d55-7dacea3a51b6",
"metadata": {},
"outputs": [],
@ -2498,7 +2547,7 @@
},
{
"cell_type": "code",
"execution_count": 46,
"execution_count": 88,
"id": "6175af72-5ab6-4ca3-9825-613dbf1ad4f4",
"metadata": {},
"outputs": [
@ -2512,17 +2561,19 @@
{
"data": {
"text/plain": [
"Puzzle 10.1: .0065 seconds, answer 441 correct"
"Puzzle 10.1: .0041 seconds, answer 441 correct"
]
},
"execution_count": 46,
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_text = get_text(10)\n",
"\n",
"answer(10.1, 441, lambda:\n",
" solve(get_text(10)))"
" solve(input_text))"
]
},
{
@ -2530,12 +2581,12 @@
"id": "f407a27f-f1ac-4c4a-bd46-649449c4dbf1",
"metadata": {},
"source": [
"*Part 2 was trickier. Gemini's first solution used z3 (it even gave nice instructions for how to pip install z3), but I responded with the prompt **Can you do it without using z3?** to which Gemini wrote its own Gaussian elimination code:*"
"***Part 2*** *was trickier. Gemini's first solution used z3 (it even gave nice instructions for how to pip install z3), but I responded with the prompt **Can you do it without using z3?** to which Gemini wrote its own Gaussian elimination code:*"
]
},
{
"cell_type": "code",
"execution_count": 47,
"execution_count": 89,
"id": "8326ffe4-c435-4823-a0c4-3c062c52a06f",
"metadata": {},
"outputs": [],
@ -2725,29 +2776,29 @@
"id": "89366a12-507d-4730-9be9-df757bb999c6",
"metadata": {},
"source": [
"*The part that says `if not free_cols: search(0, []) else: search(0, [])` is a bit unsettling, and I'm not a fan of `nonlocal` in this context, but the code works; the only downside is that it takes 10 seconds to run.*"
"*The part that says `if not free_cols: search(0, []) else: search(0, [])` is a bit unsettling, and I'm not a big fan of `nonlocal` in this context, but the code works; the only downside is that it takes about 10 seconds to run.*"
]
},
{
"cell_type": "code",
"execution_count": 48,
"execution_count": 90,
"id": "25b94f1e-98ad-4bda-9b75-716dd6a503d3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Puzzle 10.2: 11.2978 seconds, answer 18559 correct"
"Puzzle 10.2: 11.2032 seconds, answer 18559 correct"
]
},
"execution_count": 48,
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(10.2, 18559, lambda:\n",
" solve(get_text(10)))"
" solve(input_text))"
]
},
{
@ -2764,7 +2815,7 @@
},
{
"cell_type": "code",
"execution_count": 49,
"execution_count": 92,
"id": "3f666d8d-caa6-4ec4-a75d-d23f6bf2a21a",
"metadata": {},
"outputs": [],
@ -2843,7 +2894,7 @@
},
{
"cell_type": "code",
"execution_count": 50,
"execution_count": 93,
"id": "8e9d177e-d7b2-4887-a1be-4f4c5247924f",
"metadata": {},
"outputs": [
@ -2857,17 +2908,17 @@
{
"data": {
"text/plain": [
"Puzzle 10.2: .1212 seconds, answer 18559 correct"
"Puzzle 10.2: .1225 seconds, answer 18559 correct"
]
},
"execution_count": 50,
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer(10.2, 18559, lambda:\n",
" solve(get_text(10)))"
" solve(input_text))"
]
},
{
@ -3014,7 +3065,7 @@
" # Convert to frozenset for caching\n",
" graph_tuple = {k: tuple(v) for k, v in graph.items()}\n",
"\n",
"*I can see where this is coming from: lists are not hashable, so you can't have them in a cache, but tuples are cachable. However, a dict is not hashable, so this doesn't help. In the end, Claude decided to use the inner function `dp`, so the graph is not in the cache anyway, and this step is not needed. Maybe Claude changed its mind and forgot to update this part of the code.*"
"*I can see where this is coming from: lists mutable and thus not hashable, so you can't have them in a cache, but frozensets and tuples are immutable and cachable. Maybe Claude was first going to convert `v` to a frozenset and then decided on a tuple? However, a dict is not hashable, so this doesn't help. In the end, Claude decided to use the inner function `dp`, so the graph is not in the cache anyway, and this step is not needed. Maybe Claude changed its mind and forgot to update this part of the code.*"
]
},
{
@ -3148,13 +3199,11 @@
"metadata": {},
"outputs": [],
"source": [
"#!/usr/bin/env python3\n",
"\"\"\"\n",
"Solution for the device path counting puzzle (Part 2).\n",
"Count all paths from 'svr' to 'out' that visit both 'dac' and 'fft'.\n",
"\"\"\"\n",
"\n",
"import sys\n",
"from collections import defaultdict\n",
"from functools import lru_cache\n",
"\n",
@ -3256,6 +3305,8 @@
"source": [
"# [Day 12](https://adventofcode.com/2025/day/12): ChatGPT 5.2 Auto\n",
"\n",
"***Note:*** *Since 3 days ago, ChatGPT has a new version, 5.2. I don't notice a big difference, but I only did this one interaction.*\n",
"\n",
"*In my prompt I included my actual input, because that is key to the shortcut for solving the problem (which I covered in [**my notebook**](Advent-2025.ipynb)). ChatGPT didn't detect the shortcut and wrote code to rotate the shapes and search through possible placements. It did have the check for `total_area > W * H`, so it is able to efficiently reject the regions with too many presents. But for the regions where there is a trivial fit into 3x3 squares, ChatGPT's code still tries to pack them in tightly rather than doing the simple layout.* "
]
},
@ -3532,8 +3583,10 @@
}
],
"source": [
"text = get_text(12)\n",
"\n",
"answer(12.1, 454, lambda:\n",
" solve(get_text(12)))"
" solve(text))"
]
},
{
@ -3541,37 +3594,45 @@
"id": "8aa26008-a652-4860-9c84-5ba4344d32f3",
"metadata": {},
"source": [
"# Summary, Comparison to Human Coder, Conclusions\n",
"# Summary of Run Times\n",
"\n",
"*Overall, the models did well, producing code that gives the correct answer in a reasonable run time. Some of the code could be improved stylistically. (But I guess if you're vibe coding and not looking at the code, maybe that doesn't matter.)*\n",
"*All the models and the human eventually got all the correct answers. Here are the run times and some comments.*\n",
"\n",
"\n",
"|Puzzle|LLM<br>Time|Human<br>Time|LLM|Remarks|\n",
"|------|--------:|--------:|----|---|\n",
"| 1.1 | .001 | .000 | Gemini | Straightforward and easy for LLM and human.|\n",
"| 1.2 | .002 | .001 | Gemini | Both LLM and human erred on the distance from 0 to 0. |\n",
"| 2.1 | .121 | .003 | Claude |Easy|\n",
"| 2.2 | .135 | .004 |Claude |Both LLM and human found the more efficient half-digits approach|\n",
"| 3.1 | .004 | .001 | ChatGPT|Easy|\n",
"| 3.2 | .007 | .002 | ChatGPT|Easy|\n",
"| 4.1 | .016 | .055 |Gemini |Easy|\n",
"| 4.2 | .379 | .139| Gemini|LLM chose the less efficient scan-whole-grid approach|\n",
"| 5.1 | .005 | .012 | Claude|Easy|\n",
"| 5.2 | .000 | .000| Claude|Easy|\n",
"| 6.1 | .008 | .002 | ChatGPT|Easy; bad “if x: True else: False” idiom by LLM|\n",
"| 6.2 | .006 | .006 | ChatGPT|Easy; LLM overly verbose|\n",
"| 7.1 | .001 | .001 | Gemini|Easy|\n",
"| 7.2 | .002 | .002 | Gemini|Easy|\n",
"| 8.1 | .828 | .583 | Claude|Easy|\n",
"| 8.2 | .835 | .618| Claude|Easy, but LLMs Union-Find data type runs slower than simple approach.|\n",
"| 9.1 | .027 | .037 | ChatGPT|Easy\n",
"| 9.2 | .771 | .016 | ChatGPT|LLM code a bit complicated; human uses “2 point” trick for speedup|\n",
"| 10.1 | .005 | .001 |Gemini | Easy|\n",
"| 10.2 | .120 | .112 | Gemini | milp solutions similar; LLM offers other solutions; neither does search efficiently|\n",
"| 11.1 | .023 | .001 | Claude|Easy; LLM has a bit of vestigial code|\n",
"| 11.2 | .001 | .001 | Claude|Easy|\n",
"| 12.1 | 195.396 | .002 | ChatGPT|Human saw shortcut to avoid search; LLM wrote search functions|\n",
"| TOTAL| 198.450 | 1.715| all | Human-written code is a bit faster, and much faster on 12.1 |"
"*The LLM lines-of-code count is 5 times the human count. The LLM run times are roughly double the human-written run times, if we throw out 12.1, where the human noticed the trick and the LLM didn't. But all the solutions run in under a second, so run time is not a big deal.*"
]
},
{
"cell_type": "markdown",
"id": "49736354-51c0-4870-bc39-f7a2e955e196",
"metadata": {},
"source": [
" DAY LLM Human LLM Human\n",
" #\tLLM Time\tTime LOC LOC\t\n",
" --- ------ ---- ----- --- -----\n",
" 1.1\tGemini\t.001\t.001\t51\t6\tStraightforward and easy for LLM and human.\n",
" 1.2\tGemini\t.002\t.001\t75\t11\tBoth LLM and human erred on the distance from 0 to 0.\n",
" 2.1\tClaude\t.121\t.003\t29\t17\tEasy\n",
" 2.2\tClaude\t.135\t.004\t35\t16\tBoth LLM and human found the more efficient half-digits approach\n",
" 3.1\tChatGPT\t.004\t.001\t22\t11\tEasy\n",
" 3.2\tChatGPT\t.007\t.002\t42\t14\tEasy\n",
" 4.1\tGemini\t.016\t.055\t44\t9\tEasy\n",
" 4.2\tGemini\t.379\t.139\t52\t8\tLLM chose the less efficient scan-whole-grid approach\n",
" 5.1\tClaude\t.005\t.012\t45\t11\tEasy\n",
" 5.2\tClaude\t.001\t.001\t58\t9\tEasy\n",
" 6.1\tChatGPT\t.008\t.002\t67\t7\tEasy; bad “if x: True else: False” idiom by LLM\n",
" 6.2\tChatGPT\t.006\t.006\t87\t27\tEasy; LLM overly verbose\n",
" 7.1\tGemini\t.001\t.001\t63\t13\tEasy\n",
" 7.2\tGemini\t.002\t.002\t70\t11\tEasy\n",
" 8.1\tClaude\t.828\t.583\t91\t27\tEasy\n",
" 8.2\tClaude\t.835\t.618\t82\t11\tEasy; but LLMs Union-Find data type runs slower than simple approach.\n",
" 9.1\tChatGPT\t.027\t.037\t33\t7\tEasy\n",
" 9.2\tChatGPT\t.771\t.016\t157\t36\tLLM code a bit complicated; human uses “2 point” trick for speedup\n",
" 10.1\tGemini\t.005\t.001\t101\t18\tEasy\n",
" 10.2\tGemini\t.120\t.112\t70\t13\tmilp solutions similar; LLM offers other solutions\n",
" 11.1\tClaude\t.023\t.001\t83\t11\tEasy; LLM has a bit of vestigial code\n",
" 11.2\tClaude\t.001\t.001\t77\t11\tEasy\n",
" 12.1\tChatGPT\t3min\t.002\t238\t20\tHuman saw shortcut to avoid search; LLM wrote search functions\n",
" TOTAL 3.29\t1.60 1672\t324\tHuman-written code is 5x briefer 2x faster"
]
},
{
@ -3613,7 +3674,7 @@
}
],
"source": [
"summary(answers)"
"summary(answers) # This is the most recent run; the chart above is from a previous run."
]
}
],