Cheryl
This commit is contained in:
parent
c29cf10727
commit
7f591a2f44
File diff suppressed because it is too large
@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div align=\"right\" style=\"text-align: right\"><i>Peter Norvig<br>April 2015<br>Steve's bus: Apr 2020<br>Mad Cheryl: May 2020</i></div>\n",
|
||||
"<div align=\"right\" style=\"text-align: right\"><i>Peter Norvig<br>April 2015</i></div>\n",
|
||||
"\n",
|
||||
"# When is Cheryl's Birthday?\n",
|
||||
"\n",
|
||||
@ -22,7 +22,7 @@
|
||||
"6. **Albert**: \"Then I also know when Cheryl's birthday is.\"\n",
|
||||
"7. So when is Cheryl's birthday?\n",
|
||||
"\n",
|
||||
"Let's work through the puzzle line by line.\n",
|
||||
"This puzzle is designed for a paper-and-pencil solution, but I'm going to solve it with code; code is more flexible and can be used to solve other similar puzzles. Let's work through the puzzle line by line.\n",
|
||||
"\n",
|
||||
"## 1. Cheryl gives Albert and Bernard a list of 10 possible dates:\n",
|
||||
"\n",
|
||||
@ -197,7 +197,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def albert1(date) -> bool:\n",
|
||||
"def albert1(date: str) -> bool:\n",
|
||||
" \"\"\"Albert: I don't know when Cheryl's birthday is, and I know that Bernard does not know.\"\"\"\n",
|
||||
" dates = told(month(date))\n",
|
||||
" return not know(dates) and not satisfy(dates, lambda date: know(told(day(date))))"
|
||||
@ -247,7 +247,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def bernard1(date) -> bool:\n",
|
||||
"def bernard1(date: str) -> bool:\n",
|
||||
" \"Bernard: At first I don't know when Cheryl's birthday is, but I know now.\"\n",
|
||||
" at_first = told(day(date))\n",
|
||||
" now = satisfy(at_first, albert1)\n",
|
||||
@ -308,7 +308,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def albert2(date) -> bool:\n",
|
||||
"def albert2(date: str) -> bool:\n",
|
||||
" \"Albert: Then I also know when Cheryl's birthday is.\" \n",
|
||||
" now = satisfy(told(month(date)), bernard1)\n",
|
||||
" return know(now)"
|
||||
@ -348,106 +348,6 @@
|
||||
"source": [
|
||||
"**Success!** We have deduced that Cheryl's birthday is **July 16**. We know Cheryl's birthday:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"assert know(cheryls_birthday())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"___\n",
|
||||
"\n",
|
||||
"# New Puzzle: Steve's Bus\n",
|
||||
"\n",
|
||||
"Here's [another puzzle](https://www.reddit.com/r/riddles/comments/fw7h42/a_riddle_i_couldnt_solve/) that seems to have a very similar format:\n",
|
||||
"\n",
|
||||
"1. Steve tells Alice the hour of his bus departure and he tells Annie at which minute it leaves. He also tells them both that the bus leaves between 06:00 and 10:00.\n",
|
||||
"2. Alice and Annie consult the timetable and find the following services between those two times:\n",
|
||||
" - 06:32, 06:43, 06:50, 07:17, 07:46, 08:19, 08:32, 09:17, 09:19, 09:50.\n",
|
||||
"4. Alice then says “I don’t know when Steve’s bus leaves but I am sure that neither does Annie”\n",
|
||||
"5. Annie replies “I didn’t know his bus, but now I do”\n",
|
||||
"6. Alice responds “Now I do as well!”\n",
|
||||
"7. When is Steve’s bus?\n",
|
||||
"\n",
|
||||
"Upon closer inspection, not only is it a similar format, it is **exactly** the same puzzle, except that months are changed to hours and days to minutes. If we change the colons in the times to spaces, we can solve the problem without changing the `cheryls_birthday` function:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'08 32'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"DATES = '06:32, 06:43, 06:50, 07:17, 07:46, 08:19, 08:32, 09:17, 09:19, 09:50'.replace(':', ' ').split(', ')\n",
|
||||
"cheryls_birthday()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Steve took the 8:32 bus."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Another New Puzzle: Evil Mad Scientist Cheryl\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Again, we can solve this problem just by changing the global variable `DATES`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'C 3'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"DATES = {'A 2', 'A 3', 'A 6', 'B 4', 'B 5', 'C 1', 'C 3', 'D 1', 'D 2', 'D 4'}\n",
|
||||
"\n",
|
||||
"cheryls_birthday()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The correct pad is \"C 3\". \n",
|
||||
"\n",
|
||||
"(But may I point out that this Cheryl is not actually a mad scientist, just a [mad engineer](https://www.evilmadscientist.com/2015/evil-mad-engineers/). A true mad scientist would kill 25 people and use the other 25 as a control group.)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
@ -5,46 +5,42 @@
|
||||
"id": "e21c9af1-0087-440b-8dfe-758e0361f6e9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<div align=\"right\"><i>Peter Norvig<br>Sept 25, 2024</i></div>\n",
|
||||
"<div align=\"right\"><i>Peter Norvig<br>Sept 25, 2024<br>Update May 21, 2025</i></div>\n",
|
||||
"\n",
|
||||
"# LLMs, Theory of Mind, and Cheryl's Birthday\n",
|
||||
"\n",
|
||||
"There has been [much](https://spectrum.ieee.org/theory-of-mind-ai) [debate](https://aclanthology.org/2023.conll-1.25/) [on](https://www.gsb.stanford.edu/faculty-research/working-papers/theory-mind-may-have-spontaneously-emerged-large-language-models) [the](https://arxiv.org/abs/2302.02083) [degree](https://www.nature.com/articles/s41562-024-01882-z) to which Large Language Models (LLMs) have a theory of mind: a way of understanding what other people know and don't know. In this notebook I explore one small part of the issue by asking nine LLM chatbots to solve the [Cheryl's Birthday Problem](https://en.wikipedia.org/wiki/Cheryl%27s_Birthday), a well-known logic puzzle in which different characters have different states of knowledge at different times. I gave the candidate solvers two tasks:\n",
|
||||
"1. Write a program to solve the problem.\n",
|
||||
"2. Solve a re-worded variant of the problem with different dates (so that they can't just retrieve a memorized answer).\n",
|
||||
"1. **Write a program** to solve the problem, allowing for any possible set of dates.\n",
|
||||
"2. **Solve** re-worded variants of the problem with different dates (so that they can't just retrieve a memorized answer).\n",
|
||||
"\n",
|
||||
"Here are the ten solvers:\n",
|
||||
"I did this originally in September 2024, and **all** of the LLMs failed.\n",
|
||||
"But in 2025, all the models had updated versions, and Claude 3.7 Sonnet, Google Gemini 2.5 Pro and You.com Compute passed the test.\n",
|
||||
"\n",
|
||||
"- [A human programmer](https://github.com/norvig/)\n",
|
||||
"- [ChatGPT 4o](https://chatgpt.com/)\n",
|
||||
"- [Microsoft Copilot](https://copilot.microsoft.com/)\n",
|
||||
"- [Gemini Advanced](https://gemini.google.com/app)\n",
|
||||
"- [Meta AI Llama 405B](https://www.meta.ai/)\n",
|
||||
"- [Anthropic Claude 3.5 Sonnet](https://claude.ai/new)\n",
|
||||
"- [Perplexity](https://www.perplexity.ai/)\n",
|
||||
"- [Cohere Chat](https://cohere.com/chat)\n",
|
||||
"- [HuggingFace Chat](https://huggingface.co/chat/)\n",
|
||||
"- [You.com](https://you.com/)\n",
|
||||
"|September 2024|May 2025|\n",
|
||||
"|----|----|\n",
|
||||
"|✅ [A human programmer](https://github.com/norvig/)|✅ A human programmer|\n",
|
||||
"|❌ [Anthropic Claude 3.5 Sonnet](https://claude.ai/new)|✅ Claude 3.7 Sonnet|\n",
|
||||
"|❌ [Gemini Advanced](https://gemini.google.com/app)|✅ Gemini 2.5 Pro|\n",
|
||||
"|❌ [You.com](https://you.com/)|✅ You.com Compute|\n",
|
||||
"|❌ [ChatGPT 4o](https://chatgpt.com/)|❌ ChatGPT o4-mini-high|\n",
|
||||
"|❌ [Microsoft Copilot](https://copilot.microsoft.com/)|❌ Microsoft Copilot|\n",
|
||||
"|❌ [Meta AI Llama 405B](https://www.meta.ai/)|❌ Meta AI|\n",
|
||||
"|❌ [Perplexity](https://www.perplexity.ai/)|❌ Perplexity|\n",
|
||||
"|❌ [Cohere Chat](https://cohere.com/chat)|❌ Cohere Command A|\n",
|
||||
"|❌ [HuggingFace Chat](https://huggingface.co/chat/)|❌ HuggingChat v0.9.4|\n",
|
||||
"\n",
|
||||
"# TLDR: Conclusions\n",
|
||||
"\n",
|
||||
"1. The human solved both requests.\n",
|
||||
"2. None of the LLMs could reliably solve either request.\n",
|
||||
"\n",
|
||||
"The LLMs were all familiar with the problem, so I didn't have to describe it in the prompt, just name it. Most of them correctly recalled the answer to the original problem: July 16. But none of them were able to write a program. They all failed to distinguish the different knowledge states of the different characters over time, both in the program they wrote and in the reasoning steps for the second request. At least with respect to this problem, they had no theory of mind. (Perhaps that is in part due to the fact that very few of the Python programs they were trained on deal with theory of mind.)\n",
|
||||
"The LLMs were all familiar with the problem, so I didn't have to describe it in the prompt, just name it. Most of them correctly recalled the answer to the original problem: July 16. But all of them in 2024, and the majority of them in 2025, failed to distinguish the different knowledge states of Albert and Bernard. At least with respect to this problem, they had a poorly developed theory of mind. (Perhaps that is in part due to the fact that very few of the Python programs they were trained on deal with problems like this.)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# First Prompt\n",
|
||||
"# 2024: Writing a Program\n",
|
||||
"\n",
|
||||
"Here is the first prompt:\n",
|
||||
"Here is the prompt I used:\n",
|
||||
"\n",
|
||||
"___\n",
|
||||
"***What is the answer to the \"Cheryl's Birthday\" problem? Write a Python program to solve it. Make sure that the program will still work if the list of possible dates is changed.***\n",
|
||||
"___\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Responses to First Prompt\n",
|
||||
"\n",
|
||||
"Each LLM provided explanatory output along with a program; for brevity I only show the explanatory output from the first LLM, ChatGPT 4o. My comments are in *[bracketed italics]*. \n"
|
||||
]
|
||||
},
|
||||
@ -55,7 +51,7 @@
|
||||
"source": [
|
||||
"# Human\n",
|
||||
"\n",
|
||||
"An actual human (me) was able to write a program, shown in [**another notebook**](https://github.com/norvig/pytudes/blob/main/ipynb/Cheryl-and-Eve.ipynb), that correctly solves the original problem and also handles new sets of dates, and other variations on the problem. I introduced the idea of a *BeliefState*, a set of possible dates that a person believes might possibly be the birthday, and I modeled a character's *statement* as a function that takes a particular date as input, and returns true if the date is consistent with the statement."
|
||||
"An actual human programmer (me) was able to write a program, shown in [**another notebook**](https://github.com/norvig/pytudes/blob/main/ipynb/Cheryl-and-Eve.ipynb), that correctly solves the original problem and also handles different sets of dates. I introduced the idea of a *BeliefState*, a set of possible dates that a person believes might possibly be the birthday, and I modeled a character's *statement* as a function that takes a particular date as input, and returns true if the date is consistent with the statement."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -207,11 +203,13 @@
|
||||
"source": [
|
||||
"def cheryls_birthday(possible_dates):\n",
|
||||
" # Step 1: Filter out months with unique days\n",
|
||||
" unique_days = {date.split()[1] for date in possible_dates if sum(d.split()[1] == date.split()[1] for d in possible_dates) == 1}\n",
|
||||
" unique_days = {date.split()[1] for date in possible_dates if sum(d.split()[1] == date.split()[1] \n",
|
||||
" for d in possible_dates) == 1}\n",
|
||||
" possible_dates = [date for date in possible_dates if date.split()[1] not in unique_days]\n",
|
||||
"\n",
|
||||
" # Step 2: Filter out dates with unique months\n",
|
||||
" unique_months = {date.split()[0] for date in possible_dates if sum(d.split()[0] == date.split()[0] for d in possible_dates) == 1}\n",
|
||||
" unique_months = {date.split()[0] for date in possible_dates if sum(d.split()[0] == date.split()[0] \n",
|
||||
" for d in possible_dates) == 1}\n",
|
||||
" possible_dates = [date for date in possible_dates if date.split()[0] not in unique_months]\n",
|
||||
"\n",
|
||||
" # Step 3: Filter out remaining dates based on the logic of the puzzle\n",
|
||||
@ -686,9 +684,9 @@
|
||||
"id": "dad267ee-36c9-4133-ae22-a436c2024e47",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Second Prompt\n",
|
||||
"# 2024: Solving a Re-worded Variant\n",
|
||||
"\n",
|
||||
"I used [my program](https://github.com/norvig/pytudes/blob/main/ipynb/Cheryl-and-Eve.ipynb) to generate a new set of 10 dates that work, changed the wording, and used this as the prompt:\n",
|
||||
"I used [my program](https://github.com/norvig/pytudes/blob/main/ipynb/Cheryl-and-Eve.ipynb) to generate a new set of 10 dates that work. Then I changed the names of the characters and the wording of the puzzle, and used this as the prompt:\n",
|
||||
"\n",
|
||||
"___\n",
|
||||
"1. **Ali and Bo are friends with Cam. Cam told them that her anniversary is one of 10 possible dates:**\n",
|
||||
@ -701,8 +699,6 @@
|
||||
"___\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Responses to Second Prompt\n",
|
||||
"\n",
|
||||
"All the LLMs were generally headed in the right direction in their reasoning, but all made mistakes. For example, Claude says \"*Bo hears the day and realizes after Ali's statement. Since Bo did not initially know the date, the day number Bo heard must appear in more than one month. Therefore, the days 16, 18, and 19 must be eliminated since they have corresponding unique months.*\" But that's just not right; they don't have unique months. \n",
|
||||
"\n",
|
||||
"As it turns out, [you.com](http://you.com) did get the right answer, March 18! But some of the reasoning steps were wrong, so I tested it on another set of 10 dates, and it failed on that. Thus I declare that all the LLMs fail on this problem.\n",
|
||||
@ -722,6 +718,14 @@
|
||||
"|[HuggingFace Chat](https://huggingface.co/chat/)|July 17|\n",
|
||||
"|[You.com](https://you.com)|**March 18** (but wrong answer on follow-up problem)|\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7a069aeb-6fc1-4265-9f8a-28bd3eb0a520",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
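The hunks above change `albert1`, `bernard1`, and `albert2`, but the helpers they call (`told`, `know`, `satisfy`, `month`, `day`) and the `DATES` global fall outside the diff context. The following is a minimal sketch of how those pieces could fit together, not the notebook's actual definitions: the helper bodies, the final `return` line of `bernard1` (cut off in the hunk), and the `solve` wrapper are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Set

# Global list of candidate dates, each written as "<month> <day>".
DATES = ['May 15', 'May 16', 'May 19', 'June 17', 'June 18',
         'July 14', 'July 16', 'August 14', 'August 15', 'August 17']

def month(date: str) -> str: return date.split()[0]
def day(date: str) -> str:   return date.split()[1]

def told(part: str) -> Set[str]:
    """Assumed helper: the dates still possible for someone told one part (month or day)."""
    return {date for date in DATES if part in date.split()}

def know(beliefs: Set[str]) -> bool:
    """Assumed helper: a person knows the date when exactly one possibility remains."""
    return len(beliefs) == 1

def satisfy(dates: Iterable[str], *statements: Callable[[str], bool]) -> Set[str]:
    """Assumed helper: the subset of dates consistent with every statement."""
    return {date for date in dates if all(s(date) for s in statements)}

def albert1(date: str) -> bool:
    """Albert: I don't know when Cheryl's birthday is, and I know that Bernard does not know."""
    dates = told(month(date))
    return not know(dates) and not satisfy(dates, lambda date: know(told(day(date))))

def bernard1(date: str) -> bool:
    "Bernard: At first I don't know when Cheryl's birthday is, but I know now."
    at_first = told(day(date))
    now = satisfy(at_first, albert1)
    return not know(at_first) and know(now)  # assumed: this line is not shown in the hunk

def albert2(date: str) -> bool:
    "Albert: Then I also know when Cheryl's birthday is."
    now = satisfy(told(month(date)), bernard1)
    return know(now)

def cheryls_birthday() -> Set[str]:
    """Dates consistent with all three statements, read from the global DATES."""
    return satisfy(DATES, albert1, bernard1, albert2)

def solve(dates: Iterable[str]) -> Set[str]:
    """Hypothetical convenience wrapper: swap in a new DATES list, then solve."""
    global DATES
    DATES = list(dates)
    return cheryls_birthday()
```

Under these assumptions, `solve` applied to the original ten dates gives `{'July 16'}`, and applied to the bus timetable with colons replaced by spaces it gives `{'08 32'}`, matching the outputs recorded in the notebook cells above. This sketch matches parts exactly via `date.split()`, a choice made here so that a short part such as `'1'` cannot match inside a longer one such as `'15'`.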