From 1bc4923c355770ddb9c0b52d8fa3035057ce23fc Mon Sep 17 00:00:00 2001
From: Peter Norvig
Date: Fri, 18 Oct 2024 22:14:24 -0700
Subject: [PATCH] Add files via upload

---
 ipynb/CherylMind.ipynb | 80 +++++++++++++++++++++++++++++++++---------
 ipynb/Triplets.ipynb   | 77 +++++++++++++++++++++-------------------
 2 files changed, 104 insertions(+), 53 deletions(-)

diff --git a/ipynb/CherylMind.ipynb b/ipynb/CherylMind.ipynb
index 3508070..94e13ab 100644
--- a/ipynb/CherylMind.ipynb
+++ b/ipynb/CherylMind.ipynb
@@ -9,9 +9,12 @@
 "\n",
 "# LLMs, Theory of Mind, and Cheryl's Birthday\n",
 "\n",
- "There has been [much](https://spectrum.ieee.org/theory-of-mind-ai) [debate](https://aclanthology.org/2023.conll-1.25/) [on](https://www.gsb.stanford.edu/faculty-research/working-papers/theory-mind-may-have-spontaneously-emerged-large-language-models) [the](https://arxiv.org/abs/2302.02083) [degree](https://www.nature.com/articles/s41562-024-01882-z) to which Large Language Models (LLMs) have a theory of mind: a way of understanding what other people know and don't know. In this notebook I explore one small part of the issue by asking nine LLM chatbots to solve the [Cheryl's Birthday Problem](https://en.wikipedia.org/wiki/Cheryl%27s_Birthday), a well-known logic puzzle in which different characters have different states of knowledge at different times.\n",
+ "There has been [much](https://spectrum.ieee.org/theory-of-mind-ai) [debate](https://aclanthology.org/2023.conll-1.25/) [on](https://www.gsb.stanford.edu/faculty-research/working-papers/theory-mind-may-have-spontaneously-emerged-large-language-models) [the](https://arxiv.org/abs/2302.02083) [degree](https://www.nature.com/articles/s41562-024-01882-z) to which Large Language Models (LLMs) have a theory of mind: a way of understanding what other people know and don't know. In this notebook I explore one small part of the issue by asking nine LLM chatbots to solve the [Cheryl's Birthday Problem](https://en.wikipedia.org/wiki/Cheryl%27s_Birthday), a well-known logic puzzle in which different characters have different states of knowledge at different times. I gave the candidate solvers two tasks:\n",
+ "1. Write a program to solve the problem.\n",
+ "2. Solve a re-worded variant of the problem with different dates (so that they can't just retrieve a memorized answer).\n",
+ "\n",
+ "Here are the ten solvers:\n",
 "\n",
- "I asked the following ten solvers to tackle the Cheryl's Birthday problem:\n",
 "- [A human programmer](https://github.com/norvig/)\n",
 "- [ChatGPT 4o](https://chatgpt.com/)\n",
 "- [Microsoft Copilot](https://copilot.microsoft.com/)\n",
@@ -23,15 +26,26 @@
 "- [HuggingFace Chat](https://huggingface.co/chat/)\n",
 "- [You.com](https://you.com/)\n",
 "\n",
- "# TLDR: Conclusion\n",
+ "# TLDR: Conclusions\n",
 "\n",
- "The LLMs were all familiar with the problem, so I didn't have to describe it in the prompt, just name it. Most of them correctly recalled the answer to the problem: July 16. But none of them were able to write a program that finds the solution. They all failed to distinguish the different knowledge states of the different characters over time. At least with respect to this problem, they had no theory of mind. (Perhaps that is in part due to the fact that very few of the Python programs they were trained on deal with theory of mind.)\n",
+ "1. The human solved both requests.\n",
+ "2. None of the LLMs could reliably solve either request.\n",
 "\n",
- "Below I show the response for each LLM. Each one provided explanatory output along with a program; for brevity I only show the explanatory output from the first one, ChatGPT 4o. My comments are in *[bracketed italics]*. The queries were made on Sept 25, 2024; subsequent updates of the models may perform differently.\n",
+ "The LLMs were all familiar with the problem, so I didn't have to describe it in the prompt, just name it. Most of them correctly recalled the answer to the original problem: July 16. But none of them were able to write a correct program. They all failed to distinguish the different knowledge states of the different characters over time, both in the programs they wrote and in the reasoning steps for the second request. At least with respect to this problem, they had no theory of mind. (Perhaps that is in part due to the fact that very few of the Python programs they were trained on deal with theory of mind.)\n",
 "\n",
- "For each LLM the prompt was:\n",
 "\n",
- "***What is the answer to the \"Cheryl's Birthday\" problem? Write a Python program to solve it. Make sure that the program will still work if the list of possible dates is changed.***"
+ "# First Prompt\n",
+ "\n",
+ "Here is the first prompt:\n",
+ "\n",
+ "___\n",
+ "***What is the answer to the \"Cheryl's Birthday\" problem? Write a Python program to solve it. Make sure that the program will still work if the list of possible dates is changed.***\n",
+ "___\n",
+ "\n",
+ "\n",
+ "# Responses to First Prompt\n",
+ "\n",
+ "Each LLM provided explanatory output along with a program; for brevity I only show the explanatory output from the first LLM, ChatGPT 4o. My comments are in *[bracketed italics]*.\n"
 ]
 },
 {
@@ -311,7 +325,7 @@
 "source": [
 "# Meta Llama 405B\n",
 "\n",
- "*[Llama also fails to find a solution (and thus has no output).]*"
+ "*[Llama also fails to find a solution (and thus has no output). This program is quite flawed. I do like that it defines functions for the various pieces, but note that the function \"bernard_deduce\" is defined but never called.]*"
 ]
 },
 {
@@ -356,14 +370,6 @@
 " break"
 ]
 },
- {
- "cell_type": "markdown",
- "id": "469c146d-bbc9-4b2f-8609-b82631a5139d",
- "metadata": {},
- "source": [
- "*[This program is quite flawed. I do like that it defines functions for the various pieces, but note that the function \"bernard_deduce\" is defined but never called.]*"
- ]
- },
 {
 "cell_type": "markdown",
 "id": "02b1f7fd-918c-432d-8784-95dff3d4f86e",
 "metadata": {},
@@ -674,6 +680,48 @@
 "\n",
 "print(cheryls_birthday(possible_dates))"
 ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dad267ee-36c9-4133-ae22-a436c2024e47",
+ "metadata": {},
+ "source": [
+ "# Second Prompt\n",
+ "\n",
+ "I used [my program](https://github.com/norvig/pytudes/blob/main/ipynb/Cheryl-and-Eve.ipynb) to generate a new set of 10 dates that work, changed the wording, and used this as the prompt:\n",
+ "\n",
+ "___\n",
+ "1. **Ali and Bo are friends with Cam. Cam told them that her anniversary is one of 10 possible dates:**\n",
+ "    - **April 17, April 18, April 28, July 16, July 17, July 19, June 16, June 29, March 18, March 19**\n",
+ "2. **Cam then privately tells Ali the month and Bo the day number of the anniversary.**\n",
+ "3. **Ali: \"I don't know when Cam’s anniversary is, and I know that Bo does not know it either.\"**\n",
+ "4. **Bo: \"At first I didn't know when Cam’s anniversary was, but I know now, after Ali's statement.\"**\n",
+ "5. **Ali: \"Then I also know when Cam’s anniversary is.\"**\n",
+ "6. **When is Cam’s anniversary?**\n",
+ "___\n",
+ "\n",
+ "\n",
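+ "Before looking at the responses, here is a minimal sketch of the kind of bookkeeping the puzzle demands: each of the three statements filters the set of dates that a character, told only the month or only the day, still considers possible. (This sketch and its helper names are my own illustration, not code from the linked notebook.)\n",
+ "\n",
+ "```python\n",
+ "dates = {('April', 17), ('April', 18), ('April', 28),\n",
+ "         ('July', 16), ('July', 17), ('July', 19),\n",
+ "         ('June', 16), ('June', 29),\n",
+ "         ('March', 18), ('March', 19)}\n",
+ "\n",
+ "def told(part, value, candidates):\n",
+ "    # Dates consistent with being told `value` as the month (part=0) or day (part=1).\n",
+ "    return {d for d in candidates if d[part] == value}\n",
+ "\n",
+ "def known(candidates):\n",
+ "    # A character knows the date when exactly one candidate remains.\n",
+ "    return len(candidates) == 1\n",
+ "\n",
+ "def statement1(date):\n",
+ "    # Ali, told the month, does not know, and knows that Bo, told the day, does not know either.\n",
+ "    ali = told(0, date[0], dates)\n",
+ "    return not known(ali) and all(not known(told(1, d[1], dates)) for d in ali)\n",
+ "\n",
+ "def statement2(date):\n",
+ "    # Bo didn't know at first, but does after discarding dates that fail statement 1.\n",
+ "    bo = told(1, date[1], dates)\n",
+ "    return not known(bo) and known({d for d in bo if statement1(d)})\n",
+ "\n",
+ "def statement3(date):\n",
+ "    # Ali, still knowing only the month, can now identify the date too.\n",
+ "    ali = told(0, date[0], dates)\n",
+ "    return known({d for d in ali if statement1(d) and statement2(d)})\n",
+ "\n",
+ "print({d for d in dates if statement1(d) and statement2(d) and statement3(d)})\n",
+ "```\n",
+ "\n",
+ "For these ten dates the sketch prints {('March', 18)}, in agreement with the table below.\n",
+ "\n",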
+ "# Responses to Second Prompt\n",
+ "\n",
+ "All the LLMs were generally headed in the right direction in their reasoning, but all made mistakes. For example, Claude says \"*Bo hears the day and realizes after Ali's statement. Since Bo did not initially know the date, the day number Bo heard must appear in more than one month. Therefore, the days 16, 18, and 19 must be eliminated since they have corresponding unique months.*\" But that's just not right; those days don't have unique months.\n",
+ "\n",
+ "As it turns out, [you.com](https://you.com) did get the right answer, March 18! But some of the reasoning steps were wrong, so I tested it on another set of 10 dates, and it failed on that. Thus I declare that all the LLMs fail on this problem.\n",
+ "\n",
+ "Here are the responses: \n",
+ "\n",
+ "|LLM|Answer|\n",
+ "|---|------|\n",
+ "|[A human programmer](https://github.com/norvig/)|**March 18**|\n",
+ "|[ChatGPT 4o](https://chatgpt.com/)|July 17|\n",
+ "|[Microsoft Copilot](https://copilot.microsoft.com/)|June 17|\n",
+ "|[Gemini Advanced](https://gemini.google.com/app)|July 16|\n",
+ "|[Meta AI Llama 405B](https://www.meta.ai/)|July 19|\n",
+ "|[Anthropic Claude 3.5 Sonnet](https://claude.ai/new)|July 17|\n",
+ "|[Perplexity](https://www.perplexity.ai/)|April 17|\n",
+ "|[Cohere Chat](https://cohere.com/chat)|July 17|\n",
+ "|[HuggingFace Chat](https://huggingface.co/chat/)|July 17|\n",
+ "|[You.com](https://you.com)|**March 18** (but wrong answer on follow-up problem)|\n"
+ ]
+ }
 ],
 "metadata": {

diff --git a/ipynb/Triplets.ipynb b/ipynb/Triplets.ipynb
index fedbcc6..3cbce73 100644
--- a/ipynb/Triplets.ipynb
+++ b/ipynb/Triplets.ipynb
@@ -11,13 +11,31 @@
 "\n",
 "My colleague [Wei-Hwa Huang](https://en.wikipedia.org/wiki/Wei-Hwa_Huang) posed the following problem to several AI large language model (LLM) chatbots: \n",
 "\n",
- "**List all the ways in which three distinct positive integers have a product of 108.**\n",
+ "- **List all the ways in which three distinct positive integers have a product of 108.**\n",
 "\n",
- "The LLM chatbots he tried all failed. I reran the experiment on more LLMs (and a human), and a few of them succeeded. I decided to add five words to the start of the prompt and test again:\n",
+ "All the LLMs he tried failed. I reran the experiment on more LLMs (and a human), and a few of them succeeded. I thought they might do better with this prompt:\n",
 "\n",
- "**Write a Python program to list all the ways in which three distinct positive integers have a product of 108.**\n",
+ "- **Write a Python program to list all the ways in which three distinct positive integers have a product of 108.**\n",
 "\n",
- "Here are my results, showing the solver, and whether it correctly solved each problem: \n",
+ "\n",
+ "\n",
+ "# TLDR: Conclusions\n",
+ "\n",
+ "Only 2 of the 9 LLMs solved the \"list all ways\" prompt, but 7 out of 9 solved the \"write a program\" prompt. **The language that a problem-solver uses matters!** Sometimes a natural language such as English is a good choice, sometimes you need the language of mathematical equations, or chemical equations, or musical notation, and sometimes a programming language is best. Written language is an amazing invention that has enabled human culture to build over the centuries (and also enabled LLMs to work). But human ingenuity has devised other notations that are more specialized but very effective in limited domains.\n",
+ "\n",
+ "Some more notes on the \"list all ways\" prompt:\n",
+ "\n",
+ "- The LLMs all started their answer by stating that 108 = 2 × 2 × 3 × 3 × 3, and then tried to partition those factors into three distinct subsets and report all ways to do so.\n",
+ "- So far so good!\n",
+ "- But some of them forgot that 1 could be a factor of 108 (or equivalently, that the empty set of factors is a valid subset).\n",
+ "- Some of them only forgot the triplets (1, 2, 54) and (1, 3, 36), but somehow got (1, 4, 27) and (1, 6, 18).\n",
+ "  - Perhaps the forgetting was because their attention mechanism didn't go back far enough?\n",
+ "- The models might have skipped 1 as a factor because 1 is not listed in the prime factorization, so it is easy to forget. But in programming, it is more natural to run a loop from 1 to *n* than from 2 to *n*; that's why I tried the \"write a program\" prompt (see the sketch after this list).\n",
+ "- Some of the models ignored the need for \"distinct\" integers, and proposed (3, 6, 6) or (1, 108, 1).\n",
+ "- Perplexity proposed (2, 4, 13.5) on the first run, but on a rerun proposed and then eliminated 13.5 to get the correct result.\n",
+ "\n",
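+ "To make the loop point concrete, here is a minimal brute-force sketch (mine, not one of the LLMs' programs); because the outer loop starts at 1, triples such as (1, 2, 54) cannot be overlooked:\n",
+ "\n",
+ "```python\n",
+ "N = 108\n",
+ "triples = [(a, b, N // (a * b))            # candidate third factor c = N / (a*b)\n",
+ "           for a in range(1, N + 1)\n",
+ "           for b in range(a + 1, N + 1)    # b > a keeps the integers distinct\n",
+ "           if N % (a * b) == 0 and N // (a * b) > b]\n",
+ "print(triples)  # all 8 triples, from (1, 2, 54) through (3, 4, 9)\n",
+ "```\n",
+ "\n",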
+ "Summary of which solver solved which problems:\n",
 "\n",
 "|Solver|\"List all ways\"|\"Write a program\"|\n",
 "|--|--|--|\n",
 "|[A human programmer](https://github.com/norvig/)|**yes**|yes|\n",
 "|[ChatGPT 4o](https://chatgpt.com/)|**yes**|yes|\n",
 "|[Microsoft Copilot](https://copilot.microsoft.com/)|no (6/8)|yes|\n",
 "|[Anthropic Claude 3.5 Sonnet](https://claude.ai/new)|no (6/8)|yes|\n",
- "|[Meta AI Llama 3](https://www.meta.ai/)|no (6/8)|**no** (permutations)|\n",
+ "|[Meta AI Llama 3](https://www.meta.ai/)|no (6/8)|**no** (extra permutations)|\n",
 "|[Perplexity](https://www.perplexity.ai/)|**yes**|yes|\n",
 "|[Cohere Chat](https://cohere.com/chat)|no (8 + 2 nondistinct)|**no** (0/8)|\n",
 "|[HuggingFace Chat](https://huggingface.co/chat/)|no (8 + 1 nondistinct)|yes|\n",
 "|[You.com](https://you.com/)|no (6/8)|yes|\n",
 "|**Total of LLMs**|**2/9 yes**|**7/9 yes**|\n",
 "\n",
- "# TLDR: Conclusions\n",
- "\n",
- "Only 2 of the 9 LLMs solved the \"list all ways\" prompt, but 7 out of 9 solved the \"write a program\" prompt. The language used to think about a problem matters! Sometimes a natural language such as English is a good choice, sometimes you need the language of mathematical equations, or chemical equations, or musical notation, and sometimes a programming language is best. Written language is an amazing invention that has enabled human culture to build over the centuries (and also enabled LLMs to work). But human ingenuity has divised other notations that are more specialized but very effective in limited domains.\n",
- "\n",
- "Some more notes:\n",
- "\n",
- "\n",
- "- The LLMs all started their answer to \"list all ways\" by stating that 108 = 2 × 2 × 3 × 3 × 3, and then tried to partition those factors into three distinct subsets and report all ways to do so. So far so good!\n",
- "- But some of them forgot that 1 could be a factor of 108 (or equivalently, that the empty set of factors is a valid subset).\n",
- "- Some of them only forgot the triplets (1, 2, 54) and (1, 3, 36), but somehow got (1, 4, 27) and (1, 6, 18).\n",
- " - Perhaps the forgetting was because their attention mechanism didn't go back far enough?\n",
- "- The models might have skipped 1 as a factor because 1 is not listed in the prime factorization, so it is easy to forget. But in programming, it is more natural to run a loop from 1 to *n* than from 2 to *n*; that's why I tried the \"write a program\" prompt.\n",
- "- Some of the models ignored the need for \"distinct\" integers, and proposed, (3, 6, 6) or (1, 108, 1).\n",
- "- Perplexity proposed (2, 4, 13.5) on the first run, but on a rerun proposed and then eliminated 13.5 to get the correct result.\n",
- "\n",
 "\n",
 "Below are the programs produced by all the solvers:\n",
 "\n",
 {
 "data": {
 "text/plain": [
- "[{1, 2, 54},\n",
- " {1, 3, 36},\n",
- " {1, 4, 27},\n",
- " {1, 6, 18},\n",
- " {1, 9, 12},\n",
- " {2, 3, 18},\n",
- " {2, 6, 9},\n",
- " {3, 4, 9}]"
+ "{(1, 2, 54),\n",
+ " (1, 3, 36),\n",
+ " (1, 4, 27),\n",
+ " (1, 6, 18),\n",
+ " (1, 9, 12),\n",
+ " (2, 3, 18),\n",
+ " (2, 6, 9),\n",
+ " (3, 4, 9)}"
 ]
 },
 "execution_count": 1,
 "source": [
 "from itertools import combinations\n",
 "from typing import *\n",
 "\n",
- "def find_products(k=3, n=108) -> List[Set[int]]:\n",
- " \"\"\"A list of all ways in which `k` distinct positive integers have a product of `n`.\"\"\" \n",
- " factors = {i for i in range(1, n + 1) if n % i == 0}\n",
- " return [set(ints) for ints in combinations(factors, k) if prod(ints) == n]\n",
+ "def find_products(k=3, N=108) -> Set[Tuple[int, ...]]:\n",
+ " \"\"\"The set of all ways in which `k` distinct positive integers have a product of `N`.\"\"\"\n",
+ " factors = {i for i in range(1, N + 1) if N % i == 0}\n",
+ " return {ints for ints in combinations(factors, k) if prod(ints) == N}\n",
 "\n",
 "find_products()"
 ]
 },
 {
 "data": {
 "text/plain": [
- "[{1, 2, 3, 4, 15},\n",
- " {1, 2, 3, 5, 12},\n",
- " {1, 2, 3, 6, 10},\n",
- " {1, 2, 4, 5, 9},\n",
- " {1, 3, 4, 5, 6}]"
+ "{(1, 2, 3, 4, 15),\n",
+ " (1, 2, 3, 5, 12),\n",
+ " (1, 2, 3, 6, 10),\n",
+ " (1, 2, 4, 5, 9),\n",
+ " (1, 3, 4, 5, 6)}"
 ]
 },
 "execution_count": 2,