Add files via upload
parent
1ed4af0843
commit
c11ac00c87
@ -9,41 +9,51 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"# The Languages of English, Math, and Programming\n",
|
"# The Languages of English, Math, and Programming\n",
|
||||||
"\n",
|
"\n",
|
||||||
"My colleague [Wei-Hwa Huang](https://en.wikipedia.org/wiki/Wei-Hwa_Huang) gave several AI chatbots this prompt: \n",
|
"My colleague [Wei-Hwa Huang](https://en.wikipedia.org/wiki/Wei-Hwa_Huang) posed the following problem to several AI large language model (LLM) chatbots: \n",
|
||||||
"\n",
|
"\n",
|
||||||
"**List all the ways in which three distinct positive integers have a product of 108.**\n",
|
"**List all the ways in which three distinct positive integers have a product of 108.**\n",
|
||||||
"\n",
|
"\n",
|
||||||
"I tested this prompt on the following solvers:\n",
|
"The LLM chatbots he tried all failed. I reran the experiment on more LLMs (and a human), and a few of them succeeded. I decided to add five words to the start of the prompt and test again:\n",
|
||||||
"- [A human programmer](https://github.com/norvig/)\n",
|
|
||||||
"- [Gemini Advanced](https://gemini.google.com/app)\n",
|
|
||||||
"- [ChatGPT 4o](https://chatgpt.com/)\n",
|
|
||||||
"- [Microsoft Copilot](https://copilot.microsoft.com/)\n",
|
|
||||||
"- [Anthropic Claude 3.5 Sonnet](https://claude.ai/new)\n",
|
|
||||||
"- [Meta AI Llama 3](https://www.meta.ai/)\n",
|
|
||||||
"- [Perplexity](https://www.perplexity.ai/)\n",
|
|
||||||
"- [Cohere Chat](https://cohere.com/chat)\n",
|
|
||||||
"- [HuggingFace Chat](https://huggingface.co/chat/)\n",
|
|
||||||
"- [You.com](https://you.com/)\n",
|
|
||||||
"\n",
|
|
||||||
"All the LLMs Wei-Hwa originally tried got this one wrong. From my expanded list, Gemini, ChatGPT 4o, You.com and the human got it right, and 5 other models made mistakes:\n",
|
|
||||||
"- The LLMs all started their answer by noting that 108 = 2 × 2 × 3 × 3 × 3, and then tried to partition those factors into three distinct subsets and report all ways to do so.\n",
|
|
||||||
"- So far so good.\n",
|
|
||||||
"- But most of them forgot that 1 could be a factor of 108 (or equivalently, that the empty set of factors is a valid subset). \n",
|
|
||||||
"- Some of the models ignored the need for \"distinct\" integers, and proposed, say, 3 × 6 × 6.\n",
|
|
||||||
"- Some got 5 or 6 correct triplets, and then stopped, perhaps because their attention mechanism didn't go back far enough.\n",
|
|
||||||
"- SOme even proposed non-integers as \"factors\".\n",
|
|
||||||
"\n",
|
|
||||||
"I thought that the models might have skipped 1 as a factor because 1 is not listed in the prime factorization, so it is easy to forget. But in programming, it is more natural to run a loop from 1 to *n* than from 2 to *n*, so this error would be less likely. Therefore, I decided to test all the models with the following prompt: \n",
|
|
||||||
"\n",
|
"\n",
|
||||||
"**Write a Python program to list all the ways in which three distinct positive integers have a product of 108.**\n",
|
"**Write a Python program to list all the ways in which three distinct positive integers have a product of 108.**\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# TLDR: Conclusion\n",
|
"Here are my results, showing the solver, and whether it correctly solved each problem: \n",
|
||||||
"\n",
|
"\n",
|
||||||
"The models did much better with this prompt. My conclusion is that the language used to solve a problem matters. Sometimes a natural language such as English is a good choice, sometimes you need the language of mathematical equations, or maybe chemical equations, and sometimes a programming language is best.\n",
|
"|Solver|\"List all ways\"|\"Write a program\"|\n",
|
||||||
|
"|--|--|--|\n",
|
||||||
|
"|[A human programmer](https://github.com/norvig/)|yes|yes|\n",
|
||||||
|
"|[Gemini Advanced](https://gemini.google.com/app)|no (4 + 1 nondistinct)|yes|\n",
|
||||||
|
"|[ChatGPT 4o](https://chatgpt.com/)|**yes**|yes|\n",
|
||||||
|
"|[Microsoft Copilot](https://copilot.microsoft.com/)|no (6/8)|yes|\n",
|
||||||
|
"|[Anthropic Claude 3.5 Sonnet](https://claude.ai/new)|no (6/8)|yes|\n",
|
||||||
|
"|[Meta AI Llama 3](https://www.meta.ai/)|no (6/8)|**no** (permutations)|\n",
|
||||||
|
"|[Perplexity](https://www.perplexity.ai/)|**yes**|yes|\n",
|
||||||
|
"|[Cohere Chat](https://cohere.com/chat)|no (8 + 2 nondistinct)|**no** (0/8)|\n",
|
||||||
|
"|[HuggingFace Chat](https://huggingface.co/chat/)|no (8 + 1 nondistinct)|yes|\n",
|
||||||
|
"|[You.com](https://you.com/)|no (6/8)|yes|\n",
|
||||||
|
"|**Total of LLMs**|**2/9 yes**|**7/9 yes**|\n",
|
||||||
|
"\n",
|
||||||
|
"# TLDR: Conclusions\n",
|
||||||
|
"\n",
|
||||||
|
"Only 2 of the 9 LLMs solved the \"list all ways\" prompt, but 7 out of 9 solved the \"write a program\" prompt. The language used to think about a problem matters! Sometimes a natural language such as English is a good choice, sometimes you need the language of mathematical equations, or chemical equations, or musical notation, and sometimes a programming language is best. Written language is an amazing invention that has enabled human culture to build over the centuries (and also enabled LLMs to work). But human ingenuity has divised other notations that are more specialized but very effective in limited domains.\n",
|
||||||
|
"\n",
|
||||||
|
"Some more notes:\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"- The LLMs all started their answer to \"list all ways\" by stating that 108 = 2 × 2 × 3 × 3 × 3, and then tried to partition those factors into three distinct subsets and report all ways to do so. So far so good!\n",
|
||||||
|
"- But some of them forgot that 1 could be a factor of 108 (or equivalently, that the empty set of factors is a valid subset).\n",
|
||||||
|
"- Some of them only forgot the triplets (1, 2, 54) and (1, 3, 36), but somehow got (1, 4, 27) and (1, 6, 18).\n",
|
||||||
|
" - Perhaps the forgetting was because their attention mechanism didn't go back far enough?\n",
|
||||||
|
"- The models might have skipped 1 as a factor because 1 is not listed in the prime factorization, so it is easy to forget. But in programming, it is more natural to run a loop from 1 to *n* than from 2 to *n*; that's why I tried the \"write a program\" prompt.\n",
|
||||||
|
"- Some of the models ignored the need for \"distinct\" integers, and proposed, (3, 6, 6) or (1, 108, 1).\n",
|
||||||
|
"- Perplexity proposed (2, 4, 13.5) on the first run, but on a rerun proposed and then eliminated 13.5 to get the correct result.\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"Below are the programs produced by all the solvers:\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# Human\n",
|
"# Human\n",
|
||||||
"\n",
|
"\n",
|
||||||
"A human (me) was able to correctly respond to the prompt:"
|
"A human (me) generated this correct solution:"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -123,7 +133,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"# Gemini Advanced\n",
|
"# Gemini Advanced\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Gemini produced three drafts, of which the following one was correct. In another draft, it had the line `k = product // (i * j)`, using integer division, which is incompatible with the `k.is_integer()` test. Here is the correct draft:"
|
"Gemini produced three drafts, of which the following one was correct. In another draft, it had the line `k = product // (i * j)`, using integer division, which is incompatible with the `k.is_integer()` test (maybe `int` should support `.is_integer`?). Here is the correct draft:"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -227,7 +237,7 @@
|
|||||||
"id": "17e74293-feab-4fff-b682-bc26823ebefa",
|
"id": "17e74293-feab-4fff-b682-bc26823ebefa",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"# Bing CoPilot\n",
|
"# Microsoft CoPilot\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Bing produces a very clean (but somewhat slower) `find_triplets` function."
|
"Bing produces a very clean (but somewhat slower) `find_triplets` function."
|
||||||
]
|
]
|
||||||
@ -513,7 +523,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"# You.com\n",
|
"# You.com\n",
|
||||||
"\n",
|
"\n",
|
||||||
"You.com produces a correct solution, with some nice optimizations that make it *O*(*n*<sup>5/6</sup>), whereas most of the solutions are *O*(*n*<sup>2</sup>). This means it can handle a 14-digit product in a second of run time, whereas the human-written solution can only handle 10-digit products in one second, while the HuggingChat version (for example) takes several seconds just to handle a 5-digit product."
|
"You.com produces a correct solution, with some nice optimizations that make it *O*(*n*<sup>5/6</sup>), whereas most of the solutions are *O*(*n*<sup>2</sup>). This means it can handle a 14-digit product in about a second of run time, whereas the HuggingChat version (for example) can only handle 5-digit products in that time."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -545,14 +555,6 @@
|
|||||||
"triplets = find_triplets(108)\n",
|
"triplets = find_triplets(108)\n",
|
||||||
"print(triplets)"
|
"print(triplets)"
|
||||||
]
|
]
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "fc16ca96-0b05-4ad4-82c0-552bb99373fd",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": []
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"metadata": {
|
"metadata": {
|
||||||
|
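As a cross-check on the "(6/8)" scores in the results table, here is a minimal brute-force sketch that enumerates the triplets for a product of 108. It is not one of the solver programs from the notebook; the function name `distinct_triplets` is hypothetical and chosen only for illustration. The loops start at 1, as the notes in the diff suggest, so triplets such as (1, 2, 54) are not skipped, and requiring a < b < c enforces distinctness while avoiding permutations of the same triplet.

```python
def distinct_triplets(product):
    """Return all (a, b, c) with a < b < c and a * b * c == product."""
    triplets = []
    for a in range(1, product + 1):       # start at 1, not 2
        if product % a:                   # a must divide the product
            continue
        rest = product // a
        for b in range(a + 1, rest + 1):  # b > a keeps the integers distinct
            if rest % b:                  # b must divide what is left
                continue
            c = rest // b
            if c > b:                     # c > b rules out repeats and permutations
                triplets.append((a, b, c))
    return triplets


print(distinct_triplets(108))
```

Run on 108, this sketch yields eight triplets, which matches the denominator used in the table. It makes no attempt at the optimizations the diff credits to the You.com solution.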