Putnam, Proof Assistants, and the Moving Boundary of Intelligence
Our idea of intelligence has never been static.
There was a time when being able to compute large numbers quickly was considered a sign of exceptional intelligence. Then calculators — and later computers — made that skill trivial. Chess was once held up as a pinnacle of human strategic thinking, until machines surpassed world champions. Go followed. More recently, AI systems have achieved gold-medal–level performance on International Mathematical Olympiad problems, particularly in geometry.
Each time, the story is the same: we declare a task as a marker of intelligence, automate it, and then quietly move the goalposts.
Seen in this light, the question is not whether AI will eventually solve Putnam problems, but what solving them will actually mean.
This is why a recent paper introducing PUTNAMBENCH caught my attention:
https://arxiv.org/abs/2407.11214
What is PUTNAMBENCH?
PUTNAMBENCH is a benchmark built from 640 problems from the William Lowell Putnam Mathematical Competition, formally stated in proof assistants. In total, it contains 1,692 hand-written formalizations across Lean 4, Isabelle, and Coq.
You can explore the dataset directly here:
https://github.com/trishullab/PutnamBench
The formal setting is the detail that matters.
Unlike informal math benchmarks where a model only needs to output a correct final answer, PUTNAMBENCH requires fully formal, machine-verifiable proofs. Every step must type-check. Every inference must be explicit. There is no room for hand-waving or “this is obvious.”
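To make that concrete, here is a hypothetical statement written in the style of a Lean 4 formalization. It is my own toy example, not an entry from the dataset, and real PUTNAMBENCH statements are more involved, but it shows the shape of the task: the benchmark supplies the formal statement, and a system must replace the placeholder with a proof the proof assistant accepts in full.

```lean
import Mathlib

-- Hypothetical toy statement in the style of a formal Lean 4 problem
-- (illustrative only; not an actual PUTNAMBENCH entry).
theorem toy_putnam_style (n : ℕ) : 2 ∣ n * (n + 1) := by
  sorry -- a prover must replace `sorry` with a proof that type-checks end to end
```

Until `sorry` is gone, Lean refuses to certify the theorem; partial credit does not exist.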
The benchmark is also carefully designed to avoid data leakage: the formalizations are newly hand-written, so formal proofs of these statements did not exist before, and no solutions are included in the dataset. Models must genuinely construct proofs, not recall them.
In short, PUTNAMBENCH is not testing whether AI can recognize mathematical patterns — it is testing whether AI can reason formally about undergraduate-level mathematics.
The Results: A Reality Check
The authors evaluate a range of state-of-the-art systems, including large language models, neural-symbolic provers, and powerful symbolic automation tools.
The outcome is stark.
| System / Method | Proof Assistant | Problems Solved |
|---|---|---|
| GPT-4 | Lean 4 | 1 / 640 |
| COPRA | Lean 4 | 1 / 640 |
| ReProver (with or without retrieval) | Lean 4 | 0 / 640 |
| Draft-Sketch-Prove | Isabelle | 4 / 640 |
| Sledgehammer | Isabelle | 3 / 640 |
| GPT-4 | Coq | 1 / 412 |
| COPRA | Coq | 1 / 412 |
| CoqHammer / Tactician | Coq | 0 / 412 |
Across all languages and all methods combined, only six problems in PUTNAMBENCH are solved — well under 1% success.
Even more telling: the problems that are solved are among the easiest ever asked on the Putnam exam, often admitting very short natural-language proofs. Harder problems, involving deeper constructions or multi-step insight, remain completely out of reach.
This is not a failure of arithmetic, algebraic manipulation, or syntax.
It is a failure of strategy, insight, and lemma discovery.
Why Putnam Problems Are Different
Putnam problems are not long or computationally heavy. Their difficulty usually comes from one or two key ideas:
- choosing the right construction,
- introducing the right auxiliary object,
- spotting a hidden symmetry,
- or knowing which lemma needs to exist before you can prove anything at all.
The paper identifies two core weaknesses in current systems:
- Failure to synthesize new lemmas
- Failure to orchestrate long-horizon proof strategies using existing libraries
Formal systems are incredible amplifiers once the structure is in place — but deciding what structure to build remains hard.
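A toy sketch of my own (not a benchmark problem) may help illustrate the lemma-discovery gap. In the Lean 4 fragment below, the main inequality is closed by routine automation once someone supplies the auxiliary fact that a particular square is nonnegative; identifying that auxiliary object is the creative step a prover has to take on its own.

```lean
import Mathlib

-- Toy sketch of lemma discovery (not a benchmark problem).
-- A human first records the key observation as a standalone lemma...
lemma key_square_nonneg (a b : ℝ) : 0 ≤ (a - b) ^ 2 := sq_nonneg _

-- ...and once that hint exists, routine automation finishes the proof.
theorem two_mul_le_sq_add_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [key_square_nonneg a b]
```

The hint `(a - b) ^ 2` is exactly the kind of auxiliary object the paper says current systems fail to synthesize: nothing in the goal statement names it, yet the proof hinges on it.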
Will AI Eventually Solve Putnam Problems?
Almost certainly — yes.
Nothing in PUTNAMBENCH suggests a fundamental barrier. These failures are not impossibility results; they are engineering limitations:
- search efficiency,
- representation,
- lemma discovery,
- and integration with existing mathematical libraries.
History strongly suggests that these obstacles will be overcome.
But that is not the most interesting takeaway.
Intelligence as a Moving Interface
PUTNAMBENCH does not define intelligence.
It marks a current boundary between what humans do internally and what we have managed to externalize into tools.
When AI clears this benchmark — and it likely will — Putnam problems will not stop mattering. They will become:
- training data,
- scaffolding,
- infrastructure,
- and a shared substrate for human–AI collaboration.
This is exactly what happened in chess. Computers did not kill the game; they transformed it. Human players became stronger, theory deepened, and the nature of expertise changed.
The same will happen in mathematics.
AI as Amplifier, Not Driver
Even when AI can produce Putnam-level proofs, it will not:
- decide which problems are worth studying,
- invent new mathematical frameworks,
- choose research directions,
- or care about elegance, meaning, or explanation.
Those choices remain human.
AI is an amplifier — a force multiplier — not a driver.
PUTNAMBENCH matters because it shows us where we are right now, not because it draws a permanent line.
Aside: Why Isabelle Does Slightly Better Here
Isabelle’s marginally better performance in PUTNAMBENCH is not due to deeper mathematical reasoning, but to its design philosophy. Isabelle emphasizes declarative proofs and strong symbolic automation, most notably through Sledgehammer, which can call external automated theorem provers to discharge certain goals. In PUTNAMBENCH, this advantage appears only in a narrow class of problems — typically short algebraic statements about sets with binary operations.
The authors stress that this advantage does not scale: even with Sledgehammer, overall success rates remain below 1%, and the vast majority of problems remain unsolved across all systems.
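To give a feel for that narrow class, here is a sketch in Lean 4 (kept in Lean for consistency with the earlier snippets; it is my own illustration modeled on a classic Putnam-style exercise, not one of the benchmark's formalizations). The statement concerns a set with a binary operation satisfying a single identity, and the entire proof consists of two instantiations of that identity, shallow first-order equational reasoning of the sort that the external provers behind Sledgehammer are built for.

```lean
import Mathlib

-- Illustrative sketch (not a PUTNAMBENCH formalization): a short algebraic
-- statement about a set with a binary operation. Given op (op a b) a = b
-- for all a and b, we derive op a (op b a) = b from two instances of it.
example {α : Type*} (op : α → α → α)
    (h : ∀ a b : α, op (op a b) a = b)
    (a b : α) : op a (op b a) = b := by
  have key : op (op b a) b = a := h b a
  calc op a (op b a) = op (op (op b a) b) (op b a) := by rw [key]
    _ = b := h (op b a) b
```

Nothing here requires induction, a clever construction, or a new definition, which is precisely why automation can reach it.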