Are AI Benchmarks Measuring Anything?

In the autumn of 1904, a horse named Hans stood in a Berlin courtyard and multiplied twelve by three.

He tapped his hoof thirty-six times on the cobblestones. He stopped. The crowd erupted. Journalists filed stories under headlines that sound, to modern ears, like satire. "The First Thinking Animal," one paper called him.

But here's the detail that matters: Hans wasn't performing for rubes at a county fair. The German Board of Education convened a formal scientific panel to investigate. Thirteen men—zoologists, psychologists, veterinarians, a cavalry officer, a circus manager—watched the horse work. They concluded his abilities were genuine. No tricks. No signals. The horse could think.

It took another three years and a psychologist named Oskar Pfungst to figure out what was actually happening. And what Pfungst discovered was not that Hans was stupid. Hans was brilliant. Just not at arithmetic.

He was brilliant at something far more interesting—and it's the same something that the most powerful AI systems on earth are brilliant at right now.

The Trillion-Dollar Score Nobody Checked

One hundred and twenty years after Clever Hans tapped his hooves in Berlin, a piece of software called o3 solved 87.5 percent of the problems on a benchmark called ARC—the Abstraction and Reasoning Corpus—a test its creator, François Chollet, designed to measure "a human-like form of general fluid intelligence."

87.5 percent. The AI press treated this the way the German newspapers treated Hans. A triumph. A threshold crossed. General intelligence on the horizon.

The cost of that triumph: roughly five thousand dollars per puzzle. The puzzles themselves are deceptively simple—colored grids where you spot a pattern in three examples and apply it to a fourth. A bright ten-year-old can do most of them in seconds. For free.

Figure: An ARC-style puzzle. The rule is a left-right flip, and a child names it on sight. The test isn't whether a system gets the fourth grid right, but whether it found that rule or some irrelevant pattern that happens to fit.

But that's not the interesting part. The interesting part is what happens when you stop marveling at the score and start asking how the score was achieved. Because when you do that—when you apply the kind of rigor that Oskar Pfungst brought to a courtyard in Berlin—the story changes completely.

It changes because o3 wasn't solving the puzzles the way a human solves them. And the difference isn't a technicality. It's the difference between a pilot and a passenger.

What the Right Answer Hides

Here's how benchmark testing works in the AI industry. You take a model. You run it on a standardized test. You report the accuracy. You move on to the next press release.

You do not, typically, ask the model to explain its reasoning. You don't run the same problem twice to see if the answer changes. You don't tweak the problem in small ways that shouldn't matter—ways that wouldn't fool a child—to see if the model collapses. You don't test whether the skill the model demonstrated on Tuesday transfers to a slightly different context on Wednesday.

Psychologists have a term for what those skipped checks would establish. They call it "construct validity": whether a test actually measures the thing it claims to measure. An IQ test has construct validity to the extent that scoring well on it predicts the broader ability we call intelligence. A math benchmark has construct validity to the extent that acing it predicts the ability to do math.

Most AI benchmarks have terrible construct validity. They measure something. But the something they measure isn't the something on the label.

There are many reasons for this. Training data contamination—the answers were in the homework. Approximate retrieval—a term coined by the computer scientist Subbarao Kambhampati, meaning the model can interpolate from similar problems it has seen without possessing the general capability the test was designed to assess. Spurious shortcuts—the model exploits unintended patterns in the data to get the right answer for the wrong reason, the way a student notices that on a multiple-choice exam, the longest option is usually correct.
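The multiple-choice shortcut is easy to check for directly. The sketch below, in Python, scores a strategy that always picks the longest option against a tiny made-up answer key and compares it with guessing at random. The items, the names ITEMS, longest_option_accuracy and chance_accuracy, and the data layout are all illustrative assumptions, not drawn from any real benchmark.

```python
from typing import Dict, List

# Hypothetical answer key: placeholder options, not items from a real benchmark.
ITEMS: List[Dict] = [
    {"options": ["yes", "only when both conditions hold", "no", "never"], "answer": 1},
    {"options": ["the outcome depends on the context", "two", "ten", "none"], "answer": 0},
    {"options": ["red", "blue", "green", "a mixture of all three colors"], "answer": 3},
    {"options": ["up", "down", "whichever direction is shorter", "left"], "answer": 0},
]


def longest_option_accuracy(items: List[Dict]) -> float:
    """Fraction of items where the longest option is the keyed answer."""
    hits = 0
    for item in items:
        options = item["options"]
        # Index of the longest option; ties go to the earliest one.
        guess = max(range(len(options)), key=lambda i: len(options[i]))
        hits += guess == item["answer"]
    return hits / len(items)


def chance_accuracy(items: List[Dict]) -> float:
    """Expected accuracy of picking an option uniformly at random."""
    return sum(1 / len(item["options"]) for item in items) / len(items)


if __name__ == "__main__":
    print("longest-option strategy:", longest_option_accuracy(ITEMS))
    print("random guessing:", chance_accuracy(ITEMS))
```

If a strategy this blind clears chance by a wide margin on a real test, a high score on that test says little about the skill on the label.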

But none of these failure modes are as revealing as what happens when you simply ask the model: "Tell me what rule you used."

The Rule That Wasn’t There

On a simplified version of ARC called ConceptARC, researchers asked o3, Claude, and Gemini not only to produce the correct output grid but to describe, in plain language, the rule governing the transformation.

The results were striking. When humans got the right answer, they stated the intended rule about ninety percent of the time. The horizontal bars become vertical. The inside color swaps with the outside color. The kind of rule a child would articulate.

When the AI models got the right answer, they stated the intended rule about seventy percent of the time. The other thirty percent, they described rules that were either wrong or "correct but unintended"—rules that happened to produce the right output on this particular puzzle but captured none of the underlying concept. Rules about the numerical properties of the integers used to encode colors, for instance—a feature designed to be irrelevant, and one that's literally invisible to the humans taking the test on a visual display.

Why does that matter? In roughly three of every ten puzzles the machine got right, the rule it described was not the one the puzzle was built around. At least some of the time, it was pattern-matching on artifacts of the test format rather than grasping the idea the test was built to assess.
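A toy version makes the point concrete. In the Python sketch below, two rules give the same output on one grid. One is the kind of rule a person would state ("paint everything the most common color"). The other keys on the integer used to encode the color ("paint everything the smallest number"). Relabel the colors, which changes nothing about the structure a human sees, and the two rules part ways. The grid, both rules, and every function name here are hypothetical illustrations, not taken from ConceptARC.

```python
from collections import Counter
from typing import Dict, List

Grid = List[List[int]]  # colors encoded as integers, as in ARC-style grids


def fill_most_common(grid: Grid) -> Grid:
    """Intended-style rule: paint every cell with the most frequent color."""
    counts = Counter(cell for row in grid for cell in row)
    color = counts.most_common(1)[0][0]
    return [[color for _ in row] for row in grid]


def fill_smallest_code(grid: Grid) -> Grid:
    """Unintended rule: paint every cell with the numerically smallest code."""
    color = min(cell for row in grid for cell in row)
    return [[color for _ in row] for row in grid]


def recode(grid: Grid, mapping: Dict[int, int]) -> Grid:
    """Relabel the colors; the layout of the grid is untouched."""
    return [[mapping[cell] for cell in row] for row in grid]


# Hypothetical puzzle: color 1 is both the majority color and the smallest code.
PUZZLE: Grid = [[1, 1, 4],
                [1, 4, 1],
                [1, 1, 1]]
SWAP = {1: 4, 4: 1}

if __name__ == "__main__":
    # On the original encoding the two rules are indistinguishable by score.
    print(fill_most_common(PUZZLE) == fill_smallest_code(PUZZLE))
    # After swapping the two color codes they disagree.
    swapped = recode(PUZZLE, SWAP)
    print(fill_most_common(swapped) == fill_smallest_code(swapped))
```

An accuracy number computed on the original encoding cannot tell these two rules apart. Only the stated rule, or a re-encoded puzzle, can.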

Clever Hans, tapping his hooves in Berlin, couldn't multiply twelve by three. He could read micro-expressions on human faces with a sensitivity that surpassed the humans themselves. He got the right answer. He got it by doing something completely different from what his audience assumed.

The question is whether anyone building their corporate strategy on these benchmarks has bothered to check which kind of "right" they're getting. And that question leads somewhere darker—because the AI industry has constructed an elaborate set of reasons not to look.

The Six Questions Nobody Is Asking

Cognitive science has spent more than a century developing the experimental toolkit for exactly this situation—evaluating alien intelligences whose mechanisms you don't understand. Babies. Chimpanzees. Corvids. Organisms that can't tell you how they think, and whose cognitive architecture looks nothing like yours. The methodological rigor is hard-won. It was purchased with decades of embarrassment.

Developmental psychologist Michael Frank put it simply: "Imagine first contact with an alien intelligence. A scientist might ask, do the aliens have the same concepts as humans? Do they understand other minds? Can they reason about cause and effect?" Frank pointed out that developmental psychologists have been running this exact experiment for decades—on babies. And the methods work.

From this toolkit, six principles emerge. None of them are routinely applied to AI evaluation. Together they constitute something like a diagnostic for intellectual honesty.

First: Know your biases. Humans project intention onto everything. We see a smiling robot and feel warmth. We read a fluent paragraph and assume understanding. Psychologists call this the ELIZA effect, after a 1960s chatbot whose users credited it with understanding. It's the Clever Hans problem turned inside out: not the machine fooling us, but us fooling ourselves.

Second: Design adversarial controls. If your model aces a test, your next move should be to figure out how it could have aced the test without actually possessing the skill you think it proved. Pfungst did this with Hans. He blocked the horse’s view of the questioner’s face. Performance collapsed. In AI evaluation, this kind of adversarial testing is the exception. It should be the default.

Third: Test robustness through variation. A system that can solve letter-string analogies with the standard alphabet but falls apart when you swap three letters isn't doing analogy. It's doing lookup. Researchers tested GPT models this way—creating "counterfactual alphabets" and other minor variations. Humans handled them fine. The models cratered.

Fourth: Investigate mechanisms. The stated rules from ConceptARC are a crude version of this—asking not just "Did you get it right?" but "How did you get it right?" The AI industry has remarkably little curiosity about this question. It publishes accuracy numbers the way a gambler reports only wins.

Fifth: Distinguish performance from competence. A baby who doesn't search for a hidden toy may not lack the concept of object permanence. She may lack the motor coordination to lift the blanket. A model that fails a visual reasoning task may understand the rule perfectly but be unable to parse the grid dimensions from an image. Accuracy alone can't tell you which problem you are observing.

Sixth: Embrace failure analysis. The twelve percent a model gets wrong is more informative than the eighty-eight percent it gets right. A 2024 study with the evocative title "Embers of Autoregression" showed that GPT-4’s error patterns traced directly to biases inherited from its training process—like vestigial organs from an earlier evolutionary stage. The errors weren't random. They were signatures.
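The third principle, robustness through variation, is the easiest of the six to sketch in code. The Python below builds one letter-string analogy twice: once over the standard alphabet, once over a "counterfactual" alphabet in which two letters trade places. The prompt wording, the choice to state the alphabet inside the prompt, and the names make_item and swap_letters are assumptions made for illustration, not the published researchers' materials.

```python
import string
from typing import Dict, List, Tuple

STANDARD = string.ascii_lowercase


def swap_letters(alphabet: str, pairs: List[Tuple[str, str]]) -> str:
    """Return a counterfactual alphabet with each pair of letters exchanged."""
    letters = list(alphabet)
    for a, b in pairs:
        i, j = letters.index(a), letters.index(b)
        letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)


def make_item(alphabet: str, src: int, tgt: int, length: int = 4) -> Dict[str, str]:
    """Build an 'abcd -> abce; ijkl -> ?' item under the given letter order."""

    def run(start: int) -> str:
        return alphabet[start:start + length]

    def bump(start: int) -> str:
        # Replace the last letter of the run with its successor in this alphabet.
        return alphabet[start:start + length - 1] + alphabet[start + length]

    prompt = (
        f"Alphabet: {' '.join(alphabet)}\n"
        f"{run(src)} -> {bump(src)}; {run(tgt)} -> ?"
    )
    return {"prompt": prompt, "answer": bump(tgt)}


if __name__ == "__main__":
    original = make_item(STANDARD, src=0, tgt=8)
    variant = make_item(swap_letters(STANDARD, [("e", "m")]), src=0, tgt=8)
    for item in (original, variant):
        print(item["prompt"], "| expected:", item["answer"])
```

The rule is identical in both items: replace the last letter with its successor. Only the alphabet differs. A system that answers the second item with the standard alphabet's successor has retrieved a familiar pattern rather than applied the rule to the alphabet in front of it.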

Six principles. None of them exotic. All of them standard practice in the sciences that study minds professionally. And yet the industry that claims to be building minds has adopted almost none of them. The reason has less to do with methodology than with incentive.

The Bouncing Ball Problem

In 2007, a team of psychologists published a landmark study in Nature claiming that six-month-old infants could distinguish helpful from harmful behavior—that they preferred puppets who pushed a wooden circle up a hill over puppets who pushed it down.

Five years later, another team, led by Damian Scarf, noticed something the original researchers had missed. In the "helper" episodes, the circle bounced at the top of the hill when it arrived. In the "hinderer" episodes, there was no bounce. Was the baby choosing the kind puppet, or the fun one?

They ran the experiment again with the bounce controlled for. When both episodes featured a bounce, infants chose helper and hinderer equally. The preference vanished.

The original authors fired back with new experiments. A third group ran a massive replication across thirty-seven labs with hundreds of infants. No significant preference for helpers over hinderers.

This is what science looks like when it's working. Hypothesis. Challenge. Revision. Counter-challenge. Large-scale replication. It took seventeen years. It required multiple teams willing to publish results that made the original finding less exciting, not more. In psychology, they call these "killjoy explanations." They're the discipline’s immune system.

The AI research community has no equivalent immune system. Its publication culture, centered on conferences rather than journals, rewards novelty and penalizes replication. Calling a study "incremental" is a death sentence for a submission. Calling a result "negative" means the reviewer stops reading. The incentives select for announcements over investigation, for benchmarks over understanding.

Which brings us back to the courtyard in Berlin, and to the question that connects Clever Hans to o3 to the babies on the hillside. It's the question the entire AI industry is structured to avoid.

What Kind of Smart

In 2022, the neuroscientist Terrence Sejnowski posed the question plainly: "Some aspects of their behavior appear to be intelligent, but if it’s not human intelligence, what is the nature of their intelligence?"

Oskar Pfungst answered this question about Clever Hans in 1907. The horse couldn't do arithmetic. But he could read faces with a precision that embarrassed the scientists who studied him. Even when Pfungst tried to suppress his own micro-expressions, the horse caught them. Hans was genuinely brilliant. He was brilliant at something no one was testing for.

The modern question isn't whether AI models are smart. They clearly are. The question is what kind of smart. And answering that question requires exactly the thing the industry has the least incentive to do: slow down, design adversarial experiments, test for robustness, examine mechanisms, analyze failures, and publish the results even when—especially when—they're unflattering.

The developmental psychologists studying infant cognition understand this. As one researcher wrote, "The study of infant cognition requires a high level of creativity in the creation and testing of alternative explanations." Swap "infant" for "machine" and the sentence loses nothing.

We don't need harder benchmarks. We need better questions. The kind of questions that Pfungst asked in 1907, that Scarf asked in 2012, that the thirty-seven labs asked in 2024. The kind of questions that treat a high score not as a conclusion but as the beginning of an investigation.

Because the alternative—the alternative is to keep applauding the horse.