The Kiwis That Broke the Smartest AI

Oliver picks 44 kiwis on Friday. He picks 58 on Saturday. On Sunday he picks double Friday's haul—but five of them were a bit smaller than average. How many kiwis does Oliver have?

A ten-year-old gets this. The smaller kiwis are still kiwis. You add them up: 190.
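
For readers who want the sum spelled out, here is a minimal Python sketch of the arithmetic. The variable names are illustrative and not taken from any study. The detail about size is recorded but deliberately left out of the total, because it does not change how many kiwis there are.

```python
# Kiwi counts from the word problem; variable names are illustrative.
friday = 44
saturday = 58
sunday = 2 * friday  # "double Friday's haul" comes to 88

# The distractor: five of Sunday's kiwis were smaller than average.
# Size does not change the count, so this value never enters the sum.
smaller_on_sunday = 5

total = friday + saturday + sunday
print(total)  # 190
```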

In 2024, researchers at Apple fed versions of this problem to the most sophisticated artificial intelligence systems ever constructed—systems that had aced graduate-level science exams, that wrote working software, that some of the smartest people alive insisted were on the verge of out-thinking humanity. The researchers made one modification. They added the irrelevant detail. Five smaller kiwis.

The machines subtracted them.

Not all of them, not every time. But across model after model, the introduction of a fact that changed nothing caused performance to crater. The Apple team ran the experiment again in 2025 with newer, better, more expensive models.

Same result.

Here's the detail that matters: these are the same systems earning gold-medal scores on International Mathematical Olympiad problems—problems that defeat all but a few hundred humans on the planet. The machine that can scale the Olympiad can be knocked over by a kiwi.

Any theory of what these machines are has to explain both facts at once. Almost nobody selling you the machines is even trying.

The Jagged Edge

In 2023, a few months after ChatGPT detonated in public, the neuroscientist Terrence Sejnowski—a pioneer of the very neural networks underneath these systems—wrote something that read less like a technical paper than a dispatch from a first-contact scenario. A threshold had been crossed, he said, "as if a space alien suddenly appeared that could communicate with us in an eerily human way." Then he asked the question that the entire industry has spent three years not answering: some of the behavior appears intelligent, but if it isn't human intelligence, what is it?

The honest answer turns out to be a shape. The capabilities of these systems, mapped out, do not form the smooth hill of human ability—where being good at one thing reliably predicts being decent at its neighbors. They form a mountain range of absurd peaks and inexplicable crevasses. The industry now has a term for it: jagged intelligence.

Consider the catalog of contradictions. A system that produces accurate, incisive summaries of books will produce equally confident, equally authoritative summaries of books that don't exist. A system that answers a question perfectly when worded one way collapses when the same question is worded another way—a difference no human would even register. A system trained at staggering expense to refuse dangerous requests can be talked past its own guardrails by a clever teenager with an afternoon to kill.

In a human, this profile would be impossible. In these machines, it’s the default.

And here’s the part the press releases omit: the people who built the machines know it. In late 2025, Ilya Sutskever—co-founder of OpenAI, one of the handful of researchers most responsible for the modern AI era—sat for a rare interview and said the quiet part at conversational volume: "These models somehow just generalize dramatically worse than people. It's super obvious." Not a bug, he said. Not a tuning problem. "A very fundamental thing."

The man who helped build the cathedral is telling you the foundation is strange. To understand why, you have to go down into the basement.

The Machine in the Basement

Strip away the friendly chat window and what remains is a prediction engine of monstrous scale.

The system is shown a fragment of text with the last word missing: "The professor gave the student a lower grade than he..." Its entire job, the whole of its training and the thing its hundreds of billions of internal dials are tuned to do, is to assign a probability to every word it knows. "Deserved" scores high. "Expected" scores high. "Grapes" does not. Guess, check, adjust the dials, repeat. Across essentially all the text its makers could acquire: the web, the libraries, the transcripts of a civilization talking to itself.
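
The guess, check, adjust loop can be made concrete with a toy sketch in Python. Everything here is an assumption made for illustration: the three candidate words stand in for a vocabulary of many thousands, the raw scores are invented numbers rather than the output of a real network, and the adjustment nudges the scores directly, whereas a real system adjusts the internal dials that produce those scores.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def adjust(scores, target, lr=1.0):
    """One gradient step on cross-entropy loss, taken on the raw scores."""
    probs = softmax(scores)
    return {
        word: s - lr * (probs[word] - (1.0 if word == target else 0.0))
        for word, s in scores.items()
    }

# Hypothetical raw scores for three candidate next words (invented numbers).
scores = {"deserved": 2.0, "expected": 1.8, "grapes": -3.0}

before = softmax(scores)                             # guess
after = softmax(adjust(scores, target="deserved"))   # check and adjust

for word in scores:
    print(f"{word:>9}: {before[word]:.3f} -> {after[word]:.3f}")
```

When the text actually continues with "deserved," the step raises that word's probability and lowers the others. Training repeats that small correction an enormous number of times.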

That this process produces fluent, useful, frequently dazzling language is one of the genuine scientific surprises of the century. But the raw engine is not what you talk to. What you talk to has been finished—polished by thousands of human workers, many in low-income countries, who grade the machine's outputs, rank its answers, and teach it to be agreeable, to refuse the forbidden, to sound less like a prediction engine and more like a colleague. Some of those workers sift through the worst text and imagery the internet has produced, at real psychological cost, so that the product stays clean.

The thing on your screen, in other words, is not simply a language model. It’s an industrial system wearing a conversational face. And the face is doing more work than anyone admits—because the face is what convinces you there's somebody home.

Whether there’s somebody home is now the most expensive open question in the history of capitalism. And the people best positioned to answer it can’t agree.

The Heat Death Wager

On one side: Sutskever, who has argued that to get truly good at predicting the next word, a system must learn "a world model"—that compressed in those billions of dials are more and more aspects of the world, of people, "their hopes, dreams, and motivations." Geoffrey Hinton, the field's Nobel-decorated godfather, goes further: these systems will be much more intelligent than us, and soon.

On the other side: Yann LeCun—who shared the Turing Award with Hinton—writing in 2022 that a system trained on language alone "will never approximate human intelligence, even if trained from now until the heat death of the universe."

Pause on that. The pioneers of the same technology, honored with the same prize, disagree about whether the current path leads to minds or to a dead end—and the disagreement is not at the margins. It’s total. It is the difference between inevitable and never.

LeCun's case rests on something easy to state and hard to refute. Humans are not passive predictors of next tokens. We are active seekers: curious, embodied, intervening in the world, caring about consequences. A toddler runs experiments all day long: drops the spoon, watches it fall, drops it again. A large language model has no body, no self, no stake. Its entire acquaintance with reality is secondhand, filtered through text: a description of the world, mistaken for the world.

The jaggedness, in this view, isn't a glitch awaiting a patch. It's the signature of a system whose internal model of reality—whatever it is—is nothing like ours. Trillions of dollars are currently deployed on the proposition that this doesn't matter. The wager is being placed before the question is settled.

If you want to know how that tends to go, the field has helpfully kept records.

The Prediction Graveyard

In 1965, Herbert Simon—future Nobel laureate, one of AI's founding fathers—predicted that machines would be capable, within twenty years, of doing any work a man can do. Twenty years later the field was in a funding winter so deep it had its own name.

In 2016, Hinton stood before an audience and said: "People should stop training radiologists now—it's just completely obvious within five years deep learning is going to do better than radiologists." Medical students listened. Some changed careers.

A decade out, the scoreboard reads as follows. American radiology residency programs offered a record number of positions in 2025. Vacancy rates hit all-time highs. Radiology became the second-highest-paid medical specialty in the country, average income up nearly fifty percent since the prediction. Hinton, to his credit, has conceded he spoke too broadly. The machine got very good at the task—reading the scan. The job turned out to be something else entirely.

The predictions, undeterred, kept coming. In March 2025, Dario Amodei, CEO of Anthropic, said AI would write 90 percent of code within three to six months, essentially all of it within twelve. The months elapsed; software engineers kept showing up to work. In January 2026, at Davos, Amodei offered a fresh deadline: six to twelve months until models do "most, maybe all" of what software engineers do, end to end. Sam Altman of OpenAI, meanwhile, can easily imagine 30 to 40 percent of the tasks in the economy handled by AI in the not very distant future.

Maybe this time the deadline holds. But notice the mechanism underneath every one of these forecasts. Each extrapolates from benchmark performance—and benchmark performance has a documented habit of lying. The test questions leak into the training data: GPT-4 sailed through coding problems published before its training cutoff and stumbled badly on problems published after, which is not what mastery looks like; it's what memorization looks like. Systems pass for the wrong reasons: one celebrated medical network turned out to be detecting cancer partly by detecting rulers, because malignant lesions get photographed next to measuring tape.
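
The before-and-after comparison behind the leakage finding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name, the records, and the cutoff date are all hypothetical placeholders, not data from any real benchmark or model.

```python
from datetime import date

def pass_rate(outcomes):
    """Fraction of problems solved, or None if there are no problems."""
    return sum(outcomes) / len(outcomes) if outcomes else None

def pass_rates_by_cutoff(results, cutoff):
    """Split (published, solved) records at the cutoff and score each side."""
    before = [solved for published, solved in results if published < cutoff]
    after = [solved for published, solved in results if published >= cutoff]
    return pass_rate(before), pass_rate(after)

# Invented placeholder records: (publication date, did the model solve it?).
results = [
    (date(2021, 3, 1), True),
    (date(2021, 6, 1), True),
    (date(2021, 8, 1), True),
    (date(2021, 8, 15), False),
    (date(2021, 10, 1), False),
    (date(2021, 11, 1), False),
    (date(2022, 1, 1), True),
    (date(2022, 2, 1), False),
]
cutoff = date(2021, 9, 1)  # hypothetical training cutoff

before_rate, after_rate = pass_rates_by_cutoff(results, cutoff)
print(before_rate, after_rate)  # 0.75 0.25
```

A large gap between the two rates is a warning sign that the model has seen the older problems during training. It is not proof, since newer problems may also simply be harder.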

The scholars Sayash Kapoor and Arvind Narayanan put their finger on the deeper flaw: the easier a task is to capture in a benchmark, the less it resembles the contextual, open-ended work that defines an actual profession. We test what's testable, then forecast what isn't.

So the machines are jagged, the experts are split, and the predictions keep dying of natural causes. Which raises an uncomfortable question: why do we keep talking about these systems as if they were people? The answer was decided seventy years ago, at a meeting most people have never heard of, by men arguing over a name.

The Original Sin

In the mid-1950s, as the field was assembling itself, its founders disagreed about what to call it. John McCarthy pushed for "artificial intelligence," a phrase with electricity in it. Herbert Simon and Allen Newell argued for "complex information processing," a phrase with all the electricity of a tax form.

McCarthy won. It may be the most consequential branding decision of the twentieth century.

Because, as the linguist George Lakoff spent a career demonstrating, humans don't reason about new things directly—we reason through metaphors, and the metaphor we choose quietly governs everything downstream. Call the technology complex information processing and the mind reaches for printing presses, libraries, bureaucracies: vast, transformative, cultural technologies that store and circulate what humanity collectively knows. Call it artificial intelligence and the mind reaches for a person. A mind in a box. Something with a name, a personality, a first-person pronoun—and, eventually, rights, intentions, and the capacity to betray you.

Watch the metaphor at work in the wild. Microsoft's CEO, Satya Nadella, defending the training of AI on copyrighted books: "If I read a set of textbooks and I create new knowledge, is that fair use?" The argument only functions if the machine is a reader—a someone. Nobody argues that a library "read" its collection. Watch it again in the millions of people who now treat chatbots as confidants, therapists, romantic partners—roles a system understood to be a bureaucracy could never audition for. Watch it a third time in the regulatory debate, where the question of whether we're governing a power tool or containing a potential rogue agent produces entirely different laws.

Same technology. Different noun. Different civilization.

Which is the real story hiding inside the jagged machines—not what they are, but what we've already decided to pretend they are.

Complex Information Processing

Here’s where the evidence, soberly assembled, actually points.

The machines are real, and they are not nothing. They translate, they draft, they code, they compress the written output of civilization into something you can interrogate at a kitchen table. Some of what they do would have been indistinguishable from sorcery in 2015. The dismissals—stochastic parrot, autocomplete on steroids—have aged as poorly as the prophecies.

But the prophecies have aged worse. The systems remain jagged in ways their own architects call fundamental. They mistake five smaller kiwis for a subtraction problem. They cannot tell you when they're certain and when they're inventing. The benchmarks that fuel the trillion-dollar forecasts measure tasks, and jobs are not made of tasks—they're made of the connective tissue between tasks, the adapting and judging and caring that no leaderboard captures.

Simon and Newell lost the naming war, but they may yet win the argument. A technology that processes the accumulated information of human culture at unprecedented scale, with genuine consequences, on the timescale of the lightning strike rather than the printing press, is a perfectly astonishing thing to have built. It just isn't a person. And every decision we're now making, from what to automate and what to regulate to whom to trust and what to teach our kids, depends on whether we can hold that distinction while the machine smiles at us in the first person and the men who own it tell us the deadline is six to twelve months away.

It's been six to twelve months away before.

Clever Hans, the horse who amazed Berlin in the early 1900s by appearing to do arithmetic, could not count. But he could read every face in the courtyard, including the faces of the scientists who certified him. The machines in the basement may not understand a word they say to us. The open question, seventy years after the name was chosen, is whether we understand a word we say about them.

This article is based on commentary and research by Dr. Melanie Mitchell.