In the past few days, Apple's provocatively titled paper, The Illusion of Thinking, has sparked fresh debate in AI circles. The claim is stark: today's language models don't really "reason". Instead, they simulate the appearance of reasoning until complexity exposes the cracks in their logic. Not surprisingly, the paper has triggered a rebuttal – entitled The Illusion of the Illusion of Thinking, credited to "C. Opus", a nod to Anthropic's Claude Opus model, and
Alex Lawsen, who initially published the commentary on the arXiv distribution service as a joke, apparently. The joke got out of hand and the response has been widely circulated. Joke or not, does the LLM actually debunk Apple's thesis? Not quite.
What Apple shows

The Apple team set out to probe whether AI models can truly reason – or whether they are simply mimicking problem-solving based on memorized examples. To do this, the team designed tasks where complexity could be scaled in controlled increments: more disks in the Tower of Hanoi, more checkers in Checker Jumping, more characters in River Crossing, more blocks in Blocks World.
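To see how quickly those controlled increments bite, consider the Tower of Hanoi: the shortest correct solution for n disks is always 2^n − 1 moves, so each added disk roughly doubles the length of a flawless answer. A minimal Python sketch, purely illustrative and not code from either paper:

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the shortest move sequence for n disks (always 2**n - 1 moves)."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)    # park n-1 disks on the spare peg
            + [(source, target)]                         # move the largest disk
            + hanoi_moves(n - 1, spare, target, source)) # restack the n-1 disks on top

for n in (3, 7, 10, 15):
    print(f"{n} disks: {len(hanoi_moves(n))} moves")  # 7, 127, 1023, 32767
```

The rules never change; only the length of the required move sequence grows, which is what makes the puzzle such a clean dial for this kind of experiment.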
The premise is simple: if a model has mastered reasoning in simpler cases, it should be able to extend those same principles to more complex ones – especially when ample compute and context length remain available. But that is not what happens. The Apple paper finds that even when operating well within their token budgets and inference capabilities, models don't rise to the challenge.
Instead, they generate shorter, less structured outputs as complexity increases. This suggests a kind of "giving up", not a struggle against hard constraints. Even more telling, the paper finds that models often reduce their reasoning effort just when more effort is required. As further evidence, Apple references 2024 and 2025 benchmark questions from the American Invitational Mathematics Examination (AIME), a prestigious US mathematics competition for top-performing high-school students.
While human performance improves year-on-year, model scores decline on the newer, unseen 2025 batch – supporting the idea that AI success is still heavily reliant on memorized patterns, not flexible problem-solving.
Where Claude fails
The counterargument hinges on the idea that language models truncate responses not because they fail to reason, but because they "know" the output is becoming too long. One cited example shows a model halting mid-solution with a self-aware remark: "The pattern continues, but to avoid making this too long, I'll stop here."
This is presented as evidence that models understand the task but choose brevity.
But it is anecdotal at best – drawn from a single social media post – and it makes a big inferential leap. Even the engineer who originally posted the example does not fully endorse the rebuttal's conclusion. They point out that higher generation randomness ("temperature") leads to accumulated errors, especially on longer sequences, so stopping early may not indicate understanding but entropy avoidance.
The rebuttal also invokes a probabilistic framing: that every move in a solution is like a coin flip, and eventually even a small per-token error rate will derail a long sequence. But reasoning is not just probabilistic generation; it is pattern recognition and abstraction. Once a model identifies a solution structure, later steps should not be independent guesses – they should be deduced. The rebuttal does not account for this.
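The arithmetic behind that coin-flip framing is easy to reproduce. A minimal sketch, assuming a uniform, independent per-step error rate p (the rebuttal's simplification for illustration, not a claim about how LLMs actually behave):

```python
def chain_success(p_error, n_steps):
    """Probability of getting n independent steps right in a row: (1 - p)^n."""
    return (1.0 - p_error) ** n_steps

# With a 1% per-move error rate, a 7-disk Hanoi solution (127 moves) still
# succeeds roughly 28% of the time; a 10-disk one (1023 moves) almost never does.
for n_steps in (127, 1023):
    print(f"{n_steps} steps: {chain_success(0.01, n_steps):.5f}")
```

The objection above is precisely to the independence assumption: once a solver has recognized the recursive structure of the solution, later moves are constrained by earlier ones rather than drawn afresh, so errors need not compound this way.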
But the real miss for the rebuttal is its argument that models can succeed if prompted to generate code. This misses the whole point. Apple's goal was not to test whether models could retrieve canned algorithms; it was to evaluate their ability to reason through the structure of the problem on their own. If a model solves a problem merely by recognizing that it should call or generate a particular tool or piece of code, then it is not really reasoning – it is just recalling a solution or a pattern.
In other words, if an AI model sees the Tower of Hanoi puzzle and responds by outputting Lua code it has 'seen' before, it is simply matching the problem to a known template and retrieving the corresponding tool. It is not 'thinking' through the problem; it is just sophisticated library search.
Where this leaves us
To be clear, the Apple paper is not bulletproof. Its treatment of the River Crossing puzzle is a weak point. Once enough people are added to the puzzle, the problem becomes unsolvable. And yet Apple's benchmark marks a "no solution" response as wrong. That is an error. But the thing is, the model's performance has already collapsed before the problem becomes unsolvable – which suggests the drop-off happens not at the edge of reason, but long before it.
In conclusion, the rebuttal, whether AI-assisted or AI-generated, raises important questions, particularly around evaluation methods and model self-awareness. But it rests more on anecdote and hypothetical framing than on rigorous counter-evidence. Apple's original claim – that current models simulate reasoning without scaling it – remains largely intact. And it is not actually new; data scientists have been saying this for a long time.
But it always helps, of course, when big companies like Apple back the prevailing science. Apple's paper may sound confrontational at times – in the title alone. But its analysis is thoughtful and well-supported. What it reveals is a truth the AI community must grapple with: reasoning is more than token generation, and without deeper architectural shifts, today's models may remain trapped in this illusion of thinking.
Maria Sukhareva has been working in the field of AI for 15 years – in AI model training and product management. She is a principal key expert in AI at Siemens. The views expressed above are hers, and not her employer's. Her Substack blog page is here; her website is here.