
Apple research finds AI models' “reasoning” is not actually reasoning

Compared with its big-tech competitors, Apple has been cautiously slow in its AI development. Justin Sullivan/Getty Images

Just as the hype around Artificial General Intelligence (AGI) was reaching a fever pitch, Apple delivered a striking reality check to the industry. In a research paper titled “The Illusion of Thinking,” published June 6, the company reported that today's most advanced AI models, billed as capable of “human-level reasoning,” struggle with complex logical problems. Rather than genuinely thinking like humans, these models rely on pattern recognition: drawing on familiar cues from their training data and predicting the next step. When faced with an unfamiliar or challenging task, they either produce a weak response or fail outright.

In a controlled study, Apple researchers tested large language models (LLMs) such as Anthropic's Claude 3.7 Sonnet and DeepSeek-V3, along with their “reasoning-optimized” counterparts (Claude 3.7 Sonnet with extended thinking and DeepSeek-R1). The team used classic logic puzzles such as the Tower of Hanoi and the river-crossing problem, established benchmarks for an AI's algorithmic, planning and reasoning capabilities. The Tower of Hanoi tests recursive, step-by-step problem solving, while the river-crossing puzzle evaluates the ability to plan and execute multi-step solutions.
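For context, the Tower of Hanoi is normally solved with a short recursive procedure, and the minimum number of moves grows exponentially with the number of disks, which is why longer instances become hard to track move by move. The sketch below shows that standard recursive solution in Swift; it is an illustration of the puzzle itself, not Apple's evaluation code.

```swift
// Standard recursive Tower of Hanoi solver (illustrative, not from the paper).
// Moving n disks takes 2^n - 1 moves, so the step count grows exponentially.
func hanoi(_ n: Int, from: String, to: String, via: String, moves: inout [String]) {
    guard n > 0 else { return }
    hanoi(n - 1, from: from, to: via, via: to, moves: &moves)   // park n-1 disks on the spare peg
    moves.append("move disk \(n) from \(from) to \(to)")        // move the largest disk
    hanoi(n - 1, from: via, to: to, via: from, moves: &moves)   // bring the n-1 disks back on top
}

var moves: [String] = []
hanoi(3, from: "A", to: "C", via: "B", moves: &moves)
print("\(moves.count) moves for 3 disks")   // 7 moves, i.e. 2^3 - 1
moves.forEach { print($0) }
```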

Apple researchers divided the puzzles into three difficulty levels: low (3 steps), medium (4–10 steps) and high (11–20 steps). While most models handled the simpler tasks with reasonable success, their performance declined sharply as the puzzles grew more complex, regardless of model size, training method or computing power. Even when given the correct algorithm, or when allowed to use up to 64,000 tokens (a huge compute budget), the models produced only shallow responses; explicit access to the solution algorithm did not improve performance.

Based on these results, the Apple researchers argue that what we commonly call “reasoning” may in fact be sophisticated pattern matching. They describe the phenomenon as a “counter-intuitive scaling limit”: despite ample computing resources, the models' reasoning effort becomes less effective as complexity increases.

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final-answer accuracy,” Apple wrote in a blog post. “However, this paradigm often suffers from data contamination and cannot provide insights into the structure and quality of reasoning traces. Our setup allows us to analyze not only the final answers but also the internal reasoning traces, offering a deeper look into how large reasoning models (LRMs) think.”

The study brings much-needed rigor to a field often dominated by marketing hype, especially as technology giants tout AGI. It may also explain Apple's more cautious approach to AI development.

Apple reports its AI progress at WWDC

The research paper was released just days before Apple's annual WWDC developer conference, which began today. During the opening keynote, Apple executives introduced the Foundation Models framework, which will let developers integrate Apple's on-device AI models into their applications, enabling features such as image generation, text creation and natural-language search.
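As an illustration, prompting the on-device model from an app might look something like the sketch below. It uses the type and method names Apple announced for the framework (LanguageModelSession and respond(to:)), but it should be read as a rough sketch rather than final, documented usage.

```swift
import FoundationModels

// Minimal sketch of prompting Apple's on-device model through the
// Foundation Models framework announced at WWDC. Names such as
// LanguageModelSession and respond(to:) follow Apple's announcement
// but are illustrative here; check the shipping SDK documentation.
@main
struct OnDeviceModelDemo {
    static func main() async throws {
        let session = LanguageModelSession()
        let response = try await session.respond(
            to: "Suggest a short caption for a photo of the Golden Gate Bridge."
        )
        print(response.content)  // Generated text returned by the on-device model
    }
}
```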

Apple also announced a major update to its developer toolkit, Xcode 26, which now includes built-in support for connecting AI models such as ChatGPT and Claude via API keys. The update lets developers use these models to write code, generate tests and documentation, and handle debugging tasks. Together, the announcements mark an important step in Apple's AI strategy: enabling developers to build intelligent applications without relying on cloud infrastructure.
