
Apple study reveals AI's struggle with complex reasoning
What's the story
Ahead of the much-anticipated Worldwide Developers Conference (WWDC), Apple has published a study titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity."
The research tested several 'reasoning' artificial intelligence (AI) models, including Anthropic's Claude, OpenAI's models, DeepSeek R1, and Google's Thinking models. The goal was to see how well these systems could replicate human reasoning in complex problem-solving scenarios.
Evaluation critique
The study's focus and methodology
The study takes issue with the standard practice of evaluating Large Reasoning Models (LRMs) using established mathematical and coding benchmarks.
It argues that these benchmarks suffer from data contamination and offer little insight into the structure and quality of the models' reasoning traces.
Instead, it proposes a controlled experimental testbed using algorithmic puzzle environments as a more effective way to assess AI reasoning capabilities.
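To make the idea concrete, here is a minimal sketch of what such a puzzle environment could look like, using Tower of Hanoi, one of the puzzles featured in the study, where a single integer (the number of disks) controls problem complexity and a verifier checks any proposed move sequence. The function names and structure are illustrative, not taken from the paper.

```python
# Minimal sketch of a controllable puzzle testbed in the spirit of the study:
# Tower of Hanoi, where the disk count sets the complexity and a verifier
# checks a model-proposed move sequence. Illustrative only, not the paper's code.

def initial_state(n_disks):
    """Three pegs; all disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def apply_move(state, move):
    """Apply (src, dst) in place; return False if the move is illegal."""
    src, dst = move
    if not state[src]:
        return False
    disk = state[src][-1]
    if state[dst] and state[dst][-1] < disk:
        return False  # cannot place a larger disk on a smaller one
    state[dst].append(state[src].pop())
    return True

def verify_solution(n_disks, moves):
    """Check a proposed move list; scoring is all-or-nothing."""
    state = initial_state(n_disks)
    for move in moves:
        if not apply_move(state, move):
            return False
    return len(state[2]) == n_disks  # solved when every disk is on the last peg

def optimal_moves(n, src=0, aux=1, dst=2):
    """Reference solution: 2**n - 1 moves, so difficulty grows exponentially
    with a single integer parameter."""
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n - 1, aux, src, dst))

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, verify_solution(n, optimal_moves(n)))
```

Because every instance can be generated and checked programmatically, this kind of environment avoids the contamination problem and lets complexity be dialed up one notch at a time.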
Research results
AI models struggle with complex problem-solving scenarios
The study found that state-of-the-art LRMs like o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet-Thinking still struggle to develop generalizable problem-solving capabilities.
Their accuracy collapses to zero beyond certain complexity thresholds, across the different puzzle environments.
This stark finding highlights a major limitation of current AI systems: the inability to handle complex problems consistently.
Future prospects
Need to evolve AI evaluation methods, says study
The research highlights the need to evolve AI evaluation methods and understand the fundamental benefits and limitations of LRMs.
It raises critical questions about these models' capabilities for generalizable reasoning and their performance scaling with increasing problem complexity.
These findings could inform Apple's future AI strategy, which may focus on specific use cases where current AI methods are not yet reliable.
Scaling insights
Models actually reduce their reasoning effort as complexity increases
The study also found a counterintuitive scaling limit in the models' reasoning effort, measured by the number of inference tokens spent during the "thinking" phase.
Reasoning effort initially grows as problems get harder, but near the point where accuracy collapses, the models actually start spending fewer tokens on reasoning.
This observation provides further insight into the limitations of current AI systems in handling complex problems.
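As a rough illustration of how such a measurement could be set up, the sketch below tallies a model's "thinking" tokens at each complexity level and finds where effort peaks before declining. The query_model callable is a placeholder assumption, standing in for whichever API returns an answer along with its reasoning-token count; it is not taken from the study.

```python
# Sketch of reproducing the scaling observation: record thinking-token usage
# per complexity level and locate the point where effort stops rising.
# query_model is an assumed stand-in returning (answer, thinking_token_count).

from statistics import mean

def effort_curve(query_model, prompts_by_complexity, trials=5):
    """Return {complexity: mean thinking-token count} over several trials."""
    curve = {}
    for complexity, prompt in sorted(prompts_by_complexity.items()):
        counts = []
        for _ in range(trials):
            _answer, thinking_tokens = query_model(prompt)
            counts.append(thinking_tokens)
        curve[complexity] = mean(counts)
    return curve

def effort_peak(curve):
    """Complexity level after which mean reasoning effort starts to decline."""
    return max(curve, key=curve.get)
```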