Apple’s study reveals that even the most advanced AIs give up when problems get too hard


We figured it was only appropriate given that it was an Apple study on AI models. This image was AI-generated using Imagen 4, via Google's Gemini.

Ever been in a situation where you roll your eyes and mentally shut off when asked to "provide solutions" that are beyond your pay grade and job scope? If you had, that means AI can replace that part of you as well.

Machine Learning Research at Apple, the company's R&D arm for artificial intelligence and machine learning, recently published the results of an AI study. It looked at four popular AI tools with reasoning capabilities: OpenAI o1/o3, DeepSeek-R1/V3, Claude 3.7 Sonnet Thinking, and Google's Gemini Thinking.

The paper found that current large reasoning models (LRMs) lack genuine problem-solving skills and break down on harder problems, even when the same AI tools can solve simpler versions of those problems. The summarised details are below.

The Illusion of Thinking

Apple said the study was conducted using four puzzle games. These puzzles were chosen because they are fully controllable: the AI tools are forced to show their reasoning, and the puzzles can be scaled up and down in difficulty, with clear, winnable outcomes (see the sketch after this list):

  • Tower of Hanoi
  • Checker Jumping
  • River Crossing
  • Blocks World
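
To see why these puzzles suit this kind of study, here is a minimal sketch (our own, not Apple's harness) using Tower of Hanoi: a single parameter n controls the difficulty, the optimal solution takes exactly 2**n - 1 moves, and any proposed move list can be checked mechanically.

```python
# Illustrative sketch, not code from Apple's paper: Tower of Hanoi
# difficulty scales with one knob (disk count n), and any candidate
# solution can be verified move by move.

def solve_hanoi(n, src=0, dst=2, aux=1):
    """Return the optimal move list for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, src, aux, dst)
            + [(src, dst)]
            + solve_hanoi(n - 1, aux, dst, src))

def is_valid_solution(n, moves):
    """Replay a move list and confirm it wins without illegal moves."""
    pegs = [list(range(n, 0, -1)), [], []]   # disk n at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                     # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # larger disk onto a smaller one
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1))  # everything on the goal peg

for n in range(1, 11):
    moves = solve_hanoi(n)
    assert is_valid_solution(n, moves)
    print(f"n={n:2d} disks -> {len(moves):4d} optimal moves")
```

Because the validator is mechanical, a researcher can grade a model's answer at any difficulty level without human judgement, which is what makes these puzzles "fully controllable".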

Apple's study found that these AI tools excelled at solving simple (low-complexity) versions of the puzzles, using their tokens efficiently and giving accurate answers.

At "medium complexity", the LRMs with built-in reasoning performed better than their LLM counterparts (e.g. DeepSeek-R1 versus DeepSeek-V3, or Claude 3.7 Sonnet Thinking versus the base Claude 3.7 Sonnet).

Oddly enough, as the puzzles approached "high complexity", these reasoning models would reduce their reasoning effort and eventually "collapse", despite still having plenty of compute budget left, measured in tokens.

In high-complexity versions of the puzzles, the AI models would first spend tokens to proceed normally, before suddenly "giving up" once they decided the problem was too hard, even when told the answer. Image: Machine Learning Research at Apple.
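
For a concrete sense of what "reasoning effort" means, the sketch below shows a hypothetical measurement loop written against the OpenAI Python client. The model name and prompt wording are our placeholders, not details from Apple's study; the point is simply that a model's token spend can be recorded as the puzzle grows.

```python
# Hypothetical harness, not Apple's: record how many completion tokens
# a reasoning model spends as Tower of Hanoi grows. On OpenAI's
# reasoning models, completion_tokens includes the hidden reasoning
# tokens, so a drop at high n would mirror the "collapse" Apple saw.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def hanoi_prompt(n):
    return (f"Solve the Tower of Hanoi puzzle with {n} disks. "
            "List every move as 'move disk D from peg X to peg Y'.")

for n in range(3, 13):
    response = client.chat.completions.create(
        model="o3-mini",  # placeholder; any reasoning model with usage stats
        messages=[{"role": "user", "content": hanoi_prompt(n)}],
    )
    spent = response.usage.completion_tokens
    print(f"n={n:2d}: {spent} completion tokens")
```

Apple's counterintuitive result is that this curve rises with difficulty and then falls: past a certain point, the models spend fewer tokens thinking, not more.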

This collapse led Apple's research team to infer that AI models have "...a fundamental scaling limitation in the thinking capabilities of current reasoning models relative to problem complexity" (i.e. a problem being technically solvable doesn't mean the models will actually solve it; they end up spitting out wrong answers instead).

Another noticeable trait was that these LRMs would "overthink". The models would sometimes arrive at the correct answer early in their reasoning, but continued to burn tokens "exploring" incorrect paths.
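
One way to put a number on this behaviour (our own illustration, not a metric from the paper) is to measure how much of a reasoning trace comes after the first correct answer:

```python
# Illustrative metric, not from Apple's paper: the fraction of a
# reasoning trace produced *after* the first correct answer appears.
# A high value means the model "had it" but kept exploring anyway.
def overthinking_fraction(trace, extract_candidates, is_correct):
    """
    trace: the model's full reasoning text
    extract_candidates: yields (char_offset, candidate_solution) pairs
    is_correct: checker for a candidate (e.g. is_valid_solution above)
    """
    for offset, candidate in extract_candidates(trace):
        if is_correct(candidate):
            return 1.0 - offset / len(trace)
    return None  # no correct answer anywhere in the trace
```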

Another oddity is that the AI models would still be unable to solve complex problems even when the answer (the algorithm) was provided to them in the prompt. Despite needing fewer tokens to simply execute the given steps, the models began messing up midway through the instructions.
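
To be clear about what "providing the algorithm" means, here is a paraphrased example of such a prompt; the exact wording Apple used isn't reproduced here:

```python
# Our paraphrase of an "algorithm provided" prompt, not Apple's text.
# Even with the steps spelled out, the study found models derailed
# partway through executing them on large instances.
ALGORITHM_HINT = """
You are given the standard Tower of Hanoi algorithm:
  To move n disks from peg A to peg C using peg B as spare:
    1. Move the top n-1 disks from A to B (using C as spare).
    2. Move disk n from A to C.
    3. Move the n-1 disks from B to C (using A as spare).
Follow these steps exactly and list every individual move.
"""

def prompt_with_algorithm(n):
    return ALGORITHM_HINT + f"\nNow solve the puzzle for n = {n} disks."
```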

Per Apple's claims, these AI models therefore lack "generalisable reasoning capabilities" beyond a certain complexity, which means AI is unlikely to have generalisable intelligence (the ability to critically analyse and process its own output), even if the models can "reason" or have "learnt" from their training data.

The paper also cautions that its puzzle-game approach has limitations. The four games are, obviously, not representative of all AI models or of real-world problems, and the researchers only had black-box API access to the models, without visibility into their internal architecture.

Source: Machine Learning Research at Apple
