Research

Apple exposes major AI weakness

Apple has used a new benchmark test to establish that major AI models are not capable of genuine logical reasoning

Martin Crowley
October 15, 2024

The common benchmark test GSM8K, which most firms run their large language models (LLMs) through to establish how capable they are of solving real-world problems with logical reasoning, is fundamentally flawed.

This is because, due to the popularity of the test, these LLMs have likely been trained on its answers, meaning their responses aren’t based on genuine ‘intelligence’. The concern is that LLMs rely on sophisticated pattern matching rather than genuine logical reasoning.

To resolve this, Apple researchers created a new benchmark test—GSM-Symbolic—which tests an LLM’s ability to handle reasoning tasks when variables within the questions are altered.

For example, if irrelevant details are added to a question, or the names or numbers within it are changed, will the LLM use genuine logical reasoning to correctly interpret and answer it, or, because it can only answer questions using pattern recognition, will it fail to answer correctly?
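To make the idea concrete, here is a minimal sketch in Python (illustrative only, not Apple’s actual GSM-Symbolic code) of how a single GSM8K-style question could be turned into a template: the name and numbers are resampled, an irrelevant clause can be appended, and the ground-truth answer is recomputed from the variables, so a model that genuinely reasons should still answer correctly.

```python
# Illustrative sketch only: shows the perturbation idea, not Apple's implementation.
import random

TEMPLATE = (
    "{name} picks {fri} kiwis on Friday. Then he picks {sat} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday{distractor}. "
    "How many kiwis does {name} have?"
)

NAMES = ["Oliver", "Liam", "Noah", "Ethan"]  # hypothetical name pool
DISTRACTOR = ", but five of them were a bit smaller than average"  # irrelevant clause


def make_variant(add_distractor: bool = False) -> tuple[str, int]:
    """Generate one perturbed question and its ground-truth answer."""
    fri = random.randint(20, 60)
    sat = random.randint(20, 60)
    question = TEMPLATE.format(
        name=random.choice(NAMES),
        fri=fri,
        sat=sat,
        distractor=DISTRACTOR if add_distractor else "",
    )
    # The distractor never changes the arithmetic: Friday + Saturday + 2 * Friday.
    answer = fri + sat + 2 * fri
    return question, answer


if __name__ == "__main__":
    q, a = make_variant(add_distractor=True)
    print(q)
    print("Expected answer:", a)
```

Scoring many such variants against a model’s outputs is what reveals whether accuracy holds up once the surface details no longer match anything seen in training.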

Apple ran the new test on 20 popular LLMs—including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3—and found that performance dropped by a few percentage points for all of them when the variables were changed.

For example, with this question: 

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

They added the clause “five of them were a bit smaller than average” as irrelevant information that shouldn’t change the final answer: Oliver still has 44 + 58 + (2 × 44) = 190 kiwis, regardless of their size.

So how did the LLMs fare?

All of them dropped in performance. OpenAI's o1-preview, though the best of the bunch, dropped by 17.5% in accuracy, and Microsoft's Phi-3 model dropped by 65%.

“We found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%,” the researchers wrote.

This is concerning, as it demonstrates that current AI models don’t use genuine logical reasoning to solve problems; they use pattern recognition, which means we can’t rely on them for consistent, accurate answers in real-life situations.