AI models are getting better at elementary school math - but they may be cheating, according to a new study.

Large language models (LLMs), the technology that drives chatbots like ChatGPT, are getting better at answering benchmark questions that measure mathematical reasoning. However, that may actually be a bad thing.

A preprint research paper published Wednesday by researchers at Scale AI details how LLMs post strong scores on math benchmark tests, amid growing concern that dataset contamination is inflating those scores.

Contamination happens when data resembling the benchmark problems leaks into the training data. LLMs may then learn to pass such standardized tests rather than truly understand the mathematical problems they are trying to solve.

This is akin to memorizing the answers to a math quiz rather than learning how to solve the problems. The phenomenon is called overfitting.
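To make the idea concrete, here is a minimal, hypothetical sketch of what a contamination check could look like: testing whether a benchmark question appears verbatim in a training corpus. The function and toy data are our own illustration, not methodology from the paper.

```python
def is_contaminated(benchmark_question: str, training_corpus: list[str]) -> bool:
    """Naive check: does the benchmark question appear verbatim in the training data?"""
    needle = benchmark_question.strip().lower()
    return any(needle in doc.lower() for doc in training_corpus)

# Toy training corpus that happens to contain a leaked benchmark question.
corpus = ["Q: Jim wants to spend 15% of his monthly income on groceries. ..."]
print(is_contaminated("Jim wants to spend 15% of his monthly income on groceries", corpus))  # True
```

Real contamination is rarely this literal; leaked problems are often paraphrased, which is part of why it is hard to detect.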

However, the paper's authors caution against reading too much into this: an overfit model is not necessarily a bad reasoner, it may simply be weaker than the benchmarks suggest.

In the paper, the authors write: "The fact that a model is overfit does not mean that it is a poor reasoner, it simply means that it is not as good as the benchmarks suggest." They found that many of the most overfit models are still capable of reasoning and of solving problems they never encountered in their training set.

To perform these evaluations, they developed their own mathematical benchmark test (GSM1k).

The problems are at an elementary school math level, and a typical GSM1k problem might look like this: Jim wants to spend 15% of his monthly income on groceries. Jim has a monthly income of $2,500. How much money will he have left over? The correct answer is $2,125 (15% of $2,500 is $375, leaving $2,125).
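For illustration only, here is a minimal sketch of the arithmetic behind that sample problem; the function name is our own, not something from the benchmark.

```python
def money_left_over(monthly_income: float, grocery_share: float) -> float:
    """Income remaining after spending a given share of it on groceries."""
    groceries = monthly_income * grocery_share  # 2500 * 0.15 = 375
    return monthly_income - groceries           # 2500 - 375 = 2125

print(money_left_over(2500, 0.15))  # 2125.0
```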

These questions are similar in difficulty to the industry's gold standard test (GSM8k), but different enough to test whether an LLM can handle an unseen mathematical puzzle.

Using the new test, the Scale AI research team evaluated leading open- and closed-source LLMs and reported accuracy drops of up to 13% on GSM1k compared with GSM8k; frontier models such as Gemini, GPT, and Claude showed minimal signs of overfitting.
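As a rough sketch of how such a gap could be measured, the snippet below compares a model's accuracy on the two benchmarks. The model names and scores are made-up placeholders, not results from the paper.

```python
# Hypothetical scores (GSM8k accuracy, GSM1k accuracy); not from the Scale AI paper.
scores = {
    "model_a": (0.80, 0.67),  # large gap: likely overfit to GSM8k
    "model_b": (0.75, 0.74),  # small gap: little sign of overfitting
}

for model, (gsm8k, gsm1k) in scores.items():
    gap = gsm8k - gsm1k  # a positive gap suggests overfitting to the public benchmark
    print(f"{model}: GSM8k={gsm8k:.0%}, GSM1k={gsm1k:.0%}, gap={gap:+.0%}")
```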

This "problem" may be resolved over time, as the authors predict that by 2025 elementary school math will likely not be challenging enough to benchmark new LLMs. Nevertheless, they state that improving the reasoning power of LLMs is "one of the most important directions of current research."

Jim Fan, a senior research scientist at NVIDIA, said on X that academic benchmarks are losing their potency.

He said three types of LLM evaluation will become important in the future: private test sets from trusted third parties such as Scale AI, public comparative benchmarks such as Chatbot Arena where models can be tested side by side, and private, curated benchmarks for each company's unique use cases.
