Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

5/27/2026


Quick Answer

This paper finds that letting a Large Language Model (LLM), here Claude Haiku 4.5, write and run code did not make it more robust to variations in grade-school math problems than chain-of-thought reasoning.

Quick Take

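Large Language Models (LLMs) lose some accuracy when math problems are changed in simple ways, such as different names or numbers. In a study of 1,000 GSM-Symbolic problems run on Claude Haiku 4.5, chain-of-thought (CoT) prompting was the most robust method, with an accuracy drop of 1.3 percentage points between original and modified problems, while Program-Aided Language models (PAL) dropped 1.7 percentage points. The differences were not statistically significant, but the consistent trend suggests that code execution does not improve reasoning robustness.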

Key Points

CoT prompting was the most robust method, with an accuracy drop of 1.3 percentage points.
PAL was the least robust method, with an accuracy drop of 1.7 percentage points.
SBSC fell between CoT and PAL in robustness.
All methods were tested on paired original and modified math problems.
The differences between methods were not statistically significant ($p = .096$).

Paper Resources

Read Paper (arxiv.org) | View PDF (arxiv.org)

Article Content

From source RSS / original summary

arXiv:2605.26414v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers.
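To make "simple changes like different names or numbers" concrete, here is a minimal sketch of how a templated problem can produce variants with known answers. The template, the name list, the number ranges, and `make_variant` are all hypothetical illustrations and are not taken from the GSM-Symbolic dataset.

```python
import random

# Hypothetical template: the name and the two numbers are fillable slots.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Fill the template with a random name and numbers; return (question, answer)."""
    name = rng.choice(["Ava", "Liam", "Noor", "Kenji"])  # assumed name pool
    a = rng.randint(2, 50)  # assumed number range
    b = rng.randint(2, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    answer = a + b  # ground truth follows from the template's arithmetic
    return question, answer

rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [make_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

Each variant needs the same reasoning, so a model that solves one but not another is reacting to surface details rather than to the structure of the problem.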

Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested.
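As a rough illustration of what "generate and run Python code" means, the sketch below executes a program string and reads back its result. The program text stands in for model output and is written by hand here. The `solution` convention and the `run_generated` helper are assumptions for this example, not the setup used in the paper.

```python
# Hypothetical stand-in for what a model might emit for the question
# "Ava has 12 apples and buys 7 more. How many apples does Ava have now?"
generated_program = '''
def solution():
    apples_start = 12
    apples_bought = 7
    return apples_start + apples_bought
'''

def run_generated(program: str):
    """Execute a generated program and return the value of its solution() function."""
    namespace = {}
    # exec runs arbitrary code; a real harness should isolate this in a sandbox.
    exec(program, namespace)
    return namespace["solution"]()

answer = run_generated(generated_program)
print(answer)
```

The interpreter removes arithmetic slips, but the model still has to translate the wording into the right program, which is where a changed name or number could still cause an error.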

This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation.
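The two robustness figures quoted here can be computed from paired right/wrong outcomes, as in the sketch below. It assumes that a problem "breaks" when it is answered correctly in its original form and incorrectly in its modified form, and that the break rate is taken over all problems. The abstract does not spell out either definition, and the data is a toy example.

```python
def robustness_metrics(orig_correct: list[bool], mod_correct: list[bool]) -> dict[str, float]:
    """Return accuracy drop (percentage points) and break rate (% of problems)."""
    if not orig_correct or len(orig_correct) != len(mod_correct):
        raise ValueError("need two non-empty lists of equal length")
    n = len(orig_correct)
    acc_orig = sum(orig_correct) / n
    acc_mod = sum(mod_correct) / n
    # Assumed definition: broke = correct on the original, wrong on the modified version.
    broke = sum(1 for o, m in zip(orig_correct, mod_correct) if o and not m)
    return {
        "accuracy_drop_pp": 100 * (acc_orig - acc_mod),
        "break_rate_pct": 100 * broke / n,
    }

# Toy paired outcomes for four problems (not data from the paper).
orig = [True, True, True, False]
mod = [True, False, True, True]
metrics = robustness_metrics(orig, mod)
print(metrics)
```

In the toy data one problem breaks and another is newly solved, so the break rate is 25% while the net accuracy drop is zero. Under these assumed definitions, the same offsetting effect would explain how a break rate can exceed the accuracy drop.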

PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
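The abstract does not say which test produced the reported p-value. As one common option for paired right/wrong outcomes, the sketch below computes a two-sided exact McNemar test from the counts of discordant pairs. The function name and the counts are illustrative assumptions, not the paper's analysis or data.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs correct in condition A but wrong in condition B
    c: pairs wrong in condition A but correct in condition B
    Under the null hypothesis, each discordant pair falls either way with probability 1/2.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 1/2) probability of a count at most as large as the smaller one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Toy counts: 9 pairs flipped from right to wrong, 2 flipped from wrong to right.
p_value = mcnemar_exact(9, 2)
print(round(p_value, 4))
```

Only the discordant pairs carry information in this test, which is why small differences in accuracy can fail to reach significance even on 1,000 problems.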

Read on arxiv.org


