Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
Quick Take
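Large Language Models (LLMs) lose accuracy on math problems when details such as names or numbers change, and letting the model run code does not appear to fix this. In a study of 1,000 GSM-Symbolic problems run on Claude Haiku 4.5, chain-of-thought (CoT) prompting was the most robust method, with an accuracy drop of 1.3 percentage points, while Program-Aided Language models (PAL) dropped 1.7 percentage points. The differences were not statistically significant (p = .096), but the consistent trend suggests that code execution does not improve reasoning robustness.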
Key Points
- CoT prompting was the most robust method, with a 1.3 percentage-point accuracy drop and 1.8% of problems breaking under perturbation.
- PAL was the least robust, with a 1.7 percentage-point accuracy drop and 3.1% of problems breaking.
- SBSC fell between CoT and PAL in robustness.
- All three methods were tested on paired original and modified math problems.
- The differences between methods were not statistically significant (p = .096).
Article Content
From source RSS / original summary: arXiv:2605.26414v1. Abstract: Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers.
Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested.
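As a rough illustration of what "generate and run Python code" means in practice, the sketch below executes a short program of the kind a model might write for a word problem and reads back its result. Everything here is an assumption made for illustration: the program text is hard-coded rather than produced by a model, the word problem is invented, and `GENERATED_PROGRAM` and `run_generated_program` are hypothetical names, not part of PAL, SBSC, or the study's code.

```python
# Hypothetical stand-in for the text a model might return under a
# code-execution prompt. No model is called; the string is hard-coded.
# Invented problem: 4 baskets of 12 apples each, 5 apples eaten.
GENERATED_PROGRAM = """
apples_per_basket = 12
baskets = 4
eaten = 5
answer = apples_per_basket * baskets - eaten
"""


def run_generated_program(source: str) -> object:
    """Execute generated code and return the value it binds to `answer`."""
    # Empty builtins keeps this toy example to plain arithmetic.
    # This is NOT a security sandbox; real systems need proper isolation.
    namespace: dict[str, object] = {"__builtins__": {}}
    exec(source, namespace)
    if "answer" not in namespace:
        raise KeyError("generated program did not define `answer`")
    return namespace["answer"]


result = run_generated_program(GENERATED_PROGRAM)
print(result)  # prints 43
```

In this setup the interpreter, not the model, performs the arithmetic. The model still has to translate the word problem into the right variables and operations, which is the step that changed names or numbers could disturb.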
This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation.
PAL was the least robust, with an accuracy drop of 1.7 percentage points and 3.1% of problems breaking, and SBSC fell in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
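The two robustness measures quoted in the abstract can be computed from paired outcomes. The sketch below assumes that a problem "breaks" when the original is answered correctly and the modified version is not, and that the break rate is taken over all problems; the abstract does not spell out either definition. The names `PairedResult` and `robustness_metrics` and the ten toy outcomes are hypothetical and are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome for one problem on its original and modified versions."""
    original_correct: bool
    modified_correct: bool


def robustness_metrics(results: list[PairedResult]) -> dict[str, float]:
    """Return accuracies, accuracy drop (percentage points), and break rate (%)."""
    if not results:
        raise ValueError("need at least one paired result")
    n = len(results)
    n_original = sum(r.original_correct for r in results)
    n_modified = sum(r.modified_correct for r in results)
    # Assumed definition: solved on the original, failed on the variant.
    n_broke = sum(r.original_correct and not r.modified_correct for r in results)
    return {
        "accuracy_original_pct": 100 * n_original / n,
        "accuracy_modified_pct": 100 * n_modified / n,
        "accuracy_drop_pp": 100 * (n_original - n_modified) / n,
        "break_rate_pct": 100 * n_broke / n,
    }


# Toy data: 7 solved both times, 2 broke, 1 was solved only after modification.
toy = (
    [PairedResult(True, True)] * 7
    + [PairedResult(True, False)] * 2
    + [PairedResult(False, True)] * 1
)
metrics = robustness_metrics(toy)
print(metrics["accuracy_drop_pp"], metrics["break_rate_pct"])  # prints 10.0 20.0
```

Under these assumed definitions, the break rate can exceed the accuracy drop because problems solved only in their modified form offset part of the loss. That is one way the reported figures could fit together (1.8% breaking against a 1.3 point drop for CoT).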