LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

arXiv cs.AI·Shradha Agarwal, Deepak Rajbhar, Tariq J

Quick Take

LinAlg-Bench evaluates LLMs on linear algebra computation and shows that their failures are structured, depending on algorithm type and matrix dimension.

Key Points

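Tests 10 frontier LLMs on 660 SymPy-certified problems across 9 task types (6,600 model outputs).
Classifies 1,156 failures into ten primary error tags using a three-stage automated forensic pipeline.
Finds a sharp behavioral threshold at 4x4 matrix size: execution errors below it, computational abandonment above it.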

[Submitted on 15 May 2026]

Abstract: We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors (sign tracking failures, arithmetic drift, and parity errors); above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.
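To make two ideas from the abstract concrete, exact ground-truth certification and a sign tracking failure, here is a minimal sketch. It is not the authors' pipeline: the paper certifies problems with SymPy, while this example substitutes Python's standard-library `fractions` module for exact arithmetic so that it has no dependencies. The function names (`det_exact`, `cofactor_det`, `is_correct`) and the sample matrix are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def det_exact(rows):
    """Reference determinant via Gaussian elimination in exact fractions.

    Stand-in for a SymPy-style certified ground truth (assumption).
    """
    m = [[Fraction(x) for x in row] for row in rows]
    n, det = len(m), Fraction(1)
    for c in range(n):
        # Find a row at or below the diagonal with a nonzero entry in column c.
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det  # each row swap flips the sign
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            m[r] = [a - factor * b for a, b in zip(m[r], m[c])]
    return det


def cofactor_det(rows, drop_signs=False):
    """Laplace expansion along the first row.

    drop_signs=True simulates a sign tracking failure by treating every
    cofactor sign as +1 (a hypothetical error model, not the paper's).
    """
    n = len(rows)
    if n == 1:
        return rows[0][0]
    total = 0
    for j in range(n):
        minor = [r[:j] + r[j + 1:] for r in rows[1:]]
        sign = 1 if (drop_signs or j % 2 == 0) else -1
        total += sign * rows[0][j] * cofactor_det(minor, drop_signs)
    return total


def is_correct(rows, answer):
    """Grade a claimed determinant against the exact reference value."""
    return Fraction(answer) == det_exact(rows)


A = [[2, -1, 0], [1, 3, 4], [0, 5, -2]]  # hypothetical 3x3 problem
print(det_exact(A), cofactor_det(A), cofactor_det(A, drop_signs=True))
print(is_correct(A, "-54"), is_correct(A, cofactor_det(A, drop_signs=True)))
```

Comparing the correct expansion with the sign-dropped one shows why such errors are detectable only against an exact reference: the faulty run still produces a plausible-looking integer.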

Comments:	42 pages, 3 figures, 12 tables. NeurIPS 2026 Evaluations and Datasets Track submission. Dataset: this https URL
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16675 [cs.AI]
	(or arXiv:2605.16675v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.16675 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Shradha Agarwal
[v1] Fri, 15 May 2026 22:30:57 UTC (1,331 KB)

— Originally published at arxiv.org
