IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Quick Answer
ISOSCI, a benchmark of isomorphic cross-domain science problem pairs, shows that 91.3% of reasoning-mode gains in LLMs are knowledge-dependent rather than structure-invariant, challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving.
Quick Take
Of the 69 reasoning-mode gains observed across five model pairs, 63 (91.3%) depended on domain knowledge rather than on problem structure. Notably, the reasoning-specialized model o3-mini outperformed its standard counterpart on GPQA Diamond (+19.2 percentage points) but underperformed on ISOSCI (-24.7 percentage points), indicating that benchmark choice strongly shapes conclusions about reasoning utility.
Key Points
- The ISOSCI benchmark separates reasoning ability from domain knowledge retrieval in LLM evaluation.
- 91.3% of reasoning-mode gains (63 of 69) are knowledge-dependent, not structure-invariant.
- On highly capable models, reasoning toggles provide less than 5 percentage points of accuracy gain across all domains.
- The o3-mini model outperforms its standard counterpart on GPQA Diamond but underperforms it on ISOSCI.
- Benchmark choice critically influences conclusions about LLM reasoning utility.
Article Excerpt
From source RSS / original summary
arXiv:2607.01431v1
Abstract: We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at https://huggingface.co/datasets/isosci/isosci
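The Wilson 95% interval quoted in the abstract for 63 of 69 gains can be checked with a few lines of arithmetic. The sketch below is not the authors' code: the function name `wilson_interval` and its parameters are hypothetical, and it assumes the standard two-sided Wilson score interval without continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; assumes no continuity correction.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    # Critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts taken from the abstract: 63 knowledge-dependent gains out of 69.
low, high = wilson_interval(63, 69)
print(f"{63 / 69:.1%} [{low:.1%}, {high:.1%}]")  # prints "91.3% [82.3%, 96.0%]"
```

Under these assumptions the result matches the reported 91.3% with interval [82.3%, 96.0%]. The Wilson interval is asymmetric around the point estimate, which is why the lower bound sits farther from 91.3% than the upper bound.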