IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

arXiv cs.CL·Samir Abdaljalil, Erchin Serpedin, Hasan Kurban

3h ago

~1 min read · 7/3/2026 · en

Quick Answer

The ISOSCI benchmark shows that 91.3% of reasoning-mode gains in LLMs are knowledge-dependent rather than structure-invariant, challenging the assumption that chain-of-thought reasoning improves short-horizon scientific problem-solving.

Quick Take

ISOSCI pairs science problems that share the same logical structure but require different domain knowledge, so a reasoning-mode gain can be attributed either to reasoning or to knowledge. Of 69 observed gains, 63 (91.3%) are knowledge-dependent. Notably, the reasoning-specialized model o3-mini outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) but underperforms on ISOSCI (-24.7 percentage points), indicating that benchmark choice strongly influences conclusions about reasoning utility.

Key Points

ISOSCI separates reasoning ability from domain knowledge retrieval in LLM evaluation.
91.3% of reasoning-mode gains (63/69) are knowledge-dependent, not structure-invariant.
Reasoning toggles on highly capable models provide less than 5 percentage points of accuracy gain across all domains.
The o3-mini model outperforms its standard counterpart on GPQA Diamond but underperforms on ISOSCI.
Benchmark choice critically influences conclusions about LLM reasoning utility.
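To make the distinction between knowledge-dependent and structure-invariant gains concrete, here is a minimal sketch of how gains on isomorphic pairs might be tallied. The classification rule is an assumption for illustration only, not the paper's stated procedure: a gain is a problem that flips from wrong to right when reasoning mode is turned on, and it is counted as structure-invariant only if its isomorphic twin gains as well. The `PairResult` type, the `classify_gains` function, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Correctness of both problems in one isomorphic pair (hypothetical schema)."""
    a_off: bool  # problem A, reasoning mode off
    a_on: bool   # problem A, reasoning mode on
    b_off: bool  # problem B (same structure, other domain), reasoning off
    b_on: bool   # problem B, reasoning on


def classify_gains(results: list[PairResult]) -> dict[str, int]:
    """Count gains per category under the ASSUMED rule described above."""
    counts = {"structure_invariant": 0, "knowledge_dependent": 0}
    for r in results:
        gain_a = (not r.a_off) and r.a_on  # wrong -> right on A
        gain_b = (not r.b_off) and r.b_on  # wrong -> right on B
        if gain_a and gain_b:
            # Both twins improve: the gain transfers across domains.
            counts["structure_invariant"] += 2
        elif gain_a or gain_b:
            # Only one twin improves: the gain is tied to one domain.
            counts["knowledge_dependent"] += 1
    return counts


# Toy data, invented for illustration (not results from the paper).
toy = [
    PairResult(False, True, False, True),   # both gain
    PairResult(False, True, False, False),  # only A gains
    PairResult(True, True, False, True),    # only B gains
    PairResult(True, True, True, True),     # no gain
]
counts = classify_gains(toy)
total = sum(counts.values())
print(counts, f"knowledge-dependent share: {counts['knowledge_dependent'] / total:.1%}")
```

Under this assumed rule, the share reported in the key points would correspond to the knowledge-dependent count divided by the total number of gains.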


Article Excerpt

From source RSS / original summary

arXiv:2607.01431v1 Announce Type: new

Abstract: We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at https://huggingface.co/datasets/isosci/isosci
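The Wilson 95% interval quoted in the abstract can be checked from the 63/69 count alone. The sketch below uses the standard Wilson score formula with z = 1.96. The function name `wilson_interval` is chosen here for illustration, and this is not necessarily the code the authors used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 63 knowledge-dependent gains out of 69 total gains, as reported in the abstract.
low, high = wilson_interval(63, 69)
print(f"{63 / 69:.1%} [{low:.1%}, {high:.1%}]")  # 91.3% [82.3%, 96.0%]
```

The Wilson interval is a common choice over the simpler normal approximation when the proportion is near 0 or 1 and the sample is small, as it is here.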


More from arXiv cs.CL


arXiv cs.CL·Barak Or

1w ago


Quantifying Prior Dominance in RAG Systems

AI Summary

The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.

#LLM #AI Coding #Inference #AI Startup


Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation
