SCI-PRM: A Tool-Aware Process Reward Model for Scientific Reasoning Verification
Quick Take
Sci-PRM is a process reward model trained on SCIPRM70K, a new dataset of Chain-of-Tool trajectories that interleave reasoning with the execution of scientific tools. It improves foundation models through Best-of-N test-time scaling and serves as a dense reward signal in Reinforcement Learning, targeting the hallucinations and missing verification that limit current models on scientific problems.
Key Points
- Introduces SCIPRM70K, a dataset with Chain-of-Tool trajectories for scientific reasoning.
- Sci-PRM scores tool selection, execution accuracy, and result interpretation at each step in a single inference pass.
- Enables Best-of-N selection for effective test-time scaling in foundation models.
- Provides dense reward signals in Reinforcement Learning to mitigate advantage disappearance.
- Lets the trained model break through existing performance ceilings in scientific reasoning tasks.
Article Content
From source RSS / original summary
arXiv:2606.04579v1 Announce Type: new
Abstract: While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains, such as biology, chemistry, and physics, remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and lack of verification.
In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in one inference.
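To make the idea of step-level supervision concrete, here is a minimal Python sketch of what per-step scores along the three axes named above could look like. The abstract does not describe Sci-PRM's actual output format, so the `StepScore` record, the min-based `overall` rule, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepScore:
    """Hypothetical per-step scores in [0, 1]; not the paper's actual schema."""
    tool_selection: float
    execution_accuracy: float
    result_interpretation: float

    def overall(self) -> float:
        # Assumed rule: a step is only as reliable as its weakest aspect.
        return min(
            self.tool_selection,
            self.execution_accuracy,
            self.result_interpretation,
        )


def first_faulty_step(steps: List[StepScore], threshold: float = 0.5) -> int:
    """Return the index of the first step scoring below threshold, or -1."""
    for index, step in enumerate(steps):
        if step.overall() < threshold:
            return index
    return -1


# Illustrative numbers only: the second step picks a sensible tool
# but executes it incorrectly.
trajectory = [
    StepScore(0.95, 0.90, 0.92),
    StepScore(0.90, 0.30, 0.85),
    StepScore(0.88, 0.91, 0.80),
]

faulty = first_faulty_step(trajectory)
print(f"first faulty step index: {faulty}")
```

Scoring each aspect separately lets a verifier say not only where a trajectory goes wrong but also whether the failure was in choosing the tool, running it, or reading its result.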
Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
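The two uses described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the candidate format, the min-based `trajectory_score`, and all reward values are made up, and "advantage disappearance" is assumed here to mean the case in group-normalized policy-gradient methods where every sampled response receives the same outcome reward, so every advantage is zero and no learning signal remains.

```python
from statistics import mean, pstdev


def trajectory_score(candidate):
    # Assumed aggregation: score a trajectory by its weakest step.
    return min(candidate["step_scores"])


def best_of_n(candidates, score_fn):
    """Best-of-N selection: keep the candidate the reward model scores highest."""
    return max(candidates, key=score_fn)


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# (1) Test-time scaling: sample N candidates, keep the best-scored one.
candidates = [
    {"answer": "A", "step_scores": [0.9, 0.4, 0.8]},
    {"answer": "B", "step_scores": [0.8, 0.7, 0.9]},
    {"answer": "C", "step_scores": [0.6, 0.6, 0.5]},
]
best = best_of_n(candidates, trajectory_score)

# (2) Sparse outcome rewards: every sampled response fails on a hard problem,
# so all rewards are equal and every advantage collapses to zero.
sparse_advantages = group_advantages([0.0, 0.0, 0.0, 0.0])

# Dense process rewards still separate the same four failed responses
# by how far each one got before going wrong.
dense_advantages = group_advantages([0.2, 0.55, 0.35, 0.7])

print("selected answer:", best["answer"])
print("sparse advantages:", sparse_advantages)
print("dense advantages:", dense_advantages)
```

With identical outcome rewards the group mean equals every reward, so the numerator is zero for each sample; the dense scores differ across samples, which keeps the advantages non-zero.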