Benchmarking Agentic Review Systems
Quick Answer
A study benchmarks agentic review systems and finds that the strongest configuration, OpenAIReview + GPT-5.5, reaches 83.0% pairwise accuracy at tracking paper quality and detects 71.6% of injected errors.
Quick Take
Votes from real users of a public OpenAIReview deployment skew positive, but the most common complaints are false positives and minor nitpicks.
Key Points
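- OpenAIReview + GPT-5.5 is the best-performing configuration, with 83.0% pairwise accuracy at tracking the quality of ICLR/NeurIPS papers, as approximated by citations and acceptance decisions.
- The same configuration detects 71.6% of the errors injected by a perturbation benchmark covering four error categories and eight arXiv subject classes.
- The union of detections across six models reaches 83.3% recall, suggesting that different models catch different errors.
- Votes on comments from a public OpenAIReview deployment skew positive at 1.44 to 1 (about 59% positive); the most common complaints are false positives and minor nitpicks.
- AI reviews already track human quality judgments well and catch important errors, but still have room for improvement.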
Article Content
arXiv:2606.19749v1. Abstract: A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models.
First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall.
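As a rough illustration of the pairwise accuracy metric mentioned above, the sketch below scores how often an AI review score orders two papers the same way as an external quality signal. The function name, the toy data, and the handling of ties (skipping pairs with equal quality, counting tied AI scores as half correct) are assumptions made for this example; the abstract does not specify the paper's exact protocol.

```python
from itertools import combinations


def pairwise_accuracy(ai_scores, quality):
    """Fraction of paper pairs whose AI-score ordering matches the quality ordering."""
    correct = 0.0
    total = 0
    for i, j in combinations(range(len(ai_scores)), 2):
        if quality[i] == quality[j]:
            continue  # no ground-truth ordering for this pair, so skip it
        total += 1
        score_diff = ai_scores[i] - ai_scores[j]
        quality_diff = quality[i] - quality[j]
        if score_diff == 0:
            correct += 0.5  # assumption: a tied AI score counts as a coin flip
        elif (score_diff > 0) == (quality_diff > 0):
            correct += 1.0
    return correct / total if total else float("nan")


# Hypothetical toy data: four papers with AI review scores and a binary accept signal.
ai_scores = [7.5, 4.0, 6.0, 6.5]
accepted = [1, 0, 1, 0]
print(pairwise_accuracy(ai_scores, accepted))  # 3 of 4 comparable pairs agree: 0.75
```

Under this definition a system that scores papers at random would land near 0.5, which is the "chance" level the abstract compares against.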
The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
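The gap between the best single configuration and the union of detections can be made concrete with a small sketch. It assumes each injected error has an identifier and each model reports the set of identifiers it flagged; the model names and error IDs below are hypothetical toy data, not results from the paper.

```python
def recall(detected, injected):
    """Share of injected errors that appear in the detected set."""
    return len(detected & injected) / len(injected)


# Hypothetical ground truth: six injected errors.
injected = {"e1", "e2", "e3", "e4", "e5", "e6"}

# Hypothetical detections per model (illustrative only).
detections = {
    "model_a": {"e1", "e2", "e3"},
    "model_b": {"e2", "e4"},
    "model_c": {"e1", "e5"},
}

per_model = {name: recall(found, injected) for name, found in detections.items()}
union_recall = recall(set().union(*detections.values()), injected)

print(per_model)     # best single model catches 3 of 6 errors
print(union_recall)  # the union catches 5 of 6 errors
```

Union recall can never be lower than the best single model's recall, and it exceeds it only when models catch different errors. That is the pattern the abstract reports, with 83.3% for the union against 71.6% for the strongest configuration.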