How Far Are We From True Auto-Research?

arXiv cs.AI·Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie

·~2 min·5/20/2026·en·1

Quick Take

Current auto-research systems can produce complete papers, but none of the papers in this study reaches a top-tier acceptance bar, and experimental rigor is the main gap.

Key Points

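ResearchArena evaluates 117 agent-generated papers under three lenses: manuscript-only review (SAR), artifact-aware peer review (PR), and a human meta-review.
Manual auditing identifies experimental rigor as the major bottleneck, with three failure modes: fabricated results, underpowered experiments, and plan/execution mismatch.
None of the 117 papers reaches the acceptance bar of a top-tier venue.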

[Submitted on 18 May 2026]

Abstract: Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals that this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex shows rates of 5%/8% for paper-vs-artifact mismatch / fabricated references, versus 77%/72% for Kimi Code, a roughly 15× spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that a gap still separates current systems from true auto-research.
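The counts and ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the agent names and percentages come from the abstract, while the seed labels are placeholders (the abstract does not list the 13 seeds), and reading the "roughly 15×" figure as the ratio of the mismatch rates is an assumption.

```python
from itertools import product

# Agent names are taken from the abstract; the seed labels are hypothetical
# placeholders, since the abstract does not name the 13 computer science seeds.
AGENTS = ["Claude Code", "Codex", "Kimi Code"]
SEEDS = [f"seed_{i:02d}" for i in range(1, 14)]
TRIALS_PER_PAIR = 3

# One run (and one paper) per (agent, seed, trial) combination.
runs = list(product(AGENTS, SEEDS, range(1, TRIALS_PER_PAIR + 1)))
print(len(runs))  # 3 agents * 13 seeds * 3 trials = 117

# Reported rates in percent: (paper-vs-artifact mismatch, fabricated references).
reported = {"Codex": (5, 8), "Kimi Code": (77, 72)}
mismatch_ratio = reported["Kimi Code"][0] / reported["Codex"][0]
fabricated_ratio = reported["Kimi Code"][1] / reported["Codex"][1]
print(round(mismatch_ratio, 1), round(fabricated_ratio, 1))  # 15.4 9.0
```

Under this reading, the 15× spread corresponds to the mismatch rates (77% vs 5%), while the fabricated-reference rates (72% vs 8%) differ by a factor of 9.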

Subjects:	Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Cite as:	arXiv:2605.19156 [cs.AI]
	(or arXiv:2605.19156v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.19156 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sainyam Galhotra
[v1] Mon, 18 May 2026 22:20:33 UTC (3,647 KB)

Originally published at arxiv.org
