Can Generalist Agents Automate Data Curation?
Quick Take
Generalist coding agents can run the data-curation loop: on *Curation-Bench*, in a vision-language instruction-tuning setting, out-of-the-box agents reach strong published data-selection baselines. However, they mostly tune variants of existing policies rather than explore new policy families, so effective exploration requires scaffolding. A scaffolded agent autonomously composed a data-selection policy that outperforms those baselines at one-tenth of their data budget.
Key Points
- Out-of-the-box agents reached strong published data-selection baselines within ten iterations on *Curation-Bench*.
- Trajectory analysis revealed an execution-research gap: agents mainly tune local policy variants rather than explore new policy families.
- Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shifted agents toward method-guided exploration.
- The scaffolded agent autonomously composed a data-selection policy that outperformed strong published baselines at one-tenth their data budget.
- Code and benchmark are open-sourced.
Article Content
From source RSS / original summary
arXiv:2606.04261v1 Announce Type: new
Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop.
We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations.
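The loop described above (implement a policy, submit it to a fixed pipeline, record the score, revise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual interface: `fixed_pipeline`, `threshold_policy`, `run_curation_loop`, and the `quality` field are all hypothetical names, and the "pipeline" is a stand-in that scores a selection by its mean quality instead of training and evaluating a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical sample record: a single precomputed quality score per sample.
Sample = Dict[str, float]
Policy = Callable[[Sequence[Sample]], List[Sample]]


@dataclass
class Result:
    iteration: int
    policy_name: str
    n_selected: int
    score: float


def fixed_pipeline(selected: Sequence[Sample]) -> float:
    """Stand-in for the fixed training/evaluation pipeline (assumption).

    A real pipeline would train the fixed model on the selection and run
    the evaluation suite; here the score is just the mean quality.
    """
    if not selected:
        return 0.0
    return sum(s["quality"] for s in selected) / len(selected)


def threshold_policy(threshold: float) -> Policy:
    """Build a policy that keeps samples with quality >= threshold."""
    def policy(pool: Sequence[Sample]) -> List[Sample]:
        return [s for s in pool if s["quality"] >= threshold]
    return policy


def run_curation_loop(
    pool: Sequence[Sample],
    candidates: Sequence[Tuple[str, Policy]],
    max_iterations: int = 10,
) -> List[Result]:
    """Submit one candidate policy per iteration and record its score."""
    history: List[Result] = []
    for i, (name, policy) in enumerate(candidates[:max_iterations], start=1):
        selected = policy(pool)
        history.append(Result(i, name, len(selected), fixed_pipeline(selected)))
    return history


# Toy pool with quality scores 0.0, 0.1, ..., 0.9.
pool = [{"quality": q / 10} for q in range(10)]
candidates = [(f"threshold>={t}", threshold_policy(t)) for t in (0.0, 0.5, 0.8)]
history = run_curation_loop(pool, candidates)
best = max(history, key=lambda r: r.score)
print(best.policy_name, best.n_selected)
```

Note that the three candidates here are variants of a single policy family that differ only in one threshold, so the sketch shows the mechanics of the loop rather than any exploration of new policy families.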
However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget.
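One way to picture a scaffold of this kind is as a gate that rejects any iteration whose proposal does not cite a prior method, say how it is instantiated, and say how it is adapted. The sketch below is an assumption about what such a check could look like, not the paper's scaffold: the `Proposal` fields and `missing_fields` are hypothetical, and the example values are placeholders rather than real methods.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    """Hypothetical per-iteration record an agent must fill in."""
    cited_method: str   # which prior method the iteration builds on
    instantiation: str  # how that method is implemented on this data pool
    adaptation: str     # what is changed relative to the cited method


def missing_fields(proposal: Proposal) -> List[str]:
    """Return the names of required fields that are empty or whitespace."""
    required = {
        "cited_method": proposal.cited_method,
        "instantiation": proposal.instantiation,
        "adaptation": proposal.adaptation,
    }
    return [name for name, value in required.items() if not value.strip()]


def accept(proposal: Proposal) -> bool:
    """Accept an iteration only if it cites, instantiates, and adapts."""
    return not missing_fields(proposal)


# Placeholder values; a threshold tweak with no cited method is rejected.
local_tweak = Proposal(cited_method="", instantiation="raise threshold", adaptation="")
guided = Proposal(
    cited_method="prior-method-A",
    instantiation="score each sample with method A's criterion",
    adaptation="combine the score with a diversity filter",
)
print(missing_fields(local_tweak), accept(guided))
```

A check like this only enforces the form of a proposal; whether the cited method is actually followed would still need to be verified from the agent's trajectory.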
Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.