Can Generalist Agents Automate Data Curation?

arXiv cs.AI·Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

6/4/2026

Quick Answer

This paper shows that generalist coding agents can automate the data-curation loop, reaching strong data-selection baselines on vision-language tasks in *Curation-Bench*.

Quick Take

The agents primarily refine existing policies rather than innovate, so scaffolded methods are needed for effective exploration. With scaffolding, agents autonomously developed data-selection policies that outperformed baselines at a fraction of the data budget.

Key Points

Agents reached strong data-selection baselines within ten iterations on *Curation-Bench*.
An execution-research gap emerged: agents mainly tuned local variants of existing policies.
Scaffolded methods shifted agents toward method-guided exploration.
An autonomously composed data-selection policy outperformed baselines at one-tenth the data budget.
The code and benchmark are available as open source.
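
To make the idea of iterating on a data-selection policy concrete, here is a minimal Python sketch of a propose, evaluate, revise loop. It is not the Curation-Bench interface or the paper's method: the function names, the per-sample quality score, the candidate fractions, and the `toy_evaluate` stand-in for a fixed training/evaluation pipeline are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical sample record: (sample id, assumed quality score in [0, 1]).
Example = Tuple[str, float]


def select_top_fraction(pool: List[Example], fraction: float) -> List[Example]:
    """Toy data-selection policy: keep the highest-scoring fraction of the pool."""
    k = max(1, int(len(pool) * fraction))
    return sorted(pool, key=lambda ex: ex[1], reverse=True)[:k]


def toy_evaluate(selected: List[Example]) -> float:
    """Stand-in for a fixed train/eval pipeline: mean quality of the selection."""
    return sum(score for _, score in selected) / len(selected)


def curation_loop(
    pool: List[Example],
    fractions: List[float],
    evaluate: Callable[[List[Example]], float],
    max_iters: int = 10,
) -> Tuple[Optional[float], float]:
    """Try one policy variant per iteration and keep the best-scoring one."""
    best_fraction: Optional[float] = None
    best_score = float("-inf")
    for fraction in fractions[:max_iters]:
        score = evaluate(select_top_fraction(pool, fraction))
        if score > best_score:
            best_fraction, best_score = fraction, score
    return best_fraction, best_score


# Synthetic pool with random quality scores (illustrative data only).
rng = random.Random(0)
pool = [(f"sample-{i}", rng.random()) for i in range(1000)]
best_fraction, best_score = curation_loop(pool, [1.0, 0.5, 0.25, 0.1], toy_evaluate)
print(best_fraction, round(best_score, 3))
```

Note that this loop only sweeps one parameter of a single policy, which is the kind of local variant tuning the key points describe; composing a genuinely new policy would mean changing the selection function itself.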

Source Excerpt

From the original publisher, up to about 700 characters

arXiv:2606.04261v1. Announce Type: new. Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop.

We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. …


