Data and Evaluation Closed-Loop for Model Capability Enhancement
Quick Answer
The study introduces the 'capability slice' to bridge the gap between model evaluation and data optimization, demonstrating its effectiveness in two case studies.
Quick Take
The capability slice links an evaluation failure to a data decision, and the paper tests it on two case studies that point in opposite directions. In the first, continued pre-training cut BBH by 46.82%; diagnosis traced the drop to a single masked <EOS> loss rather than weakened reasoning, and restoring that loss brought BBH to 66.44, above the original checkpoint, without changing the data. In the second, a weakness-targeted sampling procedure raised AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each.
Key Points
- Capability slice localizes model weaknesses for targeted data interventions.
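- BBH recovered to 66.44, above the original checkpoint, after diagnosis traced a 46.82% drop to a single masked <EOS> loss; the data was left unchanged.
- AIME2025/AIME2026 Pass@128 rose from 6.67/0.00 to 26.67 each with weakness-targeted sampling.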
- Evaluation-to-data inference can be routine, auditable, and validated.
- The same unmodified loop reached opposite, correct verdicts in the two case studies: one ruled the data out, the other ruled it in.
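To make the first key point concrete, here is a minimal sketch of how evaluation samples might be grouped into capability slices and ranked by accuracy. The four fields follow the abstract's definition (background condition, task type, solving operation, output constraint); the names `SliceKey`, `slice_report`, and `min_count`, as well as the example labels, are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceKey:
    """Hypothetical slice identifier built from the four shared attributes."""
    background: str
    task_type: str
    operation: str
    output_constraint: str


def slice_report(samples, min_count=2):
    """Aggregate per-sample correctness into per-slice accuracy.

    samples: iterable of (SliceKey, bool) pairs.
    min_count: assumed threshold that drops slices too small to be stable.
    Returns (key, accuracy, count) rows sorted from weakest to strongest.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for key, correct in samples:
        counts[key][0] += int(correct)
        counts[key][1] += 1
    report = [
        (key, c / n, n) for key, (c, n) in counts.items() if n >= min_count
    ]
    return sorted(report, key=lambda row: row[1])


if __name__ == "__main__":
    case_analysis = SliceKey("algebra", "word problem", "case analysis", "integer")
    substitution = SliceKey("algebra", "word problem", "substitution", "integer")
    samples = [
        (case_analysis, False), (case_analysis, False), (case_analysis, True),
        (substitution, True), (substitution, True), (substitution, False),
        (substitution, True),
    ]
    for key, acc, n in slice_report(samples):
        print(f"{key.operation:15s} acc={acc:.2f} n={n}")
```

The weakest slices at the top of the report are the candidates that would be mapped to a data intervention; the mapping rules themselves are specific to the paper and are not sketched here.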
Paper Resources
Abstract: Model capability is the central variable in LLM pre-training, yet is never observed directly: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs this backward: a failure is observed first, and the engineer must infer the corpus fix. The two sides speak incompatible vocabularies -- benchmark names and per-sample correctness versus data sources, domains, and quality labels -- so this inference is usually intuition, not method. We close this gap with the capability slice: a group of evaluation samples sharing background condition, task type, solving operation, and output constraint -- precise enough to localize a single weakness yet stable enough to survive aggregation, unlike a benchmark name, too coarse, or a single sample, too noisy. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop turning a benchmark-level failure into a targeted, testable data intervention. We test this loop on two case studies pulling in opposite directions. First, the loop rules the data out: continued pre-training drives BBH down by 46.82%, but diagnosis traces this to a single masked <EOS> loss rather than weakened reasoning; restoring it recovers BBH to 66.44, above the original checkpoint, without changing the data. Second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each. The same unmodified loop reaches opposite, correct verdicts in both cases, showing the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.
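The abstract reports the AIME results as Pass@128 but does not say how the metric is computed. As a hedged illustration, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples per problem of which c are correct; whether the paper uses this estimator or a different protocol is an assumption. The function names and the example counts are hypothetical. The reported values are arithmetically consistent with a 30-problem set (8/30 is about 26.67%), although the source does not state the problem count.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Mean pass@k over problems, as a percentage.

    results: list of (n, c) pairs, one per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run: 30 problems, 128 samples each, 8 problems solved
    # at least once and 22 never solved.
    results = [(128, 1)] * 8 + [(128, 0)] * 22
    print(round(benchmark_pass_at_k(results, 128), 2))
```

With these hypothetical counts the script prints 26.67. When k equals n, the estimator reduces to the fraction of problems with at least one correct sample.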
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2606.28471 [cs.AI] (or arXiv:2606.28471v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.28471 (arXiv-issued DOI via DataCite)
Submission history
From: Zhixuan Li
[v1] Fri, 26 Jun 2026 14:45:57 UTC (1,600 KB)
— Originally published at arxiv.org