CodeAlchemy: Synthetic Code Rewriting at Scale
Quick Answer
CodeAlchemy is a synthetic data generation framework that turns publicly sourced code into more than 500B tokens of semantically rich training data. The authors report that their 3B models outperform frontier models roughly 10x their size.
Quick Take
Frontier models such as Claude Sonnet 4.5 achieve only 5.6% exact match on the new TraceEval benchmark, which the authors read as evidence of critical gaps in semantic understanding. Meanwhile, the authors' 3B models outperform models roughly 10x their size, including 27B Gemma-3 and 32B Granite-4.0.
Key Points
- CodeAlchemy employs five strategies for synthetic data generation: CodeEnhance, CodeQA, CodeDev, CodeDialogue, and CodeTrace.
- The framework processes 3 corpora across 15 languages, generating 500B+ tokens of synthetic data plus 350B reasoning tokens.
- CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge.
- The new DevEval (developer tasks) and TraceEval (execution prediction) benchmarks expose significant gaps in current frontier models.
- The authors' 3B models reach 83.5% on HumanEval, 63.2% on MBPP, and 15.36 ROUGE-2 on TraceEval, outperforming models roughly 10x their size.
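The paper's abstract reports TraceEval results as exact match and ROUGE-2. As a rough illustration of how these two kinds of score differ, here is a minimal sketch. It assumes whitespace tokenization and an F1-style bigram overlap; the paper's actual normalization and ROUGE variant are not given in this summary, and the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the prediction equals the reference (ignoring outer whitespace)."""
    return prediction.strip() == reference.strip()


def bigrams(text: str) -> Counter:
    """Count adjacent token pairs, using plain whitespace tokenization (an assumption)."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction: str, reference: str) -> float:
    """Bigram-overlap F1 between prediction and reference, in the range 0.0 to 1.0."""
    pred, ref = bigrams(prediction), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "x = 1 y = 2"
    prediction = "x = 1 y = 3"
    print(exact_match(prediction, reference))
    print(rouge2_f1(prediction, reference))
```

A prediction that gets most of an execution trace right but misses one value fails exact match entirely while still earning partial bigram credit, which is why the two metrics can sit far apart.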
Article Content
From source RSS / original summary
arXiv:2606.10087v1 Announce Type: new
Abstract: Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements.
We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts.
CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.
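To make the idea of an execution trace concrete, the following is a minimal sketch of recording control flow and variable state for a single Python function using the standard library's `sys.settrace`. It is not the paper's instrumentation, which covers 14 languages and is not described in this summary; `trace_execution` and `sum_of_evens` are hypothetical names chosen for illustration.

```python
import sys
from typing import Any, Callable


def trace_execution(func: Callable[..., Any], *args: Any) -> tuple[Any, list[dict]]:
    """Run func(*args) and record each executed line with a snapshot of its locals."""
    events: list[dict] = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under study.
        if frame.f_code is not target_code:
            return None
        if event == "line":
            # A "line" event fires just before the line runs, so the snapshot
            # shows the state on entry to that line.
            events.append({
                "line": frame.f_lineno - target_code.co_firstlineno,
                "locals": dict(frame.f_locals),
            })
        elif event == "return":
            events.append({"line": None, "return": arg})
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was active before.
        sys.settrace(previous)
    return result, events


def sum_of_evens(limit: int) -> int:
    total = 0
    for i in range(limit):
        if i % 2 == 0:
            total += i
    return total


if __name__ == "__main__":
    result, events = trace_execution(sum_of_evens, 5)
    for entry in events:
        print(entry)
    print("result:", result)
```

The sequence of line offsets shows which branches and loop iterations ran (control flow), and the local-variable snapshots show how values evolved (state tracking). An execution-prediction task in the spirit of TraceEval would ask a model to produce such information without running the code.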