CodeAlchemy: Synthetic Code Rewriting at Scale

arXiv cs.CL·Ankit Gupta, Aditya Prasad, Rameswar Panda

6/10/2026


Quick Answer

CodeAlchemy is a synthetic data generation framework that transforms publicly sourced code into over 500B tokens of semantically rich training data, with models trained on it reported to significantly outperform existing models.

Quick Take

Frontier models such as Claude Sonnet 4.5 score only 5.6% on the new TraceEval benchmark, and smaller models surpass larger ones, which points to critical gaps in semantic understanding.

Key Points

CodeAlchemy employs five strategies for synthetic data generation, enhancing code quality and diversity.
The framework processes 3 corpora in 15 languages, generating 500B+ tokens and 350B reasoning tokens.
CodeTrace executes 1.3M+ files across 14 languages, capturing control flow and tracking program state.
DevEval and TraceEval benchmarks reveal significant performance gaps in current frontier models.
Smaller models outperform larger ones, achieving 83.5% on HumanEval and 15.36 ROUGE-2 on TraceEval.
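
For readers unfamiliar with the ROUGE-2 figure quoted above, here is a minimal sketch of how a bigram-overlap score can be computed. The summary does not say how TraceEval tokenizes text or whether it reports recall or F1, so the whitespace tokenization, the `rouge_2` function name, and the example strings are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(candidate, reference):
    """Bigram-overlap precision, recall and F1 (whitespace tokens: an assumption)."""
    cand = bigrams(candidate.split())
    ref = bigrams(reference.split())
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical predicted trace vs. reference trace, differing in the last token.
scores = rouge_2("x = 1 then y = 2", "x = 1 then y = 3")
print(scores)
```

Five of the six bigrams match in this toy pair, so precision, recall and F1 all come out to 5/6. A single wrong value in a predicted trace therefore costs only one bigram, which is one reason overlap metrics can be lenient on traces that are mostly boilerplate.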

Paper Resources

Read Paper: arxiv.org
View PDF: arxiv.org

Source Excerpt

arXiv:2606.10087v1. Announce Type: new. Abstract: Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements.

We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). …
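
To make "execution traces" concrete, the following is a rough sketch of recording line-by-line control flow and variable state for a single Python function using the standard `sys.settrace` hook. It is not the CodeTrace pipeline, whose tooling the excerpt does not describe; the helper `trace_execution` and the sample function `running_sum` are hypothetical names introduced here.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args); record (line offset, locals snapshot) before each line runs."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under study.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Restore whatever tracer (debugger, coverage tool) was active before.
        sys.settrace(previous)
    return result, events


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, events = trace_execution(running_sum, 3)
for offset, local_vars in events:
    print(offset, local_vars)
```

Each recorded event pairs a line offset within the function (control flow) with a snapshot of its local variables (state), which are the two ingredients the summary attributes to CodeTrace. A multi-language system would need a comparable hook or instrumented runtime for each language.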

Read on arxiv.org


