EmbGen: Teaching with Reassembled Corpora

arXiv cs.CL·Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva



Quick Take

EmbGen generates synthetic instruction-tuning data by decomposing a domain corpus into entity-description pairs and reassembling them using semantic structure inferred from embedding similarity.

Key Points

Generates QA pairs from reassembled entity-description pairs.
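Improves Binary Accuracy on the most heterogeneous dataset by 12.5% (5M tokens) and 88.9% (20M tokens) relative to the strongest baseline.
Evaluated against EntiGraph, InstructLab, and Knowledge-Instruct under fixed token budgets of 5 and 20 million tokens.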


[Submitted on 19 May 2026]


Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation, we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and by 88.9% at the 20M-token budget, relative to the strongest baseline, while remaining competitive on the other datasets with lower heterogeneity.
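
The abstract names three sampling strategies (proximity, intra-cluster, inter-cluster) without giving their details. The following is a minimal illustrative sketch of what such strategies could look like, not the authors' implementation. All function names, the toy embeddings, and the cluster labels are assumptions made for illustration; the clustering step, the teacher LLM call, and the cluster-specialized system prompts are omitted.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proximity_pairs(embeddings, k=1):
    """Pair each item with its k most similar neighbours (hypothetical helper)."""
    pairs = set()
    for i, u in enumerate(embeddings):
        ranked = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(u, embeddings[j]),
            reverse=True,
        )
        for j in ranked[:k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return sorted(pairs)


def intra_cluster_pairs(labels, n, rng):
    """Sample up to n pairs whose members share a cluster label."""
    candidates = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    return rng.sample(candidates, min(n, len(candidates)))


def inter_cluster_pairs(labels, n, rng):
    """Sample up to n pairs whose members come from different clusters."""
    candidates = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] != labels[j]
    ]
    return rng.sample(candidates, min(n, len(candidates)))


# Toy data (assumed): one 2-D embedding per entity-description pair,
# plus cluster labels that some clustering step is assumed to have produced.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
rng = random.Random(0)

near = proximity_pairs(embeddings, k=1)
intra = intra_cluster_pairs(labels, 2, rng)
inter = inter_cluster_pairs(labels, 2, rng)
```

In a pipeline of this kind, each sampled index pair would select the entity-description pairs handed to the teacher LLM as context for one QA pair. Inter-cluster pairs are the ones that could encourage questions spanning unrelated passages or documents.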

Comments:	8 pages, 4 images (32 pages with appendix)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
MSC classes:	68T05: Learning and adaptive systems
Cite as:	arXiv:2605.19394 [cs.CL]
	(or arXiv:2605.19394v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19394 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Anna Leontjeva
[v1] Tue, 19 May 2026 05:40:12 UTC (2,573 KB)

— Originally published at arxiv.org
