EmbGen: Teaching with Reassembled Corpora
Quick Take
EmbGen generates synthetic instruction-tuning data by decomposing a domain corpus into entity-description pairs and reassembling them according to semantic structure inferred from embedding similarity.
Key Points
- Generates QA pairs from reassembled entity-description pairs.
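- Improves Binary Accuracy on the most heterogeneous dataset by 12.5% (5M-token budget) and 88.9% (20M-token budget) relative to the strongest baseline.
- Evaluated against EntiGraph, InstructLab, and Knowledge-Instruct under fixed token budgets of 5 and 20 million tokens.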
Abstract: Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation, we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composite metric that combines Factual Accuracy and Completeness. Relative to the strongest baseline, EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and by 88.9% at the 20M-token budget, while remaining competitive on the other, less heterogeneous datasets.
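The three sampling modes named in the abstract (proximity, intra-cluster, and inter-cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, and the hand-assigned cluster labels are all assumptions made for illustration, and the abstract does not specify how clusters are formed or how many neighbours are used. Each function returns index pairs of entity-description items that a teacher LLM could then be prompted with.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def proximity_pairs(embs, k=1):
    """Pair each item with its k most similar items (hypothetical helper)."""
    pairs = set()
    for i, u in enumerate(embs):
        others = [j for j in range(len(embs)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(u, embs[j]), reverse=True)
        for j in ranked[:k]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

def intra_cluster_pairs(labels):
    """All pairs whose items share a cluster label (hypothetical helper)."""
    return [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]]

def inter_cluster_pairs(labels):
    """All pairs whose items come from different clusters (hypothetical helper)."""
    return [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] != labels[j]]

# Toy data (assumed): four entity-description embeddings in 2-D,
# with cluster labels assigned by hand rather than by a clustering algorithm.
embs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = [0, 0, 1, 1]

print(proximity_pairs(embs, k=1))
print(intra_cluster_pairs(labels))
print(inter_cluster_pairs(labels))
```

In this sketch, inter-cluster pairs are the ones that would push a generator toward questions spanning semantically distant material, which is where the abstract reports the largest gains on the most heterogeneous dataset.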
| Comments: | 8 pages, 4 images (32 pages with appendix) |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| MSC classes: | 68T05: Learning and adaptive systems |
| Cite as: | arXiv:2605.19394 [cs.CL] (or arXiv:2605.19394v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19394 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Anna Leontjeva
[v1] Tue, 19 May 2026 05:40:12 UTC (2,573 KB)
Originally published at arxiv.org