Labeling Training Data for Entity Matching Using Large Language Models
Quick Take
This paper explores using large language models (LLMs) to label training data for entity matching. Using GPT-5.2 to label the training sets for all five benchmarks (Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar) costs $28.31 to $40.88, whereas labeling the same sets manually is estimated to take 470 hours. Student models trained on the machine-labeled data perform roughly on par with models trained on the benchmark training sets, with differences in either direction staying below two F1 points.
Key Points
- LLMs such as GPT-5.2, used as teacher models, label training pairs far more cheaply than manual annotation.
- Student models trained on machine-labeled data perform roughly on par with benchmark-trained models, with differences below two F1 points in either direction.
- Labeling the training sets for all five benchmarks with GPT-5.2 costs between $28.31 and $40.88.
- Manually labeling the same training sets is estimated to take 470 hours.
- At inference time, Ditto is 41.5 to 534 times faster than using an LLM directly for matching.
Abstract: Recent large language models (LLMs) achieve strong performance on entity matching without requiring task-specific training data. However, applying these models to large sets of candidate pairs remains slow and costly. In contrast, entity matchers using traditional machine learning methods or small language models (SLMs), such as RoBERTa, offer much faster inference but require task-specific training data.
This paper investigates whether the need to provide task-specific training data can be avoided by using knowledge-distillation workflows, in which an LLM serves as a teacher model to label training pairs that are subsequently used to train a smaller student model. We investigate knowledge distillation for entity matching along the following dimensions: pair-selection strategy, teacher model, label post-processing method, and student model. We evaluate the workflows using the Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar benchmarks, and compare the performance of student models trained with machine-labeled data to the performance of the same models trained using the benchmark training sets.
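The teacher-student workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `stub_teacher` is a hypothetical stand-in for an LLM call, the "student" is a one-parameter similarity threshold rather than an SLM such as RoBERTa, and `TOY_PAIRS` is made-up data. Pair selection and label post-processing, which the paper also studies, are omitted.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]

# Made-up candidate pairs, for illustration only.
TOY_PAIRS: List[Pair] = [
    ("sony bravia 40 inch tv", "sony bravia tv 40 inch"),
    ("canon eos camera", "canon eos digital camera"),
    ("apple ipod nano", "samsung galaxy phone"),
    ("hp laser printer", "hp ink cartridge"),
]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def stub_teacher(pair: Pair) -> int:
    """Hypothetical placeholder for an LLM teacher: 1 = match, 0 = non-match."""
    return int(jaccard(*pair) >= 0.5)


def label_pairs(pairs: List[Pair], teacher: Callable[[Pair], int]) -> List[Tuple[Pair, int]]:
    """Step 1: the teacher labels the selected training pairs."""
    return [(pair, teacher(pair)) for pair in pairs]


def train_student(labeled: List[Tuple[Pair, int]]) -> float:
    """Step 2: fit a cheap student on the machine-labeled pairs.

    Here the student is just the similarity threshold with the highest
    training accuracy; a real student would be a trained matcher.
    """
    candidates = sorted({jaccard(*pair) for pair, _ in labeled})
    best_threshold, best_correct = 1.0, -1
    for t in candidates:
        correct = sum(int(jaccard(*pair) >= t) == label for pair, label in labeled)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


def student_predict(threshold: float, pair: Pair) -> int:
    """Step 3: fast inference with the student, no teacher call needed."""
    return int(jaccard(*pair) >= threshold)


labeled = label_pairs(TOY_PAIRS, stub_teacher)
threshold = train_student(labeled)
```

The point of the structure is that the expensive teacher is called once per training pair, while every later candidate pair is scored only by the cheap student.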
Our experiments show that student models trained using the machine-labeled sets perform approximately on par with models trained on the benchmark training sets, with the remaining differences in both directions staying below two F1 points. Using GPT-5.2 to label the training sets for all five benchmarks costs US$28.31 to US$40.88, whereas manually labeling the same training sets is estimated to require 470 hours of work. At inference time, Ditto is 41.5 to 534 times faster than directly using an LLM to perform the matching tasks.
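For readers unfamiliar with the metric, the comparison "below two F1 points" can be made concrete with a short sketch. It assumes binary match/non-match labels and the standard F1 definition; the function names and the label vectors in the usage lines are hypothetical and are not results from the paper.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (match) class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


def f1_point_gap(f1_a: float, f1_b: float) -> float:
    """Absolute difference between two F1 scores, in points on a 0-100 scale."""
    return abs(f1_a - f1_b) * 100


# Hypothetical gold labels and predictions from two student models.
gold = [1, 1, 0, 0, 1]
student_machine_labeled = [1, 0, 0, 1, 1]
student_benchmark_labeled = [1, 1, 0, 1, 1]

gap = f1_point_gap(
    f1_score(gold, student_machine_labeled),
    f1_score(gold, student_benchmark_labeled),
)
```

A gap under 2.0 on this scale corresponds to the "below two F1 points" threshold quoted in the abstract.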
These results indicate that current LLMs, when combined with a suitable pair-selection method, can substantially reduce or even eliminate the manual effort required to label use case-specific training data for entity matching.
| Comments: | 13 pages, 5 figures, 9 tables |
| Subjects: | Computation and Language (cs.CL) |
| ACM classes: | H.2.8; I.2.7; I.2.6 |
| Cite as: | arXiv:2606.28823 [cs.CL] (or arXiv:2606.28823v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28823 (arXiv-issued DOI via DataCite) |
Submission history
From: Aaron Steiner
[v1] Sat, 27 Jun 2026 09:15:09 UTC (1,189 KB)
— Originally published at arxiv.org