Labeling Training Data for Entity Matching Using Large Language Models

arXiv cs.CL·Aaron Steiner, Christian Bizer

6/30/2026

Quick Take

This paper explores using large language models (LLMs) to label training data for entity matching. Labeling the training sets of all five benchmarks (Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar) with GPT-5.2 costs US$28.31 to US$40.88, whereas manually labeling the same sets is estimated to take 470 hours. Student models trained on the machine-labeled data perform approximately on par with models trained on the benchmark training sets, with differences in both directions staying below two F1 points.

Key Points

GPT-5.2, used as a teacher model, labels training pairs at far lower cost than manual labeling.
Student models trained on machine-labeled data perform approximately on par with models trained on the benchmark training sets, within two F1 points in either direction.
Labeling the training sets for all five benchmarks with GPT-5.2 costs US$28.31 to US$40.88.
Manually labeling the same training sets is estimated to take 470 hours.
At inference time, Ditto is 41.5 to 534 times faster than matching directly with an LLM.
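
To put the reported figures side by side, the following Python sketch works through the arithmetic. The three constants come from the abstract; the helper names and the one-minute Ditto workload are illustrative assumptions, not numbers from the paper.

```python
# Figures reported in the abstract.
LLM_COST_USD = (28.31, 40.88)   # GPT-5.2 labeling cost for all five benchmarks
MANUAL_HOURS = 470              # estimated manual effort for the same training sets
DITTO_SPEEDUP = (41.5, 534)     # Ditto vs. direct LLM matching at inference time


def cost_per_manual_hour(cost_usd: float, hours: float) -> float:
    """LLM labeling cost expressed per hour of manual work it replaces."""
    return cost_usd / hours


def llm_runtime(ditto_runtime: float, speedup: float) -> float:
    """Runtime of direct LLM matching implied by a Ditto runtime and a speedup factor."""
    return ditto_runtime * speedup


low, high = (cost_per_manual_hour(c, MANUAL_HOURS) for c in LLM_COST_USD)
print(f"LLM labeling cost per replaced manual hour: ${low:.3f} to ${high:.3f}")

# Hypothetical workload (an assumption): Ditto needs 1 minute for a candidate set.
for factor in DITTO_SPEEDUP:
    minutes = llm_runtime(1.0, factor)
    print(f"speedup {factor}x: direct LLM matching takes about {minutes:.1f} minutes")
```

Under these figures, the LLM labeling cost works out to roughly six to nine US cents per hour of manual labeling replaced.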

[Submitted on 27 Jun 2026]

Abstract: Recent large language models (LLMs) achieve strong performance on entity matching without requiring task-specific training data. However, applying these models to large sets of candidate pairs remains slow and costly. In contrast, entity matchers using traditional machine learning methods or small language models (SLMs), such as RoBERTa, offer much faster inference but require task-specific training data.
This paper investigates whether the need to provide task-specific training data can be avoided by using knowledge-distillation workflows, in which an LLM serves as a teacher model to label training pairs that are subsequently used to train a smaller student model. We investigate knowledge distillation for entity matching along the following dimensions: pair-selection strategy, teacher model, label post-processing method, and student model. We evaluate the workflows using the Abt-Buy, Walmart-Amazon, WDC Products, DBLP-ACM, and DBLP-Scholar benchmarks, and compare the performance of student models trained with machine-labeled data to the performance of the same models trained using the benchmark training sets.
Our experiments show that student models trained using the machine-labeled sets perform approximately on par with models trained on the benchmark training sets, with the remaining differences in both directions staying below two F1 points. Using GPT-5.2 to label the training sets for all five benchmarks costs US$28.31 to US$40.88, whereas manually labeling the same training sets is estimated to require 470 hours of work. At inference time, Ditto is 41.5 to 534 times faster than directly using an LLM to perform the matching tasks.
These results indicate that current LLMs, when combined with a suitable pair-selection method, can substantially reduce or even eliminate the manual effort required to label use case-specific training data for entity matching.
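
The teacher-student workflow described in the abstract can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `teacher_label` is a hypothetical stand-in for an LLM call (a token-overlap rule, not the paper's prompt or model), the student is a single similarity threshold rather than RoBERTa or Ditto, the example records are invented, and pair selection and label post-processing are omitted.

```python
from collections import namedtuple

Pair = namedtuple("Pair", "left right")


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def teacher_label(pair: Pair) -> int:
    """Hypothetical stand-in for the LLM teacher (assumption, not the paper's method)."""
    return int(jaccard(pair.left, pair.right) >= 0.5)


def f1(gold, pred) -> float:
    """F1 score for the positive (match) class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def train_student(pairs, labels):
    """Toy student: choose the similarity threshold with the best F1 on machine labels."""
    sims = [jaccard(p.left, p.right) for p in pairs]
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(sims)):
        score = f1(labels, [int(s >= t) for s in sims])
        if score > best_f1:
            best_t, best_f1 = t, score
    return lambda pair: int(jaccard(pair.left, pair.right) >= best_t)


# Step 1: unlabeled candidate pairs (invented examples).
unlabeled = [
    Pair("sony bravia 40 inch tv", "sony bravia tv 40 inch"),
    Pair("canon eos 5d camera", "canon eos 5d digital camera"),
    Pair("apple ipod nano 8gb", "samsung galaxy s3 case"),
    Pair("hp laserjet 1020 printer", "hp inkjet 2050 printer"),
]

# Step 2: the teacher labels the pairs; Step 3: the student trains on those labels.
labels = [teacher_label(p) for p in unlabeled]
student = train_student(unlabeled, labels)

# Step 4: evaluate the student on held-out pairs with gold labels.
test_pairs = [
    Pair("nikon d90 dslr camera", "nikon d90 dslr camera body"),
    Pair("dell xps 13 laptop", "lenovo thinkpad x1 laptop"),
]
gold = [1, 0]
preds = [student(p) for p in test_pairs]
print("student F1 on held-out pairs:", f1(gold, preds))
```

In a realistic setup the teacher would be an LLM queried per pair and the student a fine-tuned small language model; only the student would then be run over the full candidate set, which is where the inference speedup comes from.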

Comments:	13 pages, 5 figures, 9 tables
Subjects:	Computation and Language (cs.CL)
ACM classes:	H.2.8; I.2.7; I.2.6
Cite as:	arXiv:2606.28823 [cs.CL]
	(or arXiv:2606.28823v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.28823 arXiv-issued DOI via DataCite

Submission history

From: Aaron Steiner
[v1] Sat, 27 Jun 2026 09:15:09 UTC (1,189 KB)

Originally published at arxiv.org
