Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Quick Take
This survey addresses Pretraining Data Exposure (PDE) in Large Language Models (LLMs) and its implications for privacy and evaluation integrity. It unifies the study of membership inference and data contamination, formalizes PDE across exposure levels, reviews attack and defense methods, and outlines open research challenges.
Key Points
- PDE refers to determining whether specific data appeared in an LLM's pretraining corpus.
- The paper synthesizes empirical findings and highlights open research challenges.
- Membership inference and data contamination are critical areas intersecting with PDE.
- The paper presents itself as the first unified survey of both areas under the PDE framework.
- It formalizes PDE across exposure levels and reviews attack and defense methods.
Article Excerpt
From source RSS / original summary: arXiv:2605.26133v1 (Announce Type: new)
Abstract: Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus.
It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.
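As a rough illustration of what a data contamination check can look like, the sketch below measures n-gram overlap between a benchmark sample and a set of corpus documents. It is not taken from the survey: the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n=3` are all assumptions made for this example, and it presumes direct access to the corpus, which is often unavailable for LLM pretraining data.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample, corpus_docs, n=3):
    """Fraction of the sample's n-grams that also occur in any corpus document.

    Hypothetical helper for illustration only: a high ratio suggests the
    sample may have been exposed in the corpus; it does not prove it.
    """
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        # Sample is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(overlap_ratio("quick brown fox jumps", corpus))
    print(overlap_ratio("quick brown fox sleeps", corpus))
```

A ratio near 1 flags a sample worth inspecting by hand, while a ratio near 0 only shows the absence of verbatim overlap; paraphrased or reformatted copies would pass unnoticed under this simple scheme.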
