Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics
Quick Answer
The study introduces a bilayer SIR/SIRS model of synthetic data contamination across the AI ecosystem. Illustrative calibration yields supercritical dynamics ($R_0 > 1$), and detection-based filtering emerges as a key intervention strategy.
Quick Take
GPT-2 contamination-chain experiments show dose-response degradation and diversity loss, underscoring the risk of model collapse from cross-contamination between models and shared corpora.
Key Points
- Proposes a bilayer model treating data and AI models as interacting populations.
- Identifies synthetic-text detection as the highest-leverage parameter in collapse dynamics.
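- Matched-budget experiments give suggestive evidence that multi-source mixing modestly attenuates collapse, though the effect vanishes at lower contamination fractions.
- Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
- A bipartite-network agent-based model agrees with the mean-field model ($R^2 > 0.96$) on dense networks, with agreement degrading under heterogeneity.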
Article Content
arXiv:2606.05168v1. Abstract: Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora.
We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination.
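The abstract does not spell out the governing equations, so the following is only a minimal sketch of what a bilayer SIRS system of this kind could look like. It assumes normalized populations (each layer sums to 1), that corpora are contaminated by infected models and models by infected corpora, that $\mu_D$ and $\mu_M$ act as turnover rates, and that immunity wanes at hypothetical rates `omega_d` and `omega_m` (symbols not taken from the source). All parameter values are illustrative, not the paper's calibration.

```python
import numpy as np

def bilayer_sirs_rhs(y, p):
    """Right-hand side of an assumed bilayer SIRS system.

    y = (S_D, I_D, R_D, S_M, I_M, R_M): data-layer and model-layer fractions.
    Cross-layer transmission: infected models contaminate corpora (beta_d),
    infected corpora contaminate models (beta_m). This form is an assumption.
    """
    s_d, i_d, r_d, s_m, i_m, r_m = y
    new_d = p["beta_d"] * s_d * i_m   # newly contaminated corpora
    new_m = p["beta_m"] * s_m * i_d   # newly contaminated models
    return np.array([
        p["mu_d"] * (1.0 - s_d) - new_d + p["omega_d"] * r_d,
        new_d - (p["gamma_d"] + p["mu_d"]) * i_d,
        p["gamma_d"] * i_d - (p["omega_d"] + p["mu_d"]) * r_d,
        p["mu_m"] * (1.0 - s_m) - new_m + p["omega_m"] * r_m,
        new_m - (p["gamma_m"] + p["mu_m"]) * i_m,
        p["gamma_m"] * i_m - (p["omega_m"] + p["mu_m"]) * r_m,
    ])

def simulate(y0, p, t_end=200.0, dt=0.05):
    """Integrate with a fixed-step classical Runge-Kutta (RK4) scheme."""
    steps = int(round(t_end / dt))
    y = np.asarray(y0, dtype=float)
    traj = np.empty((steps + 1, 6))
    traj[0] = y
    for k in range(steps):
        k1 = bilayer_sirs_rhs(y, p)
        k2 = bilayer_sirs_rhs(y + 0.5 * dt * k1, p)
        k3 = bilayer_sirs_rhs(y + 0.5 * dt * k2, p)
        k4 = bilayer_sirs_rhs(y + dt * k3, p)
        y = y + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        traj[k + 1] = y
    return traj

# Illustrative (not calibrated) parameters; R_0 > 1 for these values.
params = {"beta_d": 0.4, "beta_m": 0.9, "gamma_d": 0.1, "mu_d": 0.1,
          "gamma_m": 0.2, "mu_m": 0.1, "omega_d": 0.05, "omega_m": 0.05}
# Seed: 1% of corpora contaminated, all models initially clean.
y0 = [0.99, 0.01, 0.0, 1.0, 0.0, 0.0]
traj = simulate(y0, params)
print("final contaminated fractions (data, models):", traj[-1, 1], traj[-1, 4])
```

Setting `omega_d` and `omega_m` to zero recovers an SIR-style variant under the same assumptions; with waning enabled, recovered corpora and models flow back into the susceptible compartments, which is the re-contamination effect the SIRS variant is meant to capture.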
We derive the basic reproduction number $R_0 = \sqrt{\beta_D \beta_M / [(\gamma_D+\mu_D)(\gamma_M+\mu_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter.
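As a rough check of the $R_0$ expression, the sketch below builds a next-generation matrix for the two infected compartments and compares its spectral radius with the closed form. It assumes purely cross-layer transmission (each layer is infected only by the other), which is one structure that reproduces the square-root formula; the function names and parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def r0_closed_form(beta_d, beta_m, gamma_d, mu_d, gamma_m, mu_m):
    """R_0 = sqrt(beta_D * beta_M / ((gamma_D + mu_D) * (gamma_M + mu_M)))."""
    return float(np.sqrt(beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m))))

def r0_next_generation(beta_d, beta_m, gamma_d, mu_d, gamma_m, mu_m):
    """Spectral radius of K = F V^{-1} for the infected states (I_D, I_M).

    F holds new infections at the contamination-free state (assumed cross-layer
    only); V holds removal from each infected compartment.
    """
    F = np.array([[0.0, beta_d],
                  [beta_m, 0.0]])
    V = np.diag([gamma_d + mu_d, gamma_m + mu_m])
    K = F @ np.linalg.inv(V)
    return float(np.max(np.abs(np.linalg.eigvals(K))))

# Illustrative values only, not the paper's calibrated scenarios.
example = dict(beta_d=0.4, beta_m=0.9, gamma_d=0.1, mu_d=0.1, gamma_m=0.2, mu_m=0.1)
print(r0_closed_form(**example), r0_next_generation(**example))
```

The square root arises because a contamination event must pass through both layers (corpus to model and back to corpus) to complete one full cycle, so the per-layer $R_0$ is the geometric mean of the two cross-layer reproduction factors.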
A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks, but agreement degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions.
Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
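To illustrate why a removal-side parameter can carry so much leverage, the sketch below solves the $R_0$ formula for the data-layer removal rate at which $R_0 = 1$. Treating detection-based filtering as an increase in $\gamma_D$ is an assumption made here for illustration; the abstract does not state how interventions map onto parameters, and the numbers are not from the paper.

```python
import math

def r0(beta_d, beta_m, gamma_d, mu_d, gamma_m, mu_m):
    """Basic reproduction number from the formula quoted in the abstract."""
    return math.sqrt(beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m)))

def critical_gamma_d(beta_d, beta_m, mu_d, gamma_m, mu_m):
    """Smallest gamma_D giving R_0 <= 1.

    Obtained by setting beta_D * beta_M = (gamma_D + mu_D) * (gamma_M + mu_M)
    and solving for gamma_D; clipped at zero if the system is already subcritical.
    """
    return max(0.0, beta_d * beta_m / (gamma_m + mu_m) - mu_d)

# Illustrative values only.
base = dict(beta_d=0.4, beta_m=0.9, mu_d=0.1, gamma_m=0.2, mu_m=0.1)
g_star = critical_gamma_d(**base)
print("critical gamma_D:", g_star)
print("R_0 at critical gamma_D:", r0(gamma_d=g_star, **base))
```

Under this reading, any filtering rate above the critical value pushes the system below threshold, while rates below it only slow the spread.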