Epidemiology of Model Collapse | AI Deep Signal

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

6/5/2026


Quick Answer

The study introduces a bilayer SIR/SIRS model of synthetic data contamination across the AI ecosystem. It finds supercritical dynamics ($R_0 > 1$), meaning contamination is self-sustaining, and identifies detection-based filtering as a key intervention strategy.

Quick Take

Experiments with GPT-2 show dose-response degradation, underscoring the risk of model collapse from cross-contamination.

Key Points

Proposes a bilayer model treating data and AI models as interacting populations.
Identifies synthetic-text detection as the highest-leverage parameter in collapse dynamics.
Experiments show that multi-source mixing slightly mitigates collapse effects.
Detection-based filtering and herd immunity are recommended intervention strategies.
Mean-field consistency is confirmed with $R^2 > 0.96$ for dense networks.
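
To make the threshold idea behind the detection point concrete, here is a minimal sketch. It assumes a simplified model in which contamination spreads only across layers (models contaminate data, contaminated data infects models), and it encodes detection-based filtering as an extra removal rate on contaminated data. The parameter names (`beta_md`, `beta_dm`, `gamma_d`, `gamma_m`, `detection_rate`) and this way of encoding filtering are illustrative assumptions, not the paper's notation or equations. Under those assumptions, the standard next-generation-matrix result for purely cross-transmitting populations gives $R_0 = \sqrt{\beta_{md}\beta_{dm} / ((\gamma_d + \delta)\gamma_m)}$.

```python
import math


def cross_layer_r0(beta_md, beta_dm, gamma_d, gamma_m, detection_rate=0.0):
    """R0 for a toy two-layer model with cross-layer transmission only.

    Assumption: detection-based filtering simply adds `detection_rate`
    to the removal rate of contaminated data (hypothetical encoding).
    """
    effective_gamma_d = gamma_d + detection_rate
    return math.sqrt((beta_md * beta_dm) / (effective_gamma_d * gamma_m))


def critical_detection_rate(beta_md, beta_dm, gamma_d, gamma_m):
    """Smallest detection rate that brings R0 down to 1 in this toy model."""
    return max(0.0, (beta_md * beta_dm) / gamma_m - gamma_d)


if __name__ == "__main__":
    params = dict(beta_md=0.5, beta_dm=0.5, gamma_d=0.1, gamma_m=0.1)
    threshold = critical_detection_rate(**params)
    print("R0 without filtering:", cross_layer_r0(**params))
    print("critical detection rate:", threshold)
    print("R0 at the threshold:", cross_layer_r0(detection_rate=threshold, **params))
```

In this sketch the detection rate sits in the denominator of $R_0$, so raising it is one direct way to push the system from supercritical to subcritical. Whether the paper parameterizes detection this way is not stated in the excerpt.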

Paper Resources

Read Paper: arxiv.org · View PDF: arxiv.org

Source Excerpt

arXiv:2606.05168v1 Announce Type: new Abstract: Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora.

We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. …
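
The two-population structure described in the abstract can be illustrated with a small simulation. The sketch below is a toy SIR-only version written for this summary, not the paper's actual equations: it assumes contamination passes only across layers, uses hypothetical rate names (`beta_md`, `beta_dm`, `gamma_d`, `gamma_m`), omits the SIRS loss-of-immunity term, and integrates with plain forward Euler.

```python
def simulate_bilayer_sir(beta_md, beta_dm, gamma_d, gamma_m,
                         i_d0=0.01, dt=0.05, steps=4000):
    """Forward-Euler integration of a toy two-layer SIR system.

    All names are illustrative assumptions, not the paper's notation:
      beta_md  rate at which infected models contaminate susceptible data
      beta_dm  rate at which contaminated data infects susceptible models
      gamma_d  recovery (cleanup) rate of contaminated data
      gamma_m  recovery (retraining) rate of infected models
    Returns a list of (s_d, i_d, r_d, s_m, i_m, r_m) population fractions.
    """
    s_d, i_d, r_d = 1.0 - i_d0, i_d0, 0.0   # data layer starts slightly contaminated
    s_m, i_m, r_m = 1.0, 0.0, 0.0           # model layer starts fully susceptible
    trajectory = [(s_d, i_d, r_d, s_m, i_m, r_m)]
    for _ in range(steps):
        new_d = beta_md * s_d * i_m          # cross-layer: models -> data
        new_m = beta_dm * s_m * i_d          # cross-layer: data -> models
        rec_d = gamma_d * i_d
        rec_m = gamma_m * i_m
        s_d, i_d, r_d = s_d - dt * new_d, i_d + dt * (new_d - rec_d), r_d + dt * rec_d
        s_m, i_m, r_m = s_m - dt * new_m, i_m + dt * (new_m - rec_m), r_m + dt * rec_m
        trajectory.append((s_d, i_d, r_d, s_m, i_m, r_m))
    return trajectory


def peak_contaminated_data(trajectory):
    """Largest contaminated-data fraction reached along the trajectory."""
    return max(state[1] for state in trajectory)


if __name__ == "__main__":
    # In this toy model, contamination grows when beta_md * beta_dm > gamma_d * gamma_m.
    growing = simulate_bilayer_sir(0.5, 0.5, 0.1, 0.1)
    fading = simulate_bilayer_sir(0.05, 0.05, 0.1, 0.1)
    print("peak (strong coupling):", peak_contaminated_data(growing))
    print("peak (weak coupling):", peak_contaminated_data(fading))
```

With strong cross-layer coupling the contaminated fraction climbs well above its seed value before burning out; with weak coupling it only decays. That contrast is the supercritical versus subcritical distinction the summary refers to.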


