Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

arXiv cs.CL·Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen, Runcong Zhao

6/1/2026

Quick Take

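Text watermarks are fundamentally fragile when users can access several models: averaging the output probability distributions of independently watermarked models cancels the watermark signal. The proposed WASH method improves quality by 27.5% and runs 6 times faster than the best baseline on long-sequence generation. The authors conclude that robust watermark-based detection requires either accepting this vulnerability or coordination among model providers.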

Key Points

Averaging the output probability distributions of 3-5 models cancels watermark perturbations.
WASH handles vocabulary misalignment and tokenisation differences across heterogeneous models in ensemble generation.
Averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (the detection threshold is 4) and reduces TPR at 5% FPR to below 50%.
WASH improves quality by 27.5% and runs 6 times faster than the best baseline on long-sequence generation.
Robust AI-text detection via watermarking requires either unprecedented coordination among model providers or acceptance of this vulnerability.

Article Content

From source RSS / original summary

arXiv:2605.30501v1. Abstract: Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers.

We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution with up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models.
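The cancellation effect described above can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions and is not the WASH method: every "provider" shares the same base next-token distribution and the same vocabulary, and each one applies a hypothetical green-list style watermark (a logit boost `delta` on a random fraction `gamma` of tokens). The function names and parameter values are illustrative and do not come from the paper.

```python
import numpy as np


def watermark(p, delta, gamma, rng):
    """Toy watermark (assumption): boost a random 'green' subset in logit space."""
    green = rng.random(p.size) < gamma      # independent green list per provider
    q = p * np.exp(delta * green)           # add delta to the logits of green tokens
    return q / q.sum()                      # renormalize to a probability vector


def total_variation(a, b):
    """Total variation distance between two probability vectors."""
    return 0.5 * float(np.abs(a - b).sum())


def demo(vocab_size=1000, num_models=5, delta=2.0, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Shared unwatermarked distribution (assumption: identical base models).
    p = rng.dirichlet(np.ones(vocab_size))
    watermarked = [watermark(p, delta, gamma, rng) for _ in range(num_models)]
    # Linear ensemble: plain average of the watermarked distributions.
    ensemble = np.mean(watermarked, axis=0)
    single_tv = float(np.mean([total_variation(q, p) for q in watermarked]))
    ensemble_tv = total_variation(ensemble, p)
    return single_tv, ensemble_tv


if __name__ == "__main__":
    single_tv, ensemble_tv = demo()
    print(f"mean distance of one watermarked model: {single_tv:.3f}")
    print(f"distance of the averaged ensemble:      {ensemble_tv:.3f}")
```

In this toy setting the averaged distribution sits closer to the unwatermarked one than any single watermarked distribution does on average, because the independent perturbations partly cancel. Real heterogeneous models also differ in vocabulary and tokenisation, which is the practical problem the abstract says WASH addresses and which this sketch ignores.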

Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (under the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
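To make the z-score figures concrete, here is a sketch of how a detection z-score is commonly computed for green-list style watermarks. The summary does not say which six schemes were tested or how each computes its score, so the formula, the function names, and the strict `z > threshold` rule are assumptions for illustration. Only the threshold value of 4 comes from the abstract.

```python
import math


def green_list_z_score(green_count, num_tokens, gamma):
    """One-proportion z-score: how far the green-token count sits above chance.

    Assumes each token of unwatermarked text is green with probability gamma.
    """
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def is_flagged(z_score, threshold=4.0):
    """Flag text as watermarked when the z-score exceeds the threshold."""
    return z_score > threshold


if __name__ == "__main__":
    # Hypothetical counts over 200 tokens with gamma = 0.5.
    for green_count in (150, 110):
        z = green_list_z_score(green_count, 200, 0.5)
        print(green_count, round(z, 2), is_flagged(z))
```

With these hypothetical counts, 150 green tokens out of 200 gives a z-score above 4 and is flagged, while 110 out of 200 gives a z-score below 2 and is not. That second case is the regime the abstract reports after averaging across 3 models.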
