Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review
Quick Answer
This paper shows that The integration of Large Language Models (LLMs) into peer review exposes vulnerabilities to targeted attacks, prompting the introduction of PaperGuard, a benchmark designed to evaluate and defend against these multimodal adversarial manipulations.
Quick Take
The integration of Large Language Models (LLMs) into peer review exposes vulnerabilities to targeted attacks, prompting the introduction of PaperGuard, a benchmark designed to evaluate and defend against these multimodal adversarial manipulations. The framework includes a multimodal dataset, a suite of targeted attacks, and a defense mechanism using chunk-based embedding search, revealing that AI reviewers are significantly susceptible to manipulation.
Key Points
- Current AI peer-review studies focus primarily on text, neglecting multimodal vulnerabilities.
- PaperGuard features a comprehensive dataset across various scientific domains.
- The framework includes black-box and white-box attack methodologies targeting both text and figures.
- Experiments confirm that AI reviewers are widely vulnerable to domain-specific attacks.
- PaperGuard establishes essential protocols for resilient AI-assisted scholarly reviewing.
Paper Resources
Article Content
From source RSS / original summaryarXiv:2606. 12716v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only.
Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e. g. , "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks.
Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions.
Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.