Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models
Quick Take
The paper introduces BEAP, a black-box, embedding-aware adversarial prompting attack on text-to-image diffusion models from which concepts have been removed by machine unlearning. According to the abstract, BEAP improves Attack Success Rate (ASR) by more than 60% over prior methods while needing an average of fifteen prompts per successful attack. Earlier attacks either assume access to model weights or produce gibberish prompts that simple rule-based safeguards can catch; BEAP's prompts are reported to evade safety filters while still yielding high-quality images.
Key Points
- BEAP leverages a large language model for iterative adversarial prompt generation.
- Runs an embedding-aware search in text space guided by three reward signals: unlearned concept presence, text-image alignment, and image quality.
- Achieves over 60% improvement in Attack Success Rate compared to prior methods.
- Requires an average of only fifteen prompts per successful attack.
- Prompts remain undetectable to safety filters while producing high-quality images.
Article Content
From source RSS / original summary
arXiv:2605.26332v1 Announce Type: new
Abstract: Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e., they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding.
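To make "naive rule-based safeguarding" concrete, here is a minimal sketch of the kind of check that could flag gibberish prompts. The abstract does not describe any particular filter, so the function name `looks_like_gibberish`, the "word-like" rule, and the 0.3 threshold are all assumptions chosen for illustration.

```python
import re


def looks_like_gibberish(prompt: str, max_bad_ratio: float = 0.3) -> bool:
    """Flag a prompt when too many tokens do not look like ordinary words.

    Hypothetical heuristic (not from the paper): a token counts as
    "word-like" if, after stripping common punctuation, it is purely
    alphabetic, at most 20 characters long, and contains a vowel.
    """
    tokens = prompt.split()
    if not tokens:
        return False  # nothing to judge

    def word_like(token: str) -> bool:
        core = token.strip(".,;:!?\"'()")
        return (
            core.isalpha()
            and len(core) <= 20
            and re.search(r"[aeiouyAEIOUY]", core) is not None
        )

    bad = sum(1 for token in tokens if not word_like(token))
    return bad / len(tokens) > max_bad_ratio


if __name__ == "__main__":
    print(looks_like_gibberish("a watercolor painting of a quiet harbor at dawn"))
    print(looks_like_gibberish("xq# zzk@@ vprt 9f!! brkk"))
```

A rule this simple only looks at surface form, which is why fluent, natural-language prompts would pass it unchanged.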
We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts.
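The three reward signals named above can be pictured as one scalar used to rank candidate prompts. The abstract does not say how the signals are combined, so the weighted sum below is an assumption, as are the `Scores` container, the weights, the [0, 1] score range, and the helper names; the scores themselves are plain numbers here rather than outputs of real models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Per-prompt reward signals, each assumed to lie in [0, 1]."""
    concept: float    # unlearned-concept presence
    alignment: float  # text-image alignment
    quality: float    # image quality


def combined_reward(scores: Scores, weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted sum of the three signals (hypothetical combination rule)."""
    w_concept, w_alignment, w_quality = weights
    return (
        w_concept * scores.concept
        + w_alignment * scores.alignment
        + w_quality * scores.quality
    )


def best_candidate(scored: dict[str, Scores], weights=(0.5, 0.25, 0.25)) -> str:
    """Return the candidate label with the highest combined reward."""
    return max(scored, key=lambda label: combined_reward(scored[label], weights))


if __name__ == "__main__":
    candidates = {
        "candidate_a": Scores(concept=1.0, alignment=0.0, quality=0.0),
        "candidate_b": Scores(concept=0.5, alignment=1.0, quality=1.0),
    }
    print(best_candidate(candidates))
```

In an iterative setting, the top-ranked candidate and its scores would be the feedback that informs the next round of generation; a weighted sum is only one of several reasonable ways to trade the signals off.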
Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.
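The two headline numbers can be read as simple statistics over attack attempts. The sketch below assumes each attempt is recorded as a hypothetical `(succeeded, prompts_used)` pair. The abstract does not state whether "more than 60%" is a relative gain or a gain in percentage points, so both are computed.

```python
def attack_success_rate(records: list[tuple[bool, int]]) -> float:
    """Fraction of attempts that succeeded."""
    if not records:
        raise ValueError("no attack records")
    return sum(1 for succeeded, _ in records if succeeded) / len(records)


def mean_prompts_per_success(records: list[tuple[bool, int]]) -> float:
    """Average number of prompts used, over successful attempts only."""
    counts = [prompts for succeeded, prompts in records if succeeded]
    if not counts:
        raise ValueError("no successful attacks")
    return sum(counts) / len(counts)


def absolute_gain(baseline_asr: float, new_asr: float) -> float:
    """Improvement in ASR as a plain difference (percentage points / 100)."""
    return new_asr - baseline_asr


def relative_gain(baseline_asr: float, new_asr: float) -> float:
    """Improvement in ASR relative to the baseline."""
    if baseline_asr == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new_asr - baseline_asr) / baseline_asr


if __name__ == "__main__":
    # Made-up records for illustration only, not results from the paper.
    records = [(True, 10), (True, 20), (False, 30)]
    print(attack_success_rate(records))
    print(mean_prompts_per_success(records))
```

The distinction matters when reading the claim: going from an ASR of 0.25 to 0.75 is a gain of 50 percentage points but a relative gain of 200%.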
