Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation
Quick Answer
The paper introduces DAVE, a training-free method that mitigates early trajectory lock-in in text-to-image models by selectively attenuating the zero-frequency spatial average component, enhancing diversity without significant overhead.
Key Points
- DAVE improves prompt-consistent diversity in text-to-image generation.
- The method adds negligible overhead, unlike existing diversity-enhancement techniques.
- Early trajectory lock-in is identified as a key factor limiting variation.
- DAVE operates at the representation level, avoiding the expensive sampling or auxiliary optimization that existing methods typically require.
- The approach maintains competitive image quality while enhancing diversity.
Article Excerpt
arXiv:2606.06813v1
Abstract: Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead.
To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime.
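As a rough illustration of the idea in this paragraph, the NumPy sketch below treats intermediate features as a (num_tokens, channels) array, takes the DC component to be the mean over the token axis, and scales it down during the first few sampling steps. It is not the authors' implementation: the function names, the `early_steps` and `scale` parameters, their default values, and the cosine-similarity diagnostic are all assumptions made for this example, and the abstract does not say which layers or steps DAVE actually modifies.

```python
import numpy as np


def dc_component(tokens):
    """Zero-frequency spatial average: mean over the token axis.

    tokens has shape (num_tokens, channels); the result has shape (1, channels).
    """
    return tokens.mean(axis=0, keepdims=True)


def attenuate_dc(tokens, step, early_steps=5, scale=0.5):
    """Scale the DC component by `scale` during the first `early_steps` steps.

    `early_steps` and `scale` are hypothetical knobs, not values from the paper.
    The non-DC (spatially varying) part of the features is left untouched.
    """
    if step >= early_steps:
        return tokens  # outside the assumed early regime: no intervention
    dc = dc_component(tokens)
    # tokens = dc + residual, so this yields scale * dc + residual.
    return tokens - (1.0 - scale) * dc


def cross_seed_dc_similarity(features):
    """Mean pairwise cosine similarity of DC vectors across seeds.

    features has shape (num_seeds, num_tokens, channels). Values near 1 mean
    the seeds share almost the same DC vector (an assumed proxy for lock-in).
    """
    dc = features.mean(axis=1)                           # (num_seeds, channels)
    dc = dc / np.linalg.norm(dc, axis=1, keepdims=True)  # unit-normalize rows
    sim = dc @ dc.T                                      # pairwise cosines
    off_diagonal = ~np.eye(sim.shape[0], dtype=bool)     # drop self-similarity
    return float(sim[off_diagonal].mean())
```

In a real pipeline, something like `attenuate_dc` would presumably be applied to the features of selected Transformer blocks inside the existing sampling loop, which fits the claim that the sampler itself is left unchanged. The extra cost of this sketch is one mean and one subtraction per call.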
DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.