CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers
Quick Answer
CoReDiT introduces a token pruning framework for Diffusion Transformers, achieving up to 55% reduction in self-attention FLOPs and 1.72x inference speedup on mobile NPUs.
Quick Take
CoReDiT skips spatially redundant tokens in self-attention and reconstructs their outputs from neighboring retained tokens. Beyond the compute savings, this frees on-device memory headroom, enabling higher-resolution image generation while maintaining visual quality across diffusion backbones such as PixArt-α and MagicDrive-V2.
Key Points
- CoReDiT uses a linear-time spatial coherence score for token pruning.
- Achieves 1.33x speedup on cloud GPUs and 1.72x on mobile NPUs.
- Maintains high visual quality while reducing computational costs.
- Enables higher-resolution generation by increasing on-device memory headroom.
- A progressive, block-adaptive pruning schedule increases pruning gradually and gives larger pruning budgets to blocks and denoising steps with higher redundancy.
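The first key point can be illustrated with a small sketch. The summary does not give the exact definition of the coherence score, so the version below is an assumption: it scores each token by its mean cosine similarity to its 4-connected neighbors on the H×W latent lattice, which costs a constant amount of work per token and is therefore linear in the token count. The names `spatial_coherence`, `skip_mask`, and `prune_ratio` are hypothetical and not taken from the paper.

```python
import numpy as np


def spatial_coherence(tokens: np.ndarray) -> np.ndarray:
    """Mean cosine similarity of each token to its 4-connected neighbours.

    tokens: (H, W, C) latent token lattice. Returns an (H, W) score map.
    ASSUMPTION: this is one plausible linear-time score, not the paper's.
    """
    norm = np.linalg.norm(tokens, axis=-1, keepdims=True)
    unit = tokens / np.maximum(norm, 1e-8)  # unit vectors: dot product = cosine
    H, W, _ = tokens.shape
    total = np.zeros((H, W))
    count = np.zeros((H, W))

    # Similarity between vertically adjacent tokens, credited to both ends.
    sim_v = np.sum(unit[:-1] * unit[1:], axis=-1)  # (H-1, W)
    total[:-1] += sim_v
    count[:-1] += 1
    total[1:] += sim_v
    count[1:] += 1

    # Similarity between horizontally adjacent tokens, credited to both ends.
    sim_h = np.sum(unit[:, :-1] * unit[:, 1:], axis=-1)  # (H, W-1)
    total[:, :-1] += sim_h
    count[:, :-1] += 1
    total[:, 1:] += sim_h
    count[:, 1:] += 1

    return total / np.maximum(count, 1)


def skip_mask(scores: np.ndarray, prune_ratio: float) -> np.ndarray:
    """Mark the highest-coherence (most redundant) tokens as skipped."""
    flat = scores.ravel()
    k = int(prune_ratio * flat.size)
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # argpartition finds the k largest scores without a full sort.
        mask[np.argpartition(flat, -k)[-k:]] = True
    return mask.reshape(scores.shape)
```

In this sketch, tokens in flat, uniform regions receive scores near 1 and are skipped first, while tokens near edges or texture score lower and stay in self-attention.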
Article Excerpt
From the source RSS / original summary (arXiv:2605.14191v1). Abstract: Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high-coherence (redundant) tokens in self-attention.
To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy.
Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.
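The reconstruction and scheduling ideas in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' method: skipped outputs are filled with a similarity-weighted average of the attention outputs of their retained 4-connected neighbors, and per-block pruning ratios scale with a per-block redundancy estimate and a ramp variable. The names `reconstruct_skipped`, `pruning_budgets`, `progress`, and `max_ratio`, as well as the zero fallback, are hypothetical.

```python
import numpy as np

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def reconstruct_skipped(x, attn_out, skipped):
    """Fill attention outputs at skipped positions from retained neighbours.

    x:        (H, W, C) attention inputs, used only to compute weights.
    attn_out: (H, W, C) attention outputs, valid where `skipped` is False.
    skipped:  (H, W) boolean mask of tokens left out of self-attention.
    ASSUMPTION: weights are non-negative cosine similarities of the inputs.
    """
    H, W, C = x.shape
    unit = x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-8)
    out = attn_out.astype(float)  # astype returns a copy
    for i, j in zip(*np.nonzero(skipped)):
        acc = np.zeros(C)
        wsum = 0.0
        for di, dj in NEIGHBOURS:
            ni, nj = i + di, j + dj
            if 0 <= ni < H and 0 <= nj < W and not skipped[ni, nj]:
                w = max(float(unit[i, j] @ unit[ni, nj]), 0.0)
                acc += w * attn_out[ni, nj]
                wsum += w
        # ASSUMPTION: with no usable neighbour, contribute nothing (zeros).
        out[i, j] = acc / wsum if wsum > 0 else 0.0
    return out


def pruning_budgets(block_redundancy, progress, max_ratio=0.5):
    """Per-block pruning ratios that grow with `progress` and redundancy.

    block_redundancy: per-block redundancy estimates (e.g. mean coherence).
    progress:         ramp position in [0, 1]; 0 means no pruning yet.
    The most redundant block receives max_ratio * progress.
    """
    r = np.clip(np.asarray(block_redundancy, dtype=float), 0.0, None)
    if r.max() <= 0:
        return np.zeros_like(r)
    return max_ratio * float(np.clip(progress, 0.0, 1.0)) * r / r.max()
```

Keeping the output dense in this way means later layers see a full token lattice, which is the property the abstract ties to avoiding visual discontinuities.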