Chorus II: Cross-Request Sparsity Reuse for Efficient Image-to-Video Generation
Quick Answer
Chorus II introduces cross-request sparsity reuse for image-to-video (I2V) generation. It reuses sparse attention masks from similar historical requests, which nearly eliminates online mask-prediction overhead and yields a 2.16× speedup.
Quick Take
Sparsity reuse improves efficiency while preserving generation quality, which addresses the computational cost of serving diffusion models for I2V in large-scale deployments.
Key Points
- Chorus II reuses sparse attention masks from similar historical requests instead of predicting a mask online for every request.
- The default sparsity reuse configuration reaches a 2.16× speedup while preserving generation quality.
- Guidance enhancement reinforces image/text conditioning after reuse, mitigating semantic drift and condition-adherence issues.
- Feature reuse is an optional extension that applies downsampled computation to highly redundant spatiotemporal regions.
- Addresses computational challenges in large-scale diffusion model deployments.
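To make the first key point concrete, here is a minimal sketch of what applying a block-level sparse attention mask means. It is not the paper's implementation: the function names, the block granularity, and the toy single-head setup are assumptions for illustration. Each query block attends only to the key blocks that the mask keeps, so the attention work scales with the mask's density.

```python
import math


def block_sparse_attention(q, k, v, block_mask, block_size):
    """Toy single-head attention restricted by a block mask.

    q, k, v: lists of float vectors, one per token.
    block_mask[i][j] is True when query block i may attend to key block j.
    All names and the block layout are illustrative assumptions.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi, q_vec in enumerate(q):
        q_block = qi // block_size
        # Keep only key positions whose block is enabled for this query block.
        kept = [kj for kj in range(len(k)) if block_mask[q_block][kj // block_size]]
        if not kept:
            raise ValueError(f"query block {q_block} has no kept key blocks")
        scores = [scale * sum(a * b for a, b in zip(q_vec, k[kj])) for kj in kept]
        # Numerically stable softmax over the kept positions only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[kj][d] for w, kj in zip(weights, kept)) / total
            for d in range(len(v[0]))
        ])
    return out


def mask_density(block_mask):
    """Fraction of (query block, key block) pairs that are kept."""
    flat = [kept for row in block_mask for kept in row]
    return sum(flat) / len(flat)
```

For example, a 2×2 block mask `[[True, False], [True, True]]` has density 0.75, so a kernel that skips masked blocks would compute roughly three quarters of the dense attention scores.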
Paper Resources
Abstract: Serving diffusion models for image-to-video generation is computationally expensive, posing significant challenges for large-scale deployment. Real I2V workloads often contain similar requests, such as repeated effect templates, related subjects, and recurring shot layouts. Existing cross-request acceleration methods mainly exploit this redundancy through feature reuse. We observe that similar I2V requests also share highly consistent sparse attention patterns, enabling historical sparse masks to serve as request-conditioned priors with almost no online mask-prediction overhead. We propose a cross-request reuse framework centered on sparsity reuse, with feature reuse as an optional extension safeguarded by a lightweight guidance enhancement. Our sparsity reuse is implemented as shared sparse mask reuse, which reuses high-quality sparse masks from similar historical requests to avoid per-request online mask prediction. Optional feature reuse applies downsampled computation to highly redundant spatiotemporal regions, mitigating boundary artifacts while preserving efficiency gains. Guidance enhancement reinforces image/text conditioning after reuse, mitigating semantic drift and condition-adherence issues. Experiments show that default sparsity reuse configuration preserves generation quality with a 2.16× speedup.
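The abstract does not specify how similar requests are matched or how masks are stored. As a hedged sketch of the general idea, the snippet below keeps masks from earlier requests in a cache keyed by a request embedding and falls back to online prediction on a miss. The embedding, the cosine-similarity test, the `threshold` value, and all names (`MaskCache`, `get_mask`, `predict_mask`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MaskCache:
    """Stores (request embedding, sparse mask) pairs from past requests."""
    threshold: float = 0.9  # assumed similarity cutoff, not from the paper
    entries: list = field(default_factory=list)

    def add(self, embedding, mask):
        self.entries.append((list(embedding), mask))

    def lookup(self, embedding):
        """Return the mask of the most similar past request, or None."""
        best_mask, best_sim = None, -1.0
        for past_embedding, mask in self.entries:
            sim = cosine(embedding, past_embedding)
            if sim > best_sim:
                best_mask, best_sim = mask, sim
        return best_mask if best_sim >= self.threshold else None


def get_mask(cache, embedding, predict_mask):
    """Reuse a cached mask when possible; otherwise predict and store one.

    Returns (mask, reused), where reused is True on a cache hit.
    """
    mask = cache.lookup(embedding)
    if mask is not None:
        return mask, True
    mask = predict_mask(embedding)  # the per-request online cost being avoided
    cache.add(embedding, mask)
    return mask, False
```

In this sketch a request that closely resembles an earlier one skips `predict_mask` entirely, which is the overhead the abstract says sparsity reuse avoids; a dissimilar request pays the prediction cost once and then seeds the cache for later requests.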
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2606.25040 [cs.CV] (or arXiv:2606.25040v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.25040 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Hao Liu
[v1] Tue, 23 Jun 2026 18:00:55 UTC (4,228 KB)
— Originally published at arxiv.org