Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance
Quick Take
A tuning-free, instruction-based video editing framework improves visual quality using structural noise initialization and guidance.
Key Points
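- Introduces a Structural Noise Initialization Strategy (SNIS) that assigns higher noise levels to edited regions and lower noise levels to unedited regions.
- Employs a Noise Guidance Mechanism (NGM) that guides denoising to preserve unedited content and overall visual coherence.
- Reports better visual quality and state-of-the-art performance in the authors' experiments.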
Abstract: Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within noisy latents, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of noisy latents. First, we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). Second, we introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and integrates the rich information within the noisy latents to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.
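The region-dependent noise idea behind SNIS can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the noise schedule, the mask source, or the latent layout, so the function name `structural_noise_init`, the parameters `alpha_edit` and `alpha_keep`, the `(T, C, H, W)` latent shape, and the variance-preserving mixing formula are all assumptions made for illustration.

```python
import numpy as np


def structural_noise_init(latent, edit_mask, alpha_edit=0.1, alpha_keep=0.9, seed=0):
    """Hypothetical sketch of region-dependent noise initialization.

    latent:    clean video latent, shape (T, C, H, W)  (assumed layout)
    edit_mask: values in [0, 1], shape (T, 1, H, W); 1 marks regions to edit
    alpha_*:   fraction of signal variance retained (assumed parameters)
    """
    rng = np.random.default_rng(seed)
    # Signal retention per position: low where content should change,
    # high where content should be preserved.
    alpha = edit_mask * alpha_edit + (1.0 - edit_mask) * alpha_keep
    eps = rng.standard_normal(latent.shape)
    # Variance-preserving mix of signal and Gaussian noise (an assumption,
    # borrowed from the standard diffusion forward process).
    noisy = np.sqrt(alpha) * latent + np.sqrt(1.0 - alpha) * eps
    return noisy, alpha
```

Calling the function with a mask that is 1 over the region to be edited yields a starting latent that is mostly noise there and mostly original signal elsewhere, which matches the abstract's description of giving edited regions room to change while keeping unedited regions consistent. How the real method derives the mask from the instruction is not stated in the abstract.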
| Comments: | Accepted by ICIP 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.15533 [cs.CV] (or arXiv:2605.15533v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15533 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Song Wu
[v1] Fri, 15 May 2026 02:09:06 UTC (866 KB)
— Originally published at arxiv.org