CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation
Quick Answer
CineOrchestra introduces a unified video diffusion model that simultaneously controls subjects, events, cameras, and shot transitions, outperforming six specialized models on new benchmarks for dense caption following and shot-transition timing.
Quick Take
The model uses two parameter-free rotary embeddings: one keeps attention behavior consistent across events of widely varying duration, and the other routes each per-entity condition to its spatiotemporal region. It also shows consistent gains in a pairwise user study.
Key Points
- CineOrchestra jointly integrates multi-subject personalization, temporal control, multi-shot synthesis, and camera control.
- Uses two parameter-free rotary embeddings: one for consistent attention across events of widely varying duration, one for routing per-entity conditions to their spatiotemporal regions.
- Outperforms six per-axis specialists on dense caption following and shot-transition timing on two new benchmarks.
- Shows consistent gains in a pairwise user study, supported by component ablations.
Article Content
Source: arXiv:2606.13768v1 (Announce Type: new)
Abstract: Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four.
We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities.
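The shared structure described above, an entity acting over a temporal interval with an optional reference image, can be pictured as a small data type. The following is a minimal sketch only: the class name `EntityCondition`, its fields, the use of seconds as the time unit, and the example file name are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EntityCondition:
    """Hypothetical conditioning primitive: one entity acting over one interval."""
    kind: str                 # "subject", "event", "camera", or "transition"
    description: str          # text describing the entity or what it does
    start: float              # interval start (seconds, an assumed unit)
    end: float                # interval end (seconds, an assumed unit)
    reference_image: Optional[str] = None  # only meaningful for visual entities

    def active_at(self, t: float) -> bool:
        # The primitive applies only inside its own closed interval.
        return self.start <= t <= self.end


def active_conditions(conditions: List[EntityCondition], t: float) -> List[EntityCondition]:
    """Return every primitive whose interval covers time t, in input order."""
    return [c for c in conditions if c.active_at(t)]


# Illustrative scene: all four kinds of cinematic element share one structure.
shot = [
    EntityCondition("subject", "a violinist in a red coat", 0.0, 8.0, "violinist.png"),
    EntityCondition("event", "the violinist raises her bow", 1.0, 2.5),
    EntityCondition("camera", "slow dolly-in", 0.0, 4.0),
    EntityCondition("transition", "hard cut to a wide shot", 4.0, 4.0),
]

print([c.kind for c in active_conditions(shot, 2.0)])
```

In this sketch a shot transition is modeled as a zero-length interval, which is one possible convention rather than something the abstract specifies.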
This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region.
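The abstract does not give the formulas for either embedding, so the sketch below only illustrates the general idea using standard rotary position embedding (RoPE), which has no learned parameters. The even sampling rule in `sample_interval`, the half-and-half split in `rope_2d`, and all function names are assumptions made for illustration; they should not be read as the authors' method.

```python
import math
from typing import List


def sample_interval(start: float, end: float, n: int) -> List[float]:
    """Place n token positions evenly across [start, end] (assumed sampling rule)."""
    if n == 1:
        return [(start + end) / 2.0]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def rope_rotate(vec: List[float], pos: float, base: float = 10000.0) -> List[float]:
    """Standard RoPE: rotate each consecutive pair of dimensions by pos * frequency."""
    dim = len(vec)
    out: List[float] = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # fixed frequency per pair, nothing learned
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec: List[float], entity_id: int, t: float) -> List[float]:
    """Assumed 2D variant: first half encodes the entity index, second half the time."""
    half = len(vec) // 2
    return rope_rotate(vec[:half], float(entity_id)) + rope_rotate(vec[half:], t)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# A 1-second event and an 8-second event both receive four token positions.
short_positions = sample_interval(1.0, 2.0, 4)
long_positions = sample_interval(0.0, 8.0, 4)

# RoPE scores depend only on the position difference (here 2 in both cases).
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
score_a = dot(rope_rotate(q, 3.0), rope_rotate(k, 1.0))
score_b = dot(rope_rotate(q, 5.0), rope_rotate(k, 3.0))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because rotary embeddings make the query-key dot product a function of relative position. In the assumed 2D variant the same property holds along both the entity axis and the time axis, which is one way a condition could be tied to its own entity and interval.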
On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.