Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement and stitched together by shot transitions. Together, these elements demand fine-grained control beyond what current text-to-video models provide. Existing work addresses each axis in isolation (multi-subject personalization, temporal control, multi-shot synthesis, or camera control), and no prior framework integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a common structure: each is an entity acting over a specific temporal interval. All of them can therefore be expressed through a single set of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two coordinated, parameter-free rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of widely varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and in component ablations.
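As an illustration of the two rotary embeddings named in the abstract, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not give the sampling scheme or the channel layout, so several details here are assumptions: the function names (`apply_rope`, `interval_positions`, `apply_entity_temporal_rope`) are hypothetical, positions are sampled evenly over each event interval, and the 2D variant simply rotates one half of the channels by entity index and the other half by time. Only the underlying rotary embedding is standard.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Standard RoPE: rotate consecutive feature pairs of x (n, dim) by
    angles proportional to each row's position. Has no learned parameters."""
    x = np.asarray(x, dtype=float)
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    ang = np.outer(positions, freqs)                # (n, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def interval_positions(start, end, n_samples):
    """ASSUMPTION: one reading of 'interval-sampled' is a fixed number of
    temporal positions spread evenly over the event interval [start, end],
    so short and long events receive the same number of positions."""
    return np.linspace(start, end, n_samples)

def apply_entity_temporal_rope(x, entity_ids, times):
    """ASSUMPTION: a 2D RoPE that rotates the first half of the channels by
    entity index and the second half by temporal position."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [apply_rope(x[..., :half], entity_ids),
         apply_rope(x[..., half:], times)],
        axis=-1,
    )

# Usage: a short event and a long event each get five condition tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
short_event = apply_rope(tokens, interval_positions(0, 8, 5))
long_event = apply_rope(tokens, interval_positions(10, 90, 5))
routed = apply_entity_temporal_rope(tokens, entity_ids=[1] * 5,
                                    times=interval_positions(0, 8, 5))
```

Because rotary embeddings make attention scores depend only on relative position, a condition token placed inside an event's interval lines up most closely with the video tokens at nearby frame positions, which is the routing behavior the abstract describes; the exact mechanism in CineOrchestra may differ from this sketch.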