CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond what current text-to-video models provide. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, so all of them can be expressed through one shared set of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free, coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.
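
To make the "entity acting over a temporal interval" idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how the primitives or the interval-sampled temporal RoPE are built, so everything here is an assumption for illustration: the `EntityCondition` fields, the choice to spread a fixed number of positions evenly over each interval, and the use of a standard 1D rotary rotation. The 2D entity-temporal variant is not sketched.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityCondition:
    """Hypothetical conditioning primitive: one entity active over one interval."""
    entity_id: int
    kind: str                  # e.g. "subject", "event", "camera", "transition"
    caption: str
    start: float               # interval start, in frame units (assumed)
    end: float                 # interval end, in frame units (assumed)
    reference_image: Optional[str] = None  # only for visual entities


def interval_positions(start: float, end: float, num_tokens: int) -> List[float]:
    """Spread a fixed number of positions evenly over [start, end] (assumed scheme)."""
    if num_tokens == 1:
        return [0.5 * (start + end)]
    step = (end - start) / (num_tokens - 1)
    return [start + i * step for i in range(num_tokens)]


def rope_rotate(vec: List[float], position: float, base: float = 10000.0) -> List[float]:
    """Standard 1D rotary embedding: rotate each consecutive pair of features."""
    dim = len(vec)  # must be even
    out: List[float] = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out.extend([vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c])
    return out


# A short and a long event receive the same number of position samples.
wave = EntityCondition(0, "event", "the woman waves", start=0.0, end=4.0)
pan = EntityCondition(1, "camera", "slow pan left", start=10.0, end=50.0)
wave_pos = interval_positions(wave.start, wave.end, 5)
pan_pos = interval_positions(pan.start, pan.end, 5)
rotated = [rope_rotate([1.0, 0.0, 0.5, -0.5], p) for p in wave_pos]
```

Under this assumed scheme, the condition tokens of an event always cover its interval end to end, whatever its duration, which is one plausible reading of "consistent attention behavior across events of dramatically varying duration".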

