While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, yielding suboptimal identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to strengthen global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings that explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independently controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm that trains a latent identity reward model on top of a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
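To make the group-and-role anchoring idea concrete, the sketch below shows one plausible realization: learnable group embeddings (one per subject) and role embeddings (e.g., appearance token vs. motion token) are added to per-token features so that each motion signal is explicitly bound to a specific identity before entering the diffusion backbone. This is a minimal illustrative sketch under our own assumptions, not the paper's actual architecture; the class name, index layout, and additive injection are all hypothetical.

```python
import torch
import torch.nn as nn


class GroupRoleEmbedding(nn.Module):
    """Hypothetical sketch of group/role conditioning for multi-subject control.

    Each token carries a group id (which subject it belongs to) and a role id
    (what kind of signal it is, e.g. 0 = appearance, 1 = motion). Adding both
    embeddings to the token features lets attention layers downstream associate
    motion signals with the correct subject identity.
    """

    def __init__(self, dim: int, num_groups: int, num_roles: int):
        super().__init__()
        self.group_emb = nn.Embedding(num_groups, dim)  # one vector per subject
        self.role_emb = nn.Embedding(num_roles, dim)    # one vector per signal type

    def forward(
        self,
        tokens: torch.Tensor,     # (B, N, D) token features
        group_ids: torch.Tensor,  # (B, N) subject index per token
        role_ids: torch.Tensor,   # (B, N) signal-type index per token
    ) -> torch.Tensor:
        # Additive injection: anchor every token to its subject and role.
        return tokens + self.group_emb(group_ids) + self.role_emb(role_ids)


# Toy usage: 2 videos, 8 tokens each, up to 4 subjects, 2 roles.
emb = GroupRoleEmbedding(dim=64, num_groups=4, num_roles=2)
x = torch.randn(2, 8, 64)
gids = torch.randint(0, 4, (2, 8))
rids = torch.randint(0, 2, (2, 8))
out = emb(x, gids, rids)  # same shape as x: (2, 8, 64)
```

In this reading, two motion tokens with identical content but different group ids receive distinct offsets, which is what disentangles a multi-subject scene into independently controllable instances.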