Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV), and audio-driven video animation (RA2V), as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. In addition, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
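To give intuition for the signal-level binding mechanism, the following is a minimal sketch of what "Synchronized RoPE" could look like in isolation. It assumes the idea is that audio and video token streams reuse the *same* temporal position indices (rather than being concatenated into one long index range), so tokens of the same instant receive identical rotations and their cross-modal attention logit is free of an artificial positional offset. All names and shapes here are illustrative, not the paper's actual implementation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE: one rotation angle per (position, frequency) pair."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (n_pos, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical setup: 4 video frames and 4 audio chunks covering the same
# timeline. "Synchronized" means both modalities share the indices 0..3
# instead of the video taking 0..3 and the audio 4..7.
dim = 8
t = np.arange(4)                       # shared temporal indices
video_q = np.random.randn(4, dim)      # video queries
audio_k = np.random.randn(4, dim)      # audio keys

angles = rope_angles(t, dim)           # same angles for both streams
q_rot = apply_rope(video_q, angles)
k_rot = apply_rope(audio_k, angles)

# Tokens at the same time step get the same block-diagonal rotation, which
# cancels in the dot product, so the attention logit depends only on content:
i = 2
assert np.allclose(q_rot[i] @ k_rot[i], video_q[i] @ audio_k[i])
```

Because each RoPE rotation is orthogonal, synchronizing indices guarantees that same-instant audio and video tokens attend to each other as if unrotated, which is one plausible reading of the "rigid attention-space binding" claimed above.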