Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
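The two mechanisms named above (the bidirectional, temporally aligned cross-attention and the modality-aware guidance) can be illustrated with brief sketches. The first is a minimal, hypothetical sketch of the Asymmetric Cross-Modal Interaction idea: two parallel DiT branches exchange information through bidirectional cross-attention, optionally masked so that tokens only attend across modalities within the same time window. The class name `CrossModalBlock`, the argument names, and the masking scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of bidirectional, temporally aligned cross-attention
# between a video DiT branch and an audio DiT branch. Assumes both token
# streams are already projected to a shared hidden width d.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    def __init__(self, d: int, n_heads: int = 8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d, n_heads, batch_first=True)  # video queries audio
        self.v2a = nn.MultiheadAttention(d, n_heads, batch_first=True)  # audio queries video
        self.norm_v = nn.LayerNorm(d)
        self.norm_a = nn.LayerNorm(d)

    def forward(self, video_tok, audio_tok, align_mask=None):
        # align_mask (optional, shape [T_video, T_audio], True = blocked):
        # restricts each video token to audio tokens covering the same time
        # span, enforcing temporal alignment; its transpose is used for the
        # reverse direction.
        v_n, a_n = self.norm_v(video_tok), self.norm_a(audio_tok)
        v_upd, _ = self.a2v(v_n, a_n, a_n, attn_mask=align_mask)
        a_upd, _ = self.v2a(
            a_n, v_n, v_n,
            attn_mask=align_mask.transpose(-1, -2) if align_mask is not None else None,
        )
        # Residual updates keep each branch's own latent stream intact.
        return video_tok + v_upd, audio_tok + a_upd
```

The second sketch illustrates one plausible reading of Modality-Aware Classifier-Free Guidance: the guided prediction adds an extra term that amplifies the contribution of the other modality's conditioning on top of standard text guidance. The function `denoise`, its signature, and the weights `w_text` and `w_audio` are assumptions for illustration only.

```python
# Hedged sketch of a modality-aware CFG step for the video branch, assuming
# denoise(x_t, text, audio) returns a noise prediction and that dropping a
# condition is expressed by passing None.
def modality_aware_cfg(denoise, x_t, text, audio, w_text=5.0, w_audio=2.0):
    eps_uncond = denoise(x_t, text=None, audio=None)   # fully unconditional
    eps_text   = denoise(x_t, text=text, audio=None)   # text only
    eps_full   = denoise(x_t, text=text, audio=audio)  # text + cross-modal audio
    # Standard text guidance plus an explicit term that amplifies the
    # cross-modal (audio-conditioned) correlation signal.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_audio * (eps_full - eps_text))
```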



