As naturally multimodal content, audible video delivers an immersive sensory experience, so audio-video generation systems hold substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules to generate each modality, leaving shared-weight generative modules largely unexplored. This design may underexploit the intrinsic correlations between the audio and visual modalities, potentially resulting in suboptimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality, well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method on joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.
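To make the core idea concrete, below is a minimal sketch (not the authors' code) of a shared-weight transformer denoising concatenated audio and video latent tokens as one sequence. All module names, shapes, and hyperparameters are illustrative assumptions; the actual UniForm architecture is described in the paper.

```python
# Minimal sketch of a shared-weight audio-video denoiser.
# Shapes, names, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class SharedAVTransformer(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        # Learned type embeddings let the shared backbone tell the
        # modalities apart after concatenation.
        self.audio_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.video_type = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        # One weight-shared stack attends over both modalities jointly.
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)  # per-token noise prediction

    def forward(self, audio_tokens, video_tokens):
        # audio_tokens: (B, Ta, dim), video_tokens: (B, Tv, dim)
        x = torch.cat(
            [audio_tokens + self.audio_type, video_tokens + self.video_type],
            dim=1,
        )  # one unified sequence of length Ta + Tv
        eps = self.head(self.backbone(x))
        Ta = audio_tokens.size(1)
        return eps[:, :Ta], eps[:, Ta:]  # split back per modality

model = SharedAVTransformer()
a = torch.randn(2, 16, 256)  # toy audio latents
v = torch.randn(2, 64, 256)  # toy video latents
eps_a, eps_v = model(a, v)
print(eps_a.shape, eps_v.shape)
```

Because both modalities pass through the same attention layers, every audio token can attend to every video token (and vice versa) at each layer, which is what allows a single set of weights to enforce cross-modal alignment.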