This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized joint audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism via a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors that guide synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. We also devise a robust metric for evaluating the synchronization between generated audio-video pairs on such complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods, ensuring both high-quality generation and precise synchronization and setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.