This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for joint audio-video generation (JAVG). Built on the powerful Diffusion Transformer (DiT) architecture, JavisDiT simultaneously generates high-quality audio and video content from open-ended user prompts within a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism via a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors to guide synchronization between the visual and auditory components. We further propose JavisBench, a new benchmark of 10,140 high-quality text-captioned sounding videos, which focuses on synchronization evaluation in diverse and complex real-world scenarios. We also devise a robust metric for measuring the synchrony between generated audio-video pairs in real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods, ensuring both high-quality generation and precise synchronization and setting a new standard for JAVG tasks. Our code, model, and data are available at https://javisverse.github.io/JavisDiT-page/.