This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, equipped with a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge a pretrained JAV-DiT generator. This design enables temporally coherent audio-video understanding and generation from multimodal instructions. We design an effective three-stage training pipeline, consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning, which progressively builds multimodal comprehension and generation capabilities on top of existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset of over 200K GPT-4o-curated audio-video-text dialogues covering diverse and multi-level comprehension and generation scenarios. Experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
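To make the encoder-LLM-decoder layout concrete, the sketch below illustrates one plausible reading of the abstract in PyTorch: fused audio-video tokens enter the LLM, and a set of synchrony-aware learnable queries reads the LLM's hidden states to produce conditioning for the pretrained JAV-DiT generator. The module names `SyncFusion` and the learnable queries come from the abstract, but the specific layer choices (cross-attention fusion, a single query-reader attention layer) and all signatures are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SyncFusion(nn.Module):
    """Assumed spatio-temporal audio-video fusion block, sketched as bidirectional cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attend each stream to the other, then concatenate along the token axis.
        v_fused, _ = self.a2v(video_tokens, audio_tokens, audio_tokens)
        a_fused, _ = self.v2a(audio_tokens, video_tokens, video_tokens)
        return torch.cat([v_fused, a_fused], dim=1)


class JavisGPTSketch(nn.Module):
    """Encoder-LLM-decoder sketch: fused AV tokens are prepended with text embeddings and fed
    to the LLM; synchrony-aware learnable queries then read the LLM hidden states to form the
    conditioning signal passed to a (frozen, pretrained) JAV-DiT generator."""

    def __init__(self, llm: nn.Module, dim: int = 1024, num_queries: int = 64):
        super().__init__()
        self.fusion = SyncFusion(dim)
        self.llm = llm  # assumed: a module mapping (B, T, dim) embeddings to (B, T, dim) hidden states
        self.sync_queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.query_reader = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, video_tokens, audio_tokens, text_embeds):
        av = self.fusion(video_tokens, audio_tokens)
        hidden = self.llm(torch.cat([text_embeds, av], dim=1))   # multimodal context
        q = self.sync_queries.expand(hidden.size(0), -1, -1)
        cond, _ = self.query_reader(q, hidden, hidden)           # conditioning for the JAV-DiT decoder
        return hidden, cond
```

Under this reading, the three-stage pipeline would progressively unfreeze components: the fusion module and queries during multimodal pretraining and audio-video fine-tuning, then the LLM (or adapters) during large-scale instruction tuning; the actual schedule is specified in the paper, not here.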