In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are publicly available to facilitate further research.
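To make the STC connector concrete, the sketch below shows one plausible reading of the idea: a 3D convolution that downsamples per-frame vision-encoder features jointly in space and time, followed by an MLP that projects the compressed tokens into the LLM embedding space. This is a minimal illustration, not the paper's exact architecture; the class name, layer shapes, kernel sizes, and dimensions are assumptions chosen for clarity (the released model additionally uses convolutional refinement blocks around the downsampling step).

```python
import torch
import torch.nn as nn

class STCConnector(nn.Module):
    """Minimal sketch of a spatial-temporal convolution connector.

    Hypothetical simplification: a 3D convolution downsamples the grid of
    frame features in both space and time, then an MLP projects the
    resulting tokens into the LLM embedding space. All sizes are
    illustrative, not the paper's configuration.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096,
                 spatial_stride=2, temporal_stride=2):
        super().__init__()
        # 3D conv jointly downsamples time (T) and space (H, W).
        self.conv = nn.Conv3d(
            vision_dim, vision_dim,
            kernel_size=(temporal_stride, spatial_stride, spatial_stride),
            stride=(temporal_stride, spatial_stride, spatial_stride),
        )
        # Two-layer MLP projector into the LLM token space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (B, T, H, W, C) patch features from a frame-level vision encoder.
        x = x.permute(0, 4, 1, 2, 3)                 # -> (B, C, T, H, W)
        x = self.conv(x)                             # downsample space and time
        x = x.permute(0, 2, 3, 4, 1).flatten(1, 3)   # -> (B, T'*H'*W', C)
        return self.proj(x)                          # -> (B, N, llm_dim)


if __name__ == "__main__":
    # 8 frames of 24x24 patch features with dim 1024 (illustrative sizes).
    feats = torch.randn(1, 8, 24, 24, 1024)
    tokens = STCConnector()(feats)
    print(tokens.shape)  # torch.Size([1, 576, 4096]) after 2x2x2 downsampling
```

The design motivation is token economy: downsampling space and time together cuts the number of visual tokens fed to the LLM by a factor of spatial_stride² × temporal_stride (8× here) while letting the convolution aggregate local spatio-temporal context rather than discarding it.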