Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of the two input streams remains under-explored, even though it is both challenging and necessary for LLMs to understand general video inputs. To this end, this paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module that enhances the capture of causal relations among audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, comprising six representative single-modal tasks and five cross-modal tasks that reflect audio-visual co-reasoning abilities. While achieving competitive single-modal performance on the audio, speech and image tasks in AVEB, FAVOR achieves accuracy improvements of over 20% on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrates remarkable video comprehension and reasoning abilities on tasks not previously demonstrated by other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
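To make the idea of causal attention over audio-visual frames concrete, the following is a minimal sketch and not the FAVOR implementation. It assumes the audio and visual features are already time-synchronised at the frame level, fuses them by simple concatenation, and applies single-head self-attention without learned projections, so that frame t can only attend to frames 0..t. The function names (fuse_frames, causal_mask, causal_self_attention) and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def fuse_frames(audio: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Concatenate per-frame audio and visual features (assumed synchronised).

    audio:  (T, d_a) array, one row per frame
    visual: (T, d_v) array, one row per frame
    returns (T, d_a + d_v) joint frame features
    """
    return np.concatenate([audio, visual], axis=-1)


def causal_mask(num_frames: int) -> np.ndarray:
    """Lower-triangular boolean mask: frame t may attend to frames 0..t."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def causal_self_attention(frames: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention with a causal mask.

    Simplification (an assumption of this sketch): queries, keys and values
    are the input frames themselves, with no learned projection matrices.
    """
    num_frames, dim = frames.shape
    scores = frames @ frames.T / np.sqrt(dim)
    # Block attention to future frames by setting their scores to -inf.
    scores = np.where(causal_mask(num_frames), scores, -np.inf)
    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ frames


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 3))   # 4 frames, 3-dim audio features
    visual = rng.normal(size=(4, 5))  # 4 frames, 5-dim visual features
    joint = fuse_frames(audio, visual)
    print(causal_self_attention(joint).shape)
```

Because of the mask, the output for a given frame depends only on that frame and earlier ones, so modifying a later frame leaves all earlier outputs unchanged. The abstract does not specify how the causal attention module is combined with the Q-Former, so this sketch only illustrates the masking principle.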