Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of the two input streams remains under-explored, even though it is both challenging and necessary for LLMs to understand general video inputs. To this end, this paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module that enhances the capture of causal relations among audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, which comprises six representative single-modal tasks and five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on the audio, speech and image tasks in AVEB, FAVOR achieves accuracy improvements of over 20% on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrates remarkable video comprehension and reasoning abilities on tasks not previously demonstrated by other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/the-anonymous-bs/FAVOR.git, and the training code and model checkpoints will be released upon acceptance.
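The idea of causal attention across frames mentioned in the abstract can be illustrated with a minimal sketch. This is not the FAVOR implementation: the abstract does not specify the internals of the causal Q-Former, so the function name `causal_self_attention`, the single-head design without learned projections, and the toy feature values are all assumptions chosen only to show how a causal mask restricts each frame to attending to itself and earlier frames.

```python
import math


def causal_self_attention(frames):
    """Single-head self-attention over per-frame vectors with a causal mask.

    frames: list of T vectors (lists of floats), one per audio-visual frame.
    Queries, keys and values are the inputs themselves (no learned
    projections); this is a simplification for illustration only and is
    an assumption, not the structure used in the paper.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for t, query in enumerate(frames):
        # Causal mask: frame t may only attend to frames 0..t.
        scores = [scale * sum(q * k for q, k in zip(query, frames[s]))
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        # Future frames receive exactly zero attention weight.
        row = [e / total for e in exps] + [0.0] * (num_frames - t - 1)
        weights.append(row)
        outputs.append([sum(row[s] * frames[s][d] for s in range(num_frames))
                        for d in range(dim)])
    return outputs, weights


# Hypothetical fused audio-visual features for three frames (made-up values).
fused = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(fused)
```

In this sketch the attention weight matrix is lower triangular, so the representation of each frame depends only on that frame and the ones before it. A real model would add learned query, key and value projections, multiple heads, and the query tokens of a Q-Former, none of which are shown here.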