Language-model-based video understanding has progressed at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, prior research has focused predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In this study, we introduce VaQuitA, a framework designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we sample frames according to CLIP-score rankings, which selects frames that are better aligned with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (VQ-Former), which strengthens the interplay between the input question and the video features. We also find that adding a simple prompt, "Please be critical", to the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
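The CLIP-score-guided frame sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so everything here is an assumption: frame and question embeddings are taken as precomputed vectors (for example from a CLIP image and text encoder), the score is cosine similarity, the top `k` frames are kept, and the kept frames are returned in their original temporal order. The names `cosine` and `select_frames` are hypothetical and not taken from VaQuitA.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_frames(
    frame_embs: Sequence[Sequence[float]],
    question_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k frames most similar to the question.

    Assumption: embeddings are precomputed, e.g. by a CLIP image/text encoder.
    Assumption: selected frames are returned in temporal order, not rank order.
    """
    scores = [cosine(f, question_emb) for f in frame_embs]
    # Rank frame indices by descending similarity to the question.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore the original frame order.
    return sorted(ranked[:k])


# Toy 2-D embeddings standing in for real CLIP features.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
question = [1.0, 0.2]
selected = select_frames(frames, question, k=2)
print(selected)
```

With these toy vectors, frames 0 and 2 point most nearly in the question's direction, so they are the ones kept. Uniform sampling, by contrast, would pick frames at fixed intervals regardless of the question.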