The remarkable natural language understanding, reasoning, and generation capabilities of large language models (LLMs) have made them attractive for video question answering (Video QA), where video tokens serve as contextual input. However, applying LLMs to long video understanding poses significant challenges and remains under-explored. The large number of video tokens incurs considerable computational cost for the LLM, while aggregating tokens loses visual detail. Moreover, abundant question-irrelevant tokens introduce noise into the Video QA process. To address these issues, we introduce a simple yet effective retrieval-based video language model (R-VLM) for efficient and interpretable long video QA. Specifically, given a question (query) and a long video, our model identifies and selects the $K$ most relevant video chunks and uses their associated visual tokens as context for LLM inference. This effectively reduces the number of video tokens, reduces noise interference, and improves system performance. Our experimental results validate the effectiveness of our framework for comprehending long videos. Furthermore, our model is interpretable: the retrieved chunks indicate where in the video the answer comes from.
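The retrieval step described above can be illustrated with a minimal sketch. It is not the R-VLM implementation: the abstract does not specify how relevance is scored, so cosine similarity between a question embedding and per-chunk embeddings is an assumption here, as are the function names `retrieve_top_k` and `build_context` and the choice to restore temporal order after selection.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    question_emb: Sequence[float],
    chunk_embs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k chunks most similar to the question.

    Assumption: relevance = cosine similarity (hypothetical choice).
    Indices are returned in temporal order (also an assumption), so the
    selected chunks keep their original sequence in the video.
    """
    scored = [(cosine(question_emb, emb), i) for i, emb in enumerate(chunk_embs)]
    # Highest score first; ties broken by earlier chunk index.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return sorted(i for _, i in top)


def build_context(
    chunk_tokens: Sequence[Sequence[str]],
    selected: Sequence[int],
) -> List[str]:
    """Concatenate the visual tokens of the selected chunks only."""
    return [tok for i in selected for tok in chunk_tokens[i]]


# Toy example: 4 chunks with 2-D embeddings and placeholder tokens.
chunk_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
chunk_tokens = [["a0", "a1"], ["b0"], ["c0"], ["d0"]]
selected = retrieve_top_k([1.0, 0.0], chunk_embs, k=2)
context = build_context(chunk_tokens, selected)
print(selected)  # [0, 2]
print(context)   # ['a0', 'a1', 'c0']
```

In this sketch only the tokens of the selected chunks would be passed to the LLM, and the returned chunk indices double as a pointer to where in the video the supporting evidence lies.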