We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling designs (e.g., memory queues, state-space layers), our approach couples a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) with a Large Language Model (LLM; e.g., GPT-3.5, GPT-4), yielding a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform the long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then to answer a given input question leads to a significant LVQA performance boost. On EgoSchema, a benchmark known for very long-form video question-answering, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NeXT-QA and IntentQA, respectively. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NeXT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.
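The two-stage decomposition described above can be illustrated with a minimal sketch. It is not taken from the released code: the names `sample_clips` and `answer_question`, the `captioner` and `llm` callables, the 1-second clip length, and the prompt wording are all assumptions chosen for illustration. Any captioner (e.g., a BLIP2 or LaViLa wrapper) and any LLM client could be plugged in through the two callables.

```python
import math
from typing import Callable, List, Sequence, Tuple

Clip = Tuple[float, float]  # (start_s, end_s) of a short clip


def sample_clips(duration_s: float, clip_len_s: float = 1.0,
                 stride_s: float = 1.0) -> List[Clip]:
    """Densely tile a long video with short clips (hypothetical helper)."""
    if duration_s <= 0 or clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("duration, clip length and stride must be positive")
    n_clips = math.ceil(duration_s / stride_s)
    return [(i * stride_s, min(i * stride_s + clip_len_s, duration_s))
            for i in range(n_clips)]


def answer_question(duration_s: float,
                    question: str,
                    options: Sequence[str],
                    captioner: Callable[[float, float], str],
                    llm: Callable[[str], str],
                    clip_len_s: float = 1.0) -> str:
    """Stage 1: caption short clips. Stage 2: summarize, then answer."""
    # Stage 1: short-term captions, one per densely sampled clip.
    clips = sample_clips(duration_s, clip_len_s, stride_s=clip_len_s)
    captions = [f"[{s:.1f}s-{e:.1f}s] {captioner(s, e)}" for s, e in clips]

    # Stage 2a: ask the LLM to condense the noisy captions first.
    # The prompt text is illustrative, not the wording used by the authors.
    summary = llm(
        "Summarize the following video captions, keeping details relevant "
        f"to the question.\nQuestion: {question}\nCaptions:\n"
        + "\n".join(captions)
    )

    # Stage 2b: answer the multiple-choice question from the summary.
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return llm(
        f"Video summary:\n{summary}\n\nQuestion: {question}\n"
        f"Options:\n{choices}\nAnswer with a single letter."
    )
```

Passing the captioner and LLM as plain callables keeps the sketch independent of any particular model API, which mirrors the abstract's point that both components are interchangeable and that their choice strongly affects accuracy.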