A Simple LLM Framework for Long-Range Video Question-Answering

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers), our approach couples a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) with a Large Language Model (GPT-3.5, GPT-4), yielding a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short-range and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform the long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then to answer a given input question leads to a significant LVQA performance boost. On EgoSchema, a benchmark best known for very long-form video question-answering, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NeXT-QA and IntentQA, respectively. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NeXT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.
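The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the released LLoVi code: the function names (`sample_clips`, `answer_question`), the prompt wording, and the `captioner` and `llm` callables are all hypothetical stand-ins, assumed here only to show how clip-level captions might be aggregated and how a summarize-then-answer prompt could be chained.

```python
import math
from typing import Callable, List, Sequence, Tuple

Clip = Tuple[float, float]  # (start_sec, end_sec); hypothetical clip representation


def sample_clips(video_len_s: float, clip_len_s: float = 1.0) -> List[Clip]:
    """Densely split a video into consecutive short clips (the last may be shorter)."""
    n = math.ceil(video_len_s / clip_len_s)
    return [(i * clip_len_s, min((i + 1) * clip_len_s, video_len_s)) for i in range(n)]


def answer_question(
    video_len_s: float,
    question: str,
    options: Sequence[str],
    captioner: Callable[[Clip], str],  # assumed wrapper around a clip captioner
    llm: Callable[[str], str],         # assumed wrapper around a text-only LLM
    clip_len_s: float = 1.0,
) -> str:
    # Stage 1: short-range modeling. Caption every densely sampled clip.
    captions = [captioner(clip) for clip in sample_clips(video_len_s, clip_len_s)]
    transcript = "\n".join(captions)

    # Stage 2a: ask the LLM to summarize the noisy captions first.
    # The prompt text is an illustrative assumption, not the paper's prompt.
    summary = llm(f"Summarize the following video captions:\n{transcript}")

    # Stage 2b: answer the multiple-choice question from the summary.
    choices = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Video summary:\n{summary}\n\n"
        f"Question: {question}\nOptions:\n{choices}\n"
        "Reply with the number of the best option."
    )
    return llm(prompt).strip()
```

In practice, `captioner` would wrap a model such as BLIP2, LaViLa, or LLaVA applied to the frames of each clip, and `llm` would wrap a call to GPT-3.5 or GPT-4; both are left abstract here so the sketch stays independent of any particular API.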

