Building models that generate textual responses to user instructions about videos is a practical and challenging problem, as it requires both visual understanding and knowledge reasoning. Compared with the language and image modalities, training efficiency remains a serious problem, because existing studies train models on massive, sparse videos aligned with brief descriptions. In this paper, we introduce BiLL-VTG, a fast adaptive framework that leverages large language models (LLMs) to reason over videos using essential lightweight visual tools. Specifically, we show that the key to responding to a specific instruction is concentrating on the relevant video events, and we use two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent event information. An LLM equipped with world knowledge then serves as the reasoning agent, producing the response by performing multiple reasoning steps over the specified video events. To address the difficulty the agent has in specifying events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm, based on efficient Hungarian matching, that localizes the corresponding video events from linguistic instructions, enabling LLMs to interact with long videos. Extensive experiments on two typical video-based text generation tasks show that our tuning-free framework outperforms pre-trained models, including Flamingo-80B, and achieves state-of-the-art performance.
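To make the role of Hungarian matching in event localization more concrete, the following is a minimal sketch, not the authors' InsOVER implementation. It assumes that an instruction has already been decomposed into phrases, that each sampled frame yields a set of detected objects, and that a phrase-to-object similarity matrix is available for every frame. The function names (`hungarian`, `frame_score`, `localize`) and the toy similarity values are hypothetical and chosen only for illustration. Each frame is scored by its best one-to-one phrase-object assignment, and the highest-scoring frame is returned.

```python
INF = float("inf")


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list whose i-th entry is the column assigned to row i.
    Standard O(n^2 * m) potentials-based formulation, 1-indexed internally.
    """
    n, m = len(cost), len(cost[0])
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def frame_score(sim):
    """Total similarity of the best one-to-one phrase-object matching.

    `sim[i][j]` is an assumed similarity between instruction phrase i and
    detected object j. Frames with fewer objects than phrases are padded
    with zero-similarity dummy objects.
    """
    width = max(len(sim), len(sim[0]))
    padded = [list(row) + [0.0] * (width - len(row)) for row in sim]
    cost = [[-s for s in row] for row in padded]  # maximize similarity
    cols = hungarian(cost)
    return sum(padded[i][j] for i, j in enumerate(cols))


def localize(frame_sims):
    """Index of the frame whose objects best match the instruction phrases."""
    scores = [frame_score(sim) for sim in frame_sims]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: two phrases, two frames with two detected objects each.
frame_sims = [
    [[0.9, 0.1], [0.8, 0.2]],  # both phrases compete for the same object
    [[0.7, 0.1], [0.1, 0.6]],  # each phrase has its own matching object
]
best_frame = localize(frame_sims)
```

The one-to-one constraint is the design choice that matters in this sketch: a frame in which several phrases all resemble the same object cannot count that object more than once, so frames that cover every phrase with a distinct object are preferred.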