BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

Building models that generate textual responses to user instructions about videos is a practical and challenging problem, as it requires both visual understanding and knowledge reasoning. Compared with the language and image modalities, training efficiency remains a serious problem, because existing studies train models on massive collections of sparse videos aligned with brief descriptions. In this paper, we introduce BiLL-VTG, a fast adaptive framework that leverages large language models (LLMs) to reason about videos using essential lightweight visual tools. Specifically, we show that the key to responding to a specific instruction is concentrating on the relevant video events, and we use two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent the event information. An LLM equipped with world knowledge is then adopted as the reasoning agent, producing the response by performing multiple reasoning steps on the specified video events. To address the difficulty LLM agents have in specifying events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm, based on efficient Hungarian matching, that localizes the corresponding video events from linguistic instructions, enabling LLMs to interact with long videos. Extensive experiments on two typical video-based text generation tasks show that our tuning-free framework outperforms pre-trained models, including Flamingo-80B, and achieves state-of-the-art performance.
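The abstract does not spell out how InsOVER applies Hungarian matching, so the following is only a minimal sketch of the general idea: assign each phrase taken from an instruction to a distinct candidate event description so that total similarity is maximal. The function names (`hungarian`, `jaccard`, `match_events`), the example phrases and captions, and the word-overlap similarity are all assumptions made for illustration; they are not taken from the paper.

```python
def hungarian(cost):
    """Minimum-cost assignment of n rows to m columns (n <= m).

    Standard O(n^2 * m) Hungarian algorithm with row/column potentials.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "need at least as many columns as rows"
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a learned similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def match_events(phrases, captions):
    """Assign each instruction phrase to a distinct caption, maximizing similarity."""
    cost = [[1.0 - jaccard(ph, cap) for cap in captions] for ph in phrases]
    return hungarian(cost)


if __name__ == "__main__":
    # Hypothetical instruction phrases and per-segment captions.
    phrases = ["a man opens the door", "a dog runs outside"]
    captions = [
        "a dog runs outside in the yard",
        "a woman cooks dinner",
        "a man opens the door slowly",
    ]
    for phrase, idx in zip(phrases, match_events(phrases, captions)):
        print(f"{phrase!r} -> segment {idx}: {captions[idx]!r}")
```

In a real pipeline the similarity would more plausibly come from a vision-language model or text embeddings rather than word overlap, and the matched segment indices would select which video events are passed to the LLM.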

