Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve a given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we identify tasks that require longer video contexts and that can thus be used effectively for further evaluation of long-range video models.
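To make the resampling idea concrete, the following is a minimal sketch of how a small set of text-conditioned queries can cross-attend over many frame features and return a fixed number of tokens for an LLM. It is an illustration under stated assumptions, not the TCR implementation: the function names (`softmax`, `cross_attend`, `resample`), the single attention head, the choice to condition by adding the mean text embedding to each query, and the random stand-in features are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention on plain lists."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def resample(frame_feats, text_feats, learned_queries):
    """Compress frame features into len(learned_queries) text-conditioned tokens."""
    dim = len(learned_queries[0])
    # Assumption (hypothetical): condition each query by adding the mean text embedding.
    text_mean = [sum(t[j] for t in text_feats) / len(text_feats) for j in range(dim)]
    conditioned = [[q[j] + text_mean[j] for j in range(dim)] for q in learned_queries]
    return cross_attend(conditioned, frame_feats, frame_feats)


# Random stand-ins for frozen-encoder frame features and text embeddings.
rng = random.Random(0)
DIM, NUM_FRAMES, NUM_QUERIES = 8, 120, 4
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
text = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

tokens = resample(frames, text, queries)
print(len(tokens), len(tokens[0]))  # 4 8
```

The point of the sketch is that the output size depends only on the number of queries, not on the number of frames, which is why a cross-attention resampler of this kind can accept more than 100 frames while handing the language model a short, fixed-length sequence.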