Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short, seconds-long video clips, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
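To make the key frame-conditioned tokenizer idea concrete, the following is a minimal PyTorch sketch of the general mechanism the abstract describes: a set of learnable queries cross-attends to visual tokens computed from sparse key frames, producing a fixed-size set of tokens that summarize a video segment for the LLM. This is an illustrative reconstruction only, not the authors' released code; the module name, dimensions, and single-layer structure are assumptions.

```python
import torch
import torch.nn as nn

class KeyFrameConditionedTokenizer(nn.Module):
    """Illustrative sketch (hypothetical, not the paper's implementation):
    learnable spatiotemporal queries cross-attend to frozen visual tokens
    from sparse key frames, yielding a fixed-size segment summary."""

    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        # Learnable queries, shared across all video segments.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, key_frame_tokens):
        # key_frame_tokens: (batch, num_key_frames * tokens_per_frame, dim),
        # e.g. patch tokens from a frozen pretrained visual encoder.
        b = key_frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attn_out, _ = self.cross_attn(q, key_frame_tokens, key_frame_tokens)
        x = self.norm1(q + attn_out)
        x = self.norm2(x + self.ffn(x))
        # (batch, num_queries, dim): a compact token set to feed the LLM.
        return x

# Usage sketch: 4 key frames, each contributing 256 patch tokens of width 768.
tokenizer = KeyFrameConditionedTokenizer()
frame_tokens = torch.randn(2, 4 * 256, 768)   # stand-in for encoder output
segment_tokens = tokenizer(frame_tokens)      # -> (2, 32, 768)
```

Because only the small query/attention module is trained while the visual encoder and LLM stay frozen, an adapter of this shape stays lightweight, which is consistent with the abstract's framing of Koala as a lightweight, self-supervised adaptation of a pretrained vLLM.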