Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task that explicitly incorporates grounding supervision into the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined VideoChat-T, which implements token shuffling to compress long video tokens and introduces Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representations. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, which produces detailed video descriptions together with the corresponding timestamp predictions. This explicit temporal location prediction guides the MLLM to correctly attend to the visual content when generating descriptions, thus reducing the hallucination risk caused by the LLM. Experimental results demonstrate that TimeSuite provides a successful solution for enhancing the long video understanding capability of short-form MLLMs, achieving improvements of 5.6% and 6.8% on the Egoschema and VideoMME benchmarks, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming existing state-of-the-art MLLMs. After fine-tuning, it performs on par with traditional supervised expert models.
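To make the compression step concrete, below is a minimal sketch of pixel-shuffle-style token compression for long video token sequences, assuming groups of adjacent tokens are folded into the channel dimension and projected back to the original hidden size. The module name, compression ratio, and tensor dimensions are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal, hypothetical sketch of token-shuffle compression (assumption:
# adjacent tokens are folded into the channel dimension, then projected).
import torch
import torch.nn as nn

class TokenShuffle(nn.Module):
    """Compress a video token sequence by merging every `ratio` adjacent
    tokens into the channel dimension, then projecting back to the
    original hidden size. Sequence length shrinks by a factor of `ratio`."""

    def __init__(self, hidden_dim: int, ratio: int = 4):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(hidden_dim * ratio, hidden_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden_dim)
        b, n, d = tokens.shape
        assert n % self.ratio == 0, "pad the sequence to a multiple of ratio"
        # Concatenate each group of `ratio` consecutive token features.
        grouped = tokens.reshape(b, n // self.ratio, d * self.ratio)
        return self.proj(grouped)

# Usage (illustrative sizes): 96 frames x 196 patch tokens = 18816 tokens,
# reduced to 4704 tokens after shuffling with ratio 4.
x = torch.randn(1, 96 * 196, 1024)
compressed = TokenShuffle(hidden_dim=1024, ratio=4)(x)
print(compressed.shape)  # torch.Size([1, 4704, 1024])
```

The design intuition is that neighboring video tokens are highly redundant, so folding them into the channel dimension preserves local detail while cutting the sequence length the LLM must attend over.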