Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query, playing a vital role in downstream tasks such as video browsing and editing. While Video Large Language Models (video LLMs) have made significant progress in understanding video content, they often face challenges in accurately pinpointing timestamps within videos, which limits their performance on VTG tasks. Therefore, to improve video LLMs' ability to effectively locate timestamps, we argue that two critical aspects need to be enhanced. First, it is essential to have high-quality instructional tuning datasets that encompass mainstream VTG tasks. Second, directly incorporating timestamp knowledge into video LLMs is crucial, as it enables models to efficiently comprehend timestamp information. To address these needs, we first introduce VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval, dense video captioning, video summarization, and video highlight detection. Furthermore, we propose a specially designed video LLM model for VTG tasks, VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames. Comprehensive experiments showcase the superior performance of VTG-LLM in comparison to other video LLM methods across various VTG tasks. Our code and datasets are available at \url{https://github.com/gyxxyg/VTG-LLM}.
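The abstract's slot-based token compression is not specified in detail here; as a rough illustration only (shapes, slot count, and the use of plain cross-attention are all assumptions, not the paper's actual module), learned slot queries can compress a long sequence of visual tokens into a fixed, small set of slot tokens:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_compress(visual_tokens, slot_queries):
    """Compress N visual tokens into K slot tokens via cross-attention.

    visual_tokens: (N, D) tokens pooled from all sampled frames
    slot_queries:  (K, D) slot embeddings (learnable in a real model;
                   randomly initialized below for illustration)
    returns:       (K, D) compressed tokens fed to the LLM
    """
    d = visual_tokens.shape[-1]
    # Each slot attends over every visual token, so output size is
    # fixed at K regardless of how many frames were sampled.
    attn = softmax(slot_queries @ visual_tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ visual_tokens  # (K, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((96 * 256, 64))  # e.g. 96 frames x 256 patch tokens
slots = rng.standard_normal((32, 64))         # compress to 32 slot tokens
out = slot_compress(tokens, slots)
print(out.shape)  # (32, 64)
```

Because the output length is a constant K, the frame-sampling budget can grow without inflating the LLM's context, which is the motivation the abstract gives for the compression module.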