Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for question-answering dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. By treating a video as a sequence of numbered frame images, NumPro turns VTG into an intuitive process akin to flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with the corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts the VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset establishes a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9\% in mIoU for moment retrieval and 8.5\% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
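The core preprocessing step described above, stamping a unique numerical identifier onto each frame, can be sketched as follows. This is a minimal illustration of the idea using Pillow; the overlay position, color, and font are assumptions for demonstration, not the paper's exact settings, and `number_frames` is a hypothetical helper name.

```python
from PIL import Image, ImageDraw

def number_frames(frames):
    """Overlay each frame's index onto the frame image.

    `frames` is a list of PIL Images; returns a new list with the
    frame number drawn in the bottom-right corner (an assumed position).
    """
    numbered = []
    for idx, frame in enumerate(frames):
        frame = frame.copy()  # do not mutate the caller's frames
        draw = ImageDraw.Draw(frame)
        w, h = frame.size
        # Red text near the bottom-right corner; styling is illustrative.
        draw.text((w - 40, h - 30), str(idx), fill=(255, 0, 0))
        numbered.append(frame)
    return numbered

# Example: three blank 320x240 frames receive the labels "0", "1", "2".
frames = [Image.new("RGB", (320, 240), "black") for _ in range(3)]
out = number_frames(frames)
```

The numbered frames would then be fed to the Vid-LLM in place of the raw frames, so that the model can refer to events by frame number and those numbers can be mapped back to timestamps.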