Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements on various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection, and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in real time as the video plays. Code, data and demo are available at: https://github.com/yellow-binary-tree/MMDuet.
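The duet interaction format described above can be sketched as an event loop that interleaves video frames with text messages from either side. The following is a minimal illustration only; all names (`DuetSession`, `should_respond`, `generate_reply`) are hypothetical and do not reflect the actual MMDuet API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class DuetSession:
    """Interleaves video frames with user and model text messages.

    A hypothetical sketch of the video-text duet format: the video plays
    continuously, and both parties may insert text at any timestamp.
    """
    transcript: List[Tuple[float, str, str]] = field(default_factory=list)

    def user_message(self, t: float, text: str) -> None:
        # The user may interject at any timestamp while the video plays.
        self.transcript.append((t, "user", text))

    def step(self, t: float, frame: Dict[str, Any]) -> None:
        # The model sees each frame as it arrives and may reply
        # immediately, instead of waiting for the whole video to end.
        if self.should_respond(frame):
            self.transcript.append((t, "model", self.generate_reply(frame)))

    def should_respond(self, frame: Dict[str, Any]) -> bool:
        # Placeholder decision rule; a real VideoLLM would score each
        # frame to decide whether a response is warranted here.
        return bool(frame.get("salient", False))

    def generate_reply(self, frame: Dict[str, Any]) -> str:
        # Placeholder for the model's grounded text response.
        return f"caption for {frame['id']}"


session = DuetSession()
session.user_message(0.0, "How do I cook this dish?")
for t, frame in enumerate([{"id": "f0"}, {"id": "f1", "salient": True}]):
    session.step(float(t), frame)
# The transcript now holds the user query followed by one model reply
# inserted mid-playback, at the frame judged salient.
```

The key design point this sketch highlights is that the response decision is made per frame during playback, which is what enables real-time replies and time-sensitive grounding.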