Recent studies have shown promising results in utilizing multimodal large language models (MLLMs) for computer vision tasks such as object detection and semantic segmentation. However, many challenging video tasks remain under-explored. Video-language tasks necessitate spatial and temporal comprehension and require significant compute. Therefore, prior works have developed complex, highly specialized architectures or leveraged additional input signals such as video transcripts to best encode contextual and temporal information, which limits their generality and can be impractical. One particularly challenging task is video moment retrieval, which requires precise temporal and contextual grounding. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We introduce Mr. BLIP (Mr. as in Moment Retrieval), a multimodal, single-stage model that requires no expensive video-language pretraining, no additional input signal (e.g., no transcript or audio), and has a simpler and more versatile design than prior state-of-the-art methods. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions and illustrate our method's versatility with a new state-of-the-art in temporal action localization on ActivityNet. Notably, we attain over 9% (absolute) higher Recall (at 0.5 and 0.7 IoU) on the challenging long-video multi-moment QVHighlights benchmark. Our code is publicly available.
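For readers unfamiliar with the metric cited above, the following is a minimal Python sketch of how Recall at an IoU threshold (R1@0.5, R1@0.7) is commonly computed for moment retrieval: the top-1 predicted (start, end) span counts as a hit if its temporal IoU with the ground-truth moment meets the threshold. This is a generic illustration assuming one ground-truth moment per query (as in Charades-STA); the function names and toy data are hypothetical and not taken from the Mr. BLIP codebase.

```python
# Generic sketch of Recall@IoU for moment retrieval; spans are
# (start, end) timestamps in seconds. Names and data are illustrative,
# not from the Mr. BLIP implementation.

def temporal_iou(pred, gt):
    """Intersection over union of two temporal spans (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold):
    """Fraction of queries whose top-1 predicted moment overlaps the
    ground-truth moment with IoU >= threshold (R1@threshold)."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Toy example: two queries, evaluated at the 0.5 and 0.7 thresholds.
preds = [(10.0, 25.0), (3.0, 8.0)]   # predicted moments
gts   = [(12.0, 26.0), (0.0, 9.0)]   # ground-truth moments
for t in (0.5, 0.7):
    print(f"R1@{t}: {recall_at_iou(preds, gts, t):.2f}")
```

In this toy case the first prediction has IoU ≈ 0.81 and the second ≈ 0.56, so R1@0.5 = 1.00 while R1@0.7 = 0.50, which illustrates why the stricter 0.7 threshold is the harder target on which the abstract reports gains.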