Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval

from arxiv, Code and dataset: https://github.com/dymm9977/generalized-moment-retrieval. Keywords: video moment retrieval, temporal grounding, benchmark, multi-modal learning

Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.

翻译：视频时刻检索（Video Moment Retrieval, VMR）旨在定位视频中与自然语言查询相对应的时间片段，但通常假设每个查询仅有一个匹配时刻。这一假设在现实场景中并不总是成立，因为查询可能对应多个时刻或没有时刻。为此，我们定义了广义时刻检索（Generalized Moment Retrieval, GMR），这是一种统一的任务设定，要求检索完整的相关时刻集合或预测空集。为了系统研究GMR，我们引入了Soccer-GMR，这是一个基于挑战性足球视频构建的大规模基准数据集，反映了通用的GMR场景，包含真实的负样本和正样本查询。该基准通过一种时长灵活的半自动化流程构建，并辅以人工验证，从而在保持高标注质量的同时实现可扩展的数据生成。我们进一步设计了一套统一的评估协议，采用互补的指标，分别针对空集拒绝、正样本查询定位以及端到端GMR性能进行评价。最后，我们跨两种建模范式建立了强基线：一种用于判别式VMR模型的轻量级即插即用GMR适配器，以及一种专为GMR设计的GRPO奖励函数，用于微调多模态大语言模型（MLLMs）。大量实验表明，在所有指标上均取得了一致提升，并揭示了当前方法的关键局限性，从而将GMR定位为视频-语言理解领域中一个更现实且更具挑战性的基准。