Video moment retrieval (VMR) aims to identify the specific moment in an untrimmed video that corresponds to a given natural language query. However, the task often suffers from weak visual-textual alignment caused by query ambiguity, which can limit further performance gains and generalization. Because multimodal interactions in videos are complex, a query may not fully cover the relevant details of its corresponding moment, and the moment itself may contain misaligned or irrelevant frames. To tackle this problem, we propose a straightforward yet effective model called the Background-aware Moment DEtection TRansformer (BM-DETR). In addition to a target query and its moment, BM-DETR takes negative queries that correspond to different moments in the same video. Specifically, our model learns to predict the target moment from the joint probability, computed for each candidate frame, of the given query and the complement of the negative queries. In this way, it uses the surrounding background to weigh the relative importance of each frame, improving moment sensitivity. Extensive experiments on Charades-STA and QVHighlights demonstrate the effectiveness of our model. Moreover, we show that BM-DETR performs robustly in three challenging VMR scenarios, including several out-of-distribution test cases, demonstrating superior generalization ability.
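The per-frame scoring idea can be illustrated with a minimal sketch. It is not the BM-DETR implementation: it assumes that each frame already has one logit for the target query and one for a single negative query, and that the two events are treated as independent, so the joint probability is the product of the target probability and the complement of the negative probability. The names `frame_probabilities` and `best_span`, the logit values, and the threshold are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_probabilities(pos_logits, neg_logits):
    """Per-frame joint probability: P(target query) * (1 - P(negative query)).

    Assumption (not from the paper): the two events are independent,
    so the joint probability is a simple product.
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("logit sequences must cover the same frames")
    return [
        sigmoid(p) * (1.0 - sigmoid(n))
        for p, n in zip(pos_logits, neg_logits)
    ]


def best_span(probs, threshold=0.5):
    """Return inclusive (start, end) frame indices of the contiguous run of
    frames scoring >= threshold with the largest summed probability,
    or None if no frame passes the threshold."""
    best, best_score = None, 0.0
    start, score = None, 0.0
    # A -inf sentinel closes a run that extends to the last frame.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        if p >= threshold:
            if start is None:
                start, score = i, 0.0
            score += p
        elif start is not None:
            if score > best_score:
                best, best_score = (start, i - 1), score
            start = None
    return best


# Hypothetical logits for six frames. Frame 1 matches the target query
# fairly well but matches the negative query even better, so it is
# suppressed as background for the target moment.
pos_logits = [-2.0, 1.0, 3.0, 3.0, 0.5, -1.0]
neg_logits = [-3.0, 2.5, -2.0, -3.0, 2.0, -3.0]

probs = frame_probabilities(pos_logits, neg_logits)
span = best_span(probs)
print([round(p, 3) for p in probs])
print(span)
```

In this toy input, frames 1 and 4 respond to the target query on their own, yet the complement term pushes their scores down because the negative query explains them better, which is the kind of background suppression the abstract describes.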