Moment retrieval aims to locate the moment in an untrimmed video that is most relevant to a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involve heavy computation, while the latter overlook coarse-grained information and therefore typically underperform moment-based models. This paper therefore proposes a novel 2-Dimensional Pointer-based Machine Reading Comprehension for Moment Retrieval Choice (2DP-2MRC) model, which addresses the imprecise localization of clip-based methods while maintaining lower computational complexity than moment-based methods. Specifically, we introduce an AV-Encoder to capture coarse-grained information at the moment and video levels. In addition, a 2D pointer encoder module further enhances boundary detection for the target moment. Extensive experiments on the HiREST dataset demonstrate that 2DP-2MRC significantly outperforms existing baseline models.
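The abstract does not describe how the 2D pointer output is decoded into a moment, so the following is only a minimal sketch under stated assumptions, not the 2DP-2MRC implementation. It assumes the module yields a square score matrix in which entry `scores[s][e]` rates the span that starts at clip `s` and ends at clip `e`, and that the predicted moment is the highest-scoring valid span. The function name `best_span` and the toy scores are hypothetical.

```python
def best_span(scores):
    """Pick the (start, end) clip pair with the highest score.

    Assumption: scores[s][e] rates the span from clip s to clip e.
    Only pairs with start <= end are valid, so the search is
    restricted to the upper triangle of the matrix.
    """
    best = None
    for start, row in enumerate(scores):
        for end in range(start, len(row)):
            if best is None or row[end] > best[2]:
                best = (start, end, row[end])
    return best


# Toy 3-clip example (hypothetical values). The largest entry, 0.9,
# sits at (start=1, end=0), which is not a valid span and is skipped.
scores = [
    [0.1, 0.3, 0.2],
    [0.9, 0.4, 0.8],
    [0.7, 0.6, 0.5],
]
start, end, score = best_span(scores)
print(start, end, score)
```

Scoring start and end jointly, rather than with two independent pointers, is one way a 2D formulation can exclude inconsistent boundaries such as an end that precedes the start.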