Frame-wise Cross-modal Matching for Video Moment Retrieval

Video moment retrieval aims to retrieve a moment in a video for a given language query. The challenges of this task include 1) localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between the textual query and the video contents. To tackle these problems, early approaches adopt sliding windows or uniform sampling to collect video clips first and then match each clip with the query. These strategies are time-consuming and often lead to unsatisfactory localization accuracy, because the length of the ground-truth moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the relevant moment boundaries directly, without generating video clips first. One mainstream approach is to generate a multimodal feature vector for the target query and video frames (e.g., by concatenation) and then apply a regression approach to the multimodal feature vector for boundary detection. Although this approach has achieved some progress, we argue that such methods do not capture the cross-modal interactions between the query and video frames well. In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM) model, which predicts the temporal boundaries based on interaction modeling. In addition, an attention module is introduced to assign higher weights to query words with richer semantic cues, which are considered more important for finding relevant video contents. Another contribution is an additional predictor that utilizes the internal frames during model training to improve the localization accuracy. Extensive experiments on two datasets, TACoS and Charades-STA, demonstrate the superiority of our method over several state-of-the-art methods. Ablation studies have also been conducted to examine the effectiveness of the different modules in our ACRM model.
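
To make the pipeline described above more concrete, the following is a minimal NumPy sketch of the three ideas: attention-weighted query words, frame-wise query-frame interaction, and boundary prediction with an extra internal-frame predictor. It is an illustration under stated assumptions, not the ACRM implementation. The function names, the element-wise product used as the interaction, the linear scoring vectors (`w_att`, `w_start`, `w_end`, `w_in`), and the joint start/end selection rule are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_query(word_feats, w_att):
    # word_feats: (n_words, d) word embeddings; w_att: (d,) hypothetical
    # scoring vector. Words with higher scores get larger attention weights.
    weights = softmax(word_feats @ w_att)
    return weights @ word_feats, weights

def frame_interaction(frame_feats, query_vec):
    # Assumption: element-wise product as the frame-wise cross-modal
    # interaction. frame_feats: (n_frames, d), query_vec: (d,).
    return frame_feats * query_vec

def predict_boundaries(inter, w_start, w_end):
    # Per-frame start/end distributions from the interaction features.
    p_start = softmax(inter @ w_start)
    p_end = softmax(inter @ w_end)
    # Keep only pairs with start <= end and pick the most likely pair.
    joint = np.triu(np.outer(p_start, p_end))
    s, e = np.unravel_index(joint.argmax(), joint.shape)
    return int(s), int(e)

def internal_frame_probs(inter, w_in):
    # Hypothetical auxiliary predictor: per-frame probability of lying
    # inside the target moment, usable as an extra training signal.
    return 1.0 / (1.0 + np.exp(-(inter @ w_in)))

# Toy usage with random features in place of real video/text encoders.
rng = np.random.default_rng(0)
d, n_words, n_frames = 8, 5, 20
words = rng.normal(size=(n_words, d))
frames = rng.normal(size=(n_frames, d))
query_vec, word_weights = attentive_query(words, rng.normal(size=d))
inter = frame_interaction(frames, query_vec)
start, end = predict_boundaries(inter, rng.normal(size=d), rng.normal(size=d))
inside = internal_frame_probs(inter, rng.normal(size=d))
```

In a trained model the scoring vectors would be learned and the features would come from pretrained video and text encoders; here they are random so that the sketch runs on its own.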
