Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Video moment retrieval and highlight detection (MR/HD) have recently come into the spotlight as the demand for video understanding has increased drastically. The key objective of MR/HD is to localize the moment that matches a given text query and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to that query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in existing transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building a query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
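
The two query-dependent ideas above, cross-attention from video to text and negative video-query pairs, can be illustrated with a minimal pure-Python sketch. It is not the released QD-DETR code. It assumes that video clips act as attention queries while text tokens act as keys and values, that learned projections are omitted, that negatives are formed by shifting queries within a batch, and that the negative loss is the mean of -log(1 - s). All function names (`cross_attend`, `make_negative_pairs`, `negative_saliency_loss`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(clips, words):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: each video clip feature is the attention query, and the
    text token features serve as both keys and values, so every output
    clip vector is a query-weighted mixture of text features.
    """
    dim = len(words[0])
    out = []
    for clip in clips:
        scores = [
            sum(c * w for c, w in zip(clip, word)) / math.sqrt(dim)
            for word in words
        ]
        weights = softmax(scores)
        out.append([
            sum(a * word[d] for a, word in zip(weights, words))
            for d in range(dim)
        ])
    return out


def make_negative_pairs(videos, queries):
    """Pair each video with the query of the next sample in the batch.

    Assumption: a query taken from a different sample is irrelevant
    to the video it is paired with.
    """
    shifted = queries[1:] + queries[:1]
    return list(zip(videos, shifted))


def negative_saliency_loss(neg_scores, eps=1e-8):
    """Mean of -log(1 - s): small when irrelevant pairs score near zero.

    Assumption: scores are already squashed into the range [0, 1].
    """
    return -sum(math.log(1.0 - s + eps) for s in neg_scores) / len(neg_scores)


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]   # two toy clip features
    words = [[1.0, 0.0], [0.0, 1.0]]   # two toy text token features
    print(cross_attend(clips, words))
    print(make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"]))
    print(negative_saliency_loss([0.1, 0.2]))
```

In this sketch the loss on negative pairs grows as their predicted saliency rises, which is the pressure that pushes the model to attend to the query rather than to video content alone.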