Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has increased drastically. The key objective of MR/HD is to localize the moment described by a given text query and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to that query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building query-dependent representations for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
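The two query-dependent ideas in the abstract, cross-attention that injects the text query into the video representation and negative video-query pairs trained toward low saliency, can be illustrated with a minimal NumPy sketch. This is not the released QD-DETR code. It assumes a single attention head without learned projections, builds negative pairs by rolling the queries by one position within the batch, uses a hypothetical linear saliency head `w`, and uses a binary cross-entropy term with target 0 as the negative-pair loss. All function names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(video, text):
    """Video clips act as attention queries; text tokens act as keys and values.

    video: (B, Lv, d), text: (B, Lt, d) -> query-conditioned clips (B, Lv, d).
    Assumption: one head, no learned projections, for brevity.
    """
    d = video.shape[-1]
    scores = video @ text.transpose(0, 2, 1) / np.sqrt(d)  # (B, Lv, Lt)
    weights = softmax(scores, axis=-1)
    return weights @ text


def make_negative_pairs(text):
    """Pair each video with the query of another sample in the batch.

    Assumption: rolling the batch by one is just one simple way to obtain
    irrelevant pairs; it needs a batch size of at least 2.
    """
    return np.roll(text, shift=1, axis=0)


def negative_saliency_loss(neg_scores):
    """Binary cross-entropy with target 0 (an assumed loss form):
    it is small only when irrelevant pairs receive low saliency."""
    p = 1.0 / (1.0 + np.exp(-neg_scores))
    return float(-np.log(1.0 - p + 1e-8).mean())


rng = np.random.default_rng(0)
B, Lv, Lt, d = 4, 6, 3, 8                # batch, clips, query tokens, feature dim
video = rng.normal(size=(B, Lv, d))
text = rng.normal(size=(B, Lt, d))
w = rng.normal(size=(d,))                # hypothetical linear saliency head

pos_repr = cross_attend(video, text)                       # matched pairs
neg_repr = cross_attend(video, make_negative_pairs(text))  # irrelevant pairs
neg_scores = neg_repr @ w                                  # (B, Lv) saliency logits
loss = negative_saliency_loss(neg_scores)
```

In this sketch every clip representation is a weighted mix of query tokens, so a mismatched query changes the representation directly. That dependence is what the negative-pair loss relies on when it pushes saliency down for irrelevant pairs.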