ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on a rich linguistic description. Existing proposal-free methods rely on complex interactions between video and query, overemphasizing cross-modal feature fusion and feature correlation. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Specifically, we present a simple but effective proposal-free framework, the Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the learnable token offers two benefits. (1) The token is independent of the video and the query, and so avoids data bias toward the original video and query. (2) The token simultaneously aggregates global context from both video and query features. First, we employ a shared feature encoder to project both video and query into a joint feature space, then perform cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Next, we concatenate a learnable regression token [REG] with the video and query features as the input to a vision-language transformer. Finally, we use the token [REG] to predict the target moment, and the visual features to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performs well on three public datasets: ANet Captions, TACoS and YouCookII. Extensive ablation studies and qualitative analysis further validate the interpretability of ViGT.
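
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single linear projection standing in for the shared encoder, the single self-attention layer standing in for the vision-language transformer, the sigmoid heads, the sorting of the two predicted values into (start, end), and all names (`init_params`, `sketch_forward`, `w_shared`, `w_span`, `w_fg`) are assumptions made for illustration. The sketch also assumes video and query features already have the same input width.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(q, k, v):
    """Scaled dot-product attention: each row of q aggregates rows of v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def init_params(dim_in, dim, rng):
    """Hypothetical parameter set; shapes are illustrative only."""
    return {
        "w_shared": rng.normal(scale=0.1, size=(dim_in, dim)),  # shared encoder
        "reg": rng.normal(scale=0.1, size=(dim,)),              # learnable [REG] token
        "w_span": rng.normal(scale=0.1, size=(dim, 2)),         # boundary regression head
        "w_fg": rng.normal(scale=0.1, size=(dim,)),             # per-timestamp foreground head
    }

def sketch_forward(video, query, params):
    # 1) Shared encoder: the same projection maps both modalities into one space.
    v = np.tanh(video @ params["w_shared"])
    q = np.tanh(query @ params["w_shared"])

    # 2) Cross-modal co-attention with residual connections (an assumption).
    v_att = v + attend(v, q, q)  # video-to-query attention
    q_att = q + attend(q, v, v)  # query-to-video attention

    # 3) Concatenate [REG] with video and query features.
    tokens = np.concatenate([params["reg"][None, :], v_att, q_att], axis=0)

    # 4) One self-attention layer stands in for the vision-language transformer.
    tokens = tokens + attend(tokens, tokens, tokens)
    reg_out = tokens[0]
    video_out = tokens[1:1 + video.shape[0]]

    # 5) [REG] predicts the moment as two normalized times; sorting them
    #    into (start, end) is a choice made for this sketch.
    start, end = np.sort(sigmoid(reg_out @ params["w_span"]))

    # 6) Visual features give a foreground probability at each timestamp.
    foreground = sigmoid(video_out @ params["w_fg"])
    return (float(start), float(end)), foreground

rng = np.random.default_rng(0)
params = init_params(dim_in=16, dim=8, rng=rng)
video = rng.normal(size=(20, 16))  # 20 timestamps of video features
query = rng.normal(size=(6, 16))   # 6 word features
(start, end), foreground = sketch_forward(video, query, params)
```

Here `start` and `end` are fractions of the video duration, and `foreground` holds one probability per timestamp. Note that the [REG] token itself is initialized without reference to either input; it only picks up video and query context through attention in step 4.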
