Video moment retrieval is a challenging task that requires fine-grained interaction between the video and text modalities. Recent work in image-text pretraining has shown that most existing pretrained models suffer from information asymmetry caused by the difference in length between visual and textual sequences. We ask whether the same problem also arises in the video-text domain, which has the additional requirement of preserving both spatial and temporal information. To this end, we evaluate a recently proposed solution, the addition of an asymmetric co-attention network, on video grounding tasks. We also incorporate a momentum contrastive loss for robust, discriminative representation learning in both modalities. Integrating these supplementary modules yields better performance than state-of-the-art models on the TACoS dataset and comparable results on ActivityNet Captions, while using significantly fewer parameters than the baseline.