In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the start and end time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demands intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and the language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code is available at Open.Intel.
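The abstract names a Temporal-Distance IoU (TDIoU) loss but does not define it. As a rough illustration of the idea only, the sketch below computes the standard temporal IoU between a predicted and a ground-truth segment and adds a center-distance penalty normalized by video length. The function names, the `video_duration` parameter, and the exact form of the penalty are assumptions made for this example and should not be read as the paper's formulation.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def distance_iou_loss(pred, gt, video_duration):
    """Hypothetical loss: (1 - IoU) plus a normalized center-distance penalty.

    This is an assumed DIoU-style form, not the TDIoU definition from the paper.
    """
    iou = temporal_iou(pred, gt)
    center_gap = abs((pred[0] + pred[1]) / 2 - (gt[0] + gt[1]) / 2)
    return (1.0 - iou) + center_gap / video_duration


if __name__ == "__main__":
    # Predicted moment 0-10 s against ground truth 5-15 s in a 30 s video.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(distance_iou_loss((0.0, 10.0), (5.0, 15.0), 30.0))
```

One reason to pair IoU with a distance term in a sketch like this is that plain IoU is zero for any two non-overlapping segments, so it cannot tell a near miss from a distant one, whereas the distance term still shrinks as the prediction moves toward the target.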