In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting and ending time points of a moment described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demanding in memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to approaches built on 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and the language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. The proposed prompts also compensate for the lack of spatiotemporal information in 2D CNNs for visual feature extraction. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Finally, extensive experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code and model will be released.
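To make the grounding objective concrete, the sketch below computes the temporal IoU between a predicted and a ground-truth moment, and then adds a normalized center-distance term to it. The abstract does not give the TDIoU formula, so `distance_iou_loss` is only an assumed, DIoU-style illustration of why a distance term can help, not the loss proposed in this paper. The function names and the `video_len` normalizer are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))   # overlap length, 0 if disjoint
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def distance_iou_loss(pred, gt, video_len):
    """ASSUMED illustration (not the paper's TDIoU definition):
    (1 - IoU) plus the center distance normalized by the video length."""
    iou = temporal_iou(pred, gt)
    pred_center = (pred[0] + pred[1]) / 2
    gt_center = (gt[0] + gt[1]) / 2
    center_dist = abs(pred_center - gt_center) / video_len
    return (1.0 - iou) + center_dist


if __name__ == "__main__":
    gt = (4.0, 6.0)
    # Both predictions are disjoint from gt, so IoU is 0 for each;
    # only the distance term tells them apart.
    far, near = (0.0, 2.0), (2.0, 4.0)
    print(temporal_iou(far, gt), temporal_iou(near, gt))
    print(distance_iou_loss(far, gt, 10.0), distance_iou_loss(near, gt, 10.0))
```

When the predicted and ground-truth moments do not overlap, IoU alone is zero regardless of how far apart they are. Under the assumed form above, the distance term still ranks a nearer prediction as better.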