Video Paragraph Grounding (VPG) is an emerging task in video-language understanding that aims to localize multiple sentences, linked by semantic relations and temporal order, in an untrimmed video. However, existing VPG approaches rely heavily on large numbers of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Unlike previous weakly-supervised grounding frameworks, which use multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns cross-modal feature alignment and temporal coordinate regression without timestamp labels, achieving concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches that learn complementary supervision. An Augmentation Branch directly regresses the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch captures the order-guided feature correspondence for localizing multiple sentences in a normal video. Extensive experiments demonstrate that our paradigm offers superior practicality and flexibility for efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
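To make the Augmentation Branch idea concrete, the following is a minimal sketch of how boundary supervision could be obtained without timestamp labels. It assumes, hypothetically, that a pseudo video is formed by inserting the clip features of a complete video into background clips from another video, so that the span of the whole paragraph is known by construction. The abstract does not specify the construction, and the names `make_pseudo_video` and `l1_boundary_loss` are illustrative, not part of SiamGTR.

```python
import numpy as np


def make_pseudo_video(video, background, insert_at):
    """Insert `video` (T, D) into `background` (B, D) at clip index `insert_at`.

    Assumed construction (not confirmed by the source): the paragraph
    describes the whole `video`, so its normalized (start, end) inside
    the pseudo video is known without any human annotation.
    """
    if video.shape[1] != background.shape[1]:
        raise ValueError("feature dimensions must match")
    if not 0 <= insert_at <= len(background):
        raise ValueError("insert_at out of range")
    pseudo = np.concatenate(
        [background[:insert_at], video, background[insert_at:]], axis=0
    )
    total = len(pseudo)
    start = insert_at / total
    end = (insert_at + len(video)) / total
    return pseudo, (start, end)


def l1_boundary_loss(pred, target):
    """Mean absolute error between predicted and target (start, end)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.abs(pred - target).mean())


# Toy features: 4 clips of the real video, 6 background clips, 8-dim each.
video = np.ones((4, 8))
background = np.zeros((6, 8))
pseudo, span = make_pseudo_video(video, background, insert_at=2)
loss = l1_boundary_loss((0.25, 0.5), span)
```

In this sketch the regression target comes for free from the insertion position, which is one plausible way a branch could learn temporal coordinate regression under weak supervision; a weight-sharing second branch would then process the normal video.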