D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Weakly supervised methods still lag far behind fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce annotation cost while keeping performance on the TSG task competitive with fully supervised methods. To this end, we investigate the recently proposed glance-supervised temporal sentence grounding task, which requires only a single-frame annotation (referred to as a glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map by jointly leveraging the Gaussian prior and semantic consistency, which helps align positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias introduced by glance annotation and to model complex queries consisting of multiple events, we propose the DGA module, which dynamically adjusts the distribution to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of D3G. It outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
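
To make the idea of a Gaussian prior over a 2D temporal map more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a video is split into `num_clips` clips, that each candidate moment is a (start clip, end clip) pair, and that a moment's prior weight is a Gaussian of the normalized distance between its center and the glance position. The function names, the `sigma` value, and the center-based weighting are all illustrative assumptions; the abstract does not specify them.

```python
import math


def gaussian_prior_map(num_clips, glance_idx, sigma=0.2):
    """Build a num_clips x num_clips map of prior weights (hypothetical helper).

    Entry [start][end] is the weight of the candidate moment spanning clips
    start..end. Entries with end < start are invalid and stay at 0.0.
    Assumption: the weight depends only on how far the moment's center lies
    from the glance, normalized by the number of clips.
    """
    prior = [[0.0] * num_clips for _ in range(num_clips)]
    for start in range(num_clips):
        for end in range(start, num_clips):
            center = (start + end) / 2.0
            dist = (center - glance_idx) / num_clips
            prior[start][end] = math.exp(-0.5 * (dist / sigma) ** 2)
    return prior


def top_k_moments(prior, k=3):
    """Return the k valid (start, end) moments with the largest prior weight."""
    candidates = [
        (weight, start, end)
        for start, row in enumerate(prior)
        for end, weight in enumerate(row)
        if end >= start
    ]
    # Stable sort: moments with equal weight keep their enumeration order.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(start, end) for _, start, end in candidates[:k]]


if __name__ == "__main__":
    prior = gaussian_prior_map(num_clips=8, glance_idx=3)
    print(top_k_moments(prior, k=3))
```

Under these assumptions, every moment centered on the glance receives the same weight regardless of its length, so the prior alone cannot choose among them. This is consistent with the abstract's point that the Gaussian prior is combined with semantic consistency when sampling positive moments.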
