Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query. Since datasets in this domain are often gathered from limited video scenes, models tend to overfit to scene-specific factors, which leads to suboptimal performance when encountering new scenes in real-world applications. In a new scene, the fine-grained annotations are often insufficient due to the expensive labor cost, while the coarse-grained video-query pairs are easier to obtain. Thus, to address this issue and enhance model performance on new scenes, we explore the TVG task in an unsupervised domain adaptation (UDA) setting across scenes for the first time, where the video-query pairs in the source scene (domain) are labeled with temporal boundaries, while those in the target scene are not. Under the UDA setting, we introduce a novel Adversarial Multi-modal Domain Adaptation (AMDA) method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data. Specifically, we tackle the domain gap by utilizing domain discriminators, which help identify valuable scene-related features effective across both domains. Concurrently, we mitigate the semantic gap between different modalities by aligning video-query pairs with related semantics. Furthermore, we employ a mask-reconstruction approach to enhance the understanding of temporal semantics within a scene. Extensive experiments on Charades-STA, ActivityNet Captions, and YouCook2 demonstrate the effectiveness of our proposed method.
翻译:时段视频定位(TVG)旨在基于给定语言查询,在未裁剪视频中定位特定片段的起止时间边界。由于该领域的数据集通常来自有限的视频场景,模型容易过拟合场景特异性因素,导致在实际应用中遇到新场景时性能欠佳。在新场景中,细粒度标注因人工成本高昂而往往不足,但粗粒度的视频-查询对更易获取。为此,我们首次在无监督域自适应(UDA)框架下探索跨场景TVG任务,以解决上述问题并提升模型在新场景上的性能:其中源场景(域)中的视频-查询对带有时间边界标注,而目标场景中的则无标注。在UDA设定下,我们提出一种新颖的对抗性多模态域自适应(AMDA)方法,通过融入目标数据的特征信息自适应调整模型的场景相关知识。具体而言,我们利用域判别器缩小域差距,该判别器有助于识别在跨域场景中均有效的场景相关特征;同时,通过对齐具有相关语义的视频-查询对来缓解不同模态间的语义鸿沟。此外,我们采用掩码重建方法增强对场景内时间语义的理解。在Charades-STA、ActivityNet Captions和YouCook2上的大量实验证明了所提方法的有效性。