Temporal grounding, also known as video moment retrieval, aims to locate the video segments that correspond to a given query sentence. Because natural language is compositional, queries can describe moments beyond a predefined set of events, which challenges the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries in a decompose-and-reconstruct manner to achieve compositional generalization. However, they consider only dominant primitives and build negative queries through random sampling and recombination, which yields semantically implausible negatives that hinder models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding and show irrational saliency responses to negative queries that differ only subtly from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, which uses GPT-3.5-Turbo to generate semantically plausible hard negative queries. We then introduce a coarse-to-fine saliency ranking strategy that encourages the model to learn multi-granularity semantic relationships between videos and hierarchical negative queries, thereby boosting compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of the proposed method. Our code is available at https://github.com/zxccade/SHINE.
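To make the saliency ranking idea concrete, the following is a minimal sketch of one way such an objective could be written, not the formulation used in the released code. It assumes per-clip saliency scores are already available for a positive query, an LLM-generated hard negative, and a randomly recombined easy negative. The function names, the two-term hinge form, and the margin values are all illustrative assumptions.

```python
from typing import Sequence


def hinge(higher: Sequence[float], lower: Sequence[float], margin: float) -> float:
    """Mean over clips of max(0, margin - (higher - lower))."""
    if len(higher) != len(lower) or len(higher) == 0:
        raise ValueError("score sequences must be non-empty and equally long")
    total = sum(max(0.0, margin - (h - l)) for h, l in zip(higher, lower))
    return total / len(higher)


def coarse_to_fine_ranking_loss(
    pos: Sequence[float],
    hard_neg: Sequence[float],
    easy_neg: Sequence[float],
    coarse_margin: float = 0.5,  # assumed value, not taken from the paper
    fine_margin: float = 0.1,    # assumed value, not taken from the paper
) -> float:
    """Hypothetical two-level ranking loss on per-clip saliency scores.

    Coarse term: the positive query should outscore an easy negative
    by a large margin.
    Fine term: the positive query should outscore a hard negative,
    which differs only subtly, by a small margin.
    """
    coarse = hinge(pos, easy_neg, coarse_margin)
    fine = hinge(pos, hard_neg, fine_margin)
    return coarse + fine


if __name__ == "__main__":
    # Saliency scores for two clips inside the ground-truth moment.
    pos_scores = [0.9, 0.8]
    hard_scores = [0.7, 0.75]   # subtly different query
    easy_scores = [0.1, 0.2]    # clearly unrelated query
    print(coarse_to_fine_ranking_loss(pos_scores, hard_scores, easy_scores))
```

In this sketch the loss is zero once both margins are satisfied on every clip, so only clips where a negative query scores too close to, or above, the positive query contribute to the penalty.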