Zero-shot Natural Language-Video Localization (NLVL) methods have shown promising results by training NLVL models on raw video data alone, dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense knowledge to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolutional Networks (GCNs) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and uses cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements of up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
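For readers unfamiliar with the reported metrics, the following is a minimal sketch of how recall at an IoU threshold and mIoU are commonly computed for temporal localization. It assumes the usual definitions (temporal intersection-over-union between a predicted and a ground-truth segment, one top prediction per query). The function names and the example segments are hypothetical and are not taken from the CORONET implementation; the abstract does not state the exact evaluation protocol.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds). This representation is an assumption.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Span], gts: List[Span], threshold: float) -> float:
    """Fraction of queries whose top prediction reaches IoU >= threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    """Average IoU over all queries (mIoU)."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


# Hypothetical predictions and ground-truth segments for three queries.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]

r_at_03 = recall_at_iou(preds, gts, 0.3)
r_at_05 = recall_at_iou(preds, gts, 0.5)
miou = mean_iou(preds, gts)
```

In this toy example the three IoUs are 1, 1/3, and 0, so two of three queries pass the 0.3 threshold and one passes 0.5. Raising the threshold makes the recall criterion stricter, which is why results are typically reported at several thresholds alongside mIoU.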