The user base of short-video apps has grown at an unprecedented rate in recent years, creating significant demand for video content analysis. In particular, text-video retrieval, which aims to retrieve from a vast video corpus the videos that best match a given text description, is an essential function whose primary challenge is bridging the modality gap. Nevertheless, most existing approaches treat texts merely as sequences of discrete tokens and neglect their syntactic structure. Moreover, the abundant spatial and temporal cues in videos are often underutilized due to the lack of interaction with the text. To address these issues, we argue that it is beneficial to use the text as guidance for focusing on relevant temporal frames and spatial regions within videos. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntactic hierarchy of texts to bridge the modality gap from two perspectives. First, to enable a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further improve multi-modal interaction and alignment, we also use the syntax hierarchy to guide the similarity calculation. We evaluate our method on four public text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
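To make the second perspective concrete, the sketch below illustrates one way a text syntax hierarchy could guide similarity calculation: text nodes at each hierarchy level (e.g., word, phrase, sentence) are matched against frame features, and the per-level scores are combined with level weights. This is a minimal toy illustration, not the SHE-Net architecture; the embeddings, hierarchy levels, and the `hierarchy_guided_similarity` function are all hypothetical, and the matching rule (max over frames, averaged over nodes) is an assumption.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def hierarchy_guided_similarity(text_nodes, frame_feats, level_weights):
    """Toy syntax-hierarchy-guided text-video similarity (an assumption,
    not the paper's formulation).

    text_nodes: dict mapping a hierarchy level name (e.g., "word",
        "phrase", "sentence") to a list of node embeddings.
    frame_feats: list of per-frame visual embeddings.
    level_weights: dict mapping each level name to its weight.

    For each level, every text node is matched to its best-aligned frame
    (max cosine), scores are averaged over the level's nodes, and levels
    are combined as a weighted mean.
    """
    total = 0.0
    for level, nodes in text_nodes.items():
        level_sim = sum(max(cosine(n, f) for f in frame_feats)
                        for n in nodes) / len(nodes)
        total += level_weights[level] * level_sim
    return total / sum(level_weights.values())
```

A retrieval system would compute this score between a query's hierarchy and every candidate video's frame features, then rank videos by the score; in practice the node and frame embeddings would come from learned text and video encoders rather than hand-set vectors.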