Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks

Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared with conventional text retrieval, the main obstacle in text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous work primarily focuses on aligning the query and the video by finely aggregating word-frame matching signals. Humans, however, judge the relevance between a text and a video modularly, and because video content is continuous and complex, this judgment requires high-order matching signals. Inspired by this cognitive process, we propose chunk-level text-video matching, in which query chunks are extracted to describe specific retrieval units and video chunks are distinct clips segmented from the video. We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video, and introduce a multi-modal hypergraph for this purpose. The hypergraph is constructed by representing textual units and video frames as nodes and using hyperedges to depict their relationships. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component, which yields variational representations under a Gaussian distribution. Combining hypergraphs with variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
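The two ingredients named in the abstract, a multi-modal hypergraph over words and frames and a Gaussian variational representation, can be illustrated with a minimal sketch in plain Python. The abstract does not specify how hyperedges are formed or which propagation rule is used, so everything below is an assumption made for illustration: the function names (`build_incidence`, `hypergraph_propagate`, `reparameterize`, `kl_to_standard_normal`) are hypothetical, each hyperedge simply joins one query chunk with one video chunk, propagation is a parameter-free mean over hyperedges, and the Gaussian is taken to be diagonal with a standard normal prior.

```python
import math
import random


def build_incidence(num_words, num_frames, hyperedges):
    """Return the |V| x |E| incidence matrix H of a multi-modal hypergraph.

    Nodes 0..num_words-1 are query words; the remaining nodes are video frames.
    Each hyperedge is a (word_indices, frame_indices) pair, i.e. one query
    chunk joined with one video chunk (an assumed pairing rule).
    """
    n = num_words + num_frames
    H = [[0.0] * len(hyperedges) for _ in range(n)]
    for e, (words, frames) in enumerate(hyperedges):
        for w in words:
            H[w][e] = 1.0
        for f in frames:
            H[num_words + f][e] = 1.0
    return H


def hypergraph_propagate(H, X):
    """One smoothing step: average nodes into hyperedges, then back to nodes."""
    n, m, d = len(H), len(H[0]), len(X[0])
    edge_deg = [max(sum(H[v][e] for v in range(n)), 1.0) for e in range(m)]
    node_deg = [sum(H[v][e] for e in range(m)) for v in range(n)]
    # Hyperedge feature = mean of the features of its member nodes.
    E = [[sum(H[v][e] * X[v][k] for v in range(n)) / edge_deg[e]
          for k in range(d)] for e in range(m)]
    out = []
    for v in range(n):
        if node_deg[v] == 0:
            # A node that belongs to no hyperedge keeps its own features.
            out.append(list(X[v]))
        else:
            # Node feature = mean of the features of its incident hyperedges.
            out.append([sum(H[v][e] * E[e][k] for e in range(m)) / node_deg[v]
                        for k in range(d)])
    return out


def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Toy example: 3 query words, 4 frames, 2 chunk-level hyperedges.
H = build_incidence(3, 4, [([0, 1], [0, 1]), ([2], [2, 3])])
X = [[float(i), 1.0] for i in range(7)]  # placeholder node features
Y = hypergraph_propagate(H, X)
z = reparameterize(Y[0], [0.0, 0.0], random.Random(0))
```

In this toy setup a single hyperedge connects several words and several frames at once, which is what distinguishes n-ary modeling from pairwise word-frame matching. A trained model would replace the mean aggregation with learned weights and predict `mu` and `log_var` from the propagated features; those details are not given in the abstract.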
