The rapid proliferation of video content across platforms has created an urgent need for advanced video retrieval systems. Traditional methods, which rely primarily on directly matching textual queries against video metadata, often fail to bridge the semantic gap between text descriptions and the multifaceted nature of video content. This paper introduces a novel framework, Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a clustering mechanism that groups related queries, enabling the system to consider multiple interpretations and nuances of each query. The clusters are further refined by our Sweeper module, which identifies and mitigates noise within them. Additionally, we introduce the Video-Text Cluster-Attention (VTC-Att) mechanism, which dynamically adjusts focus within each cluster based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features. Experiments demonstrate that the proposed model surpasses existing state-of-the-art models on five public datasets.
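To make the cluster-attention idea concrete, the following is a minimal sketch of video-conditioned attention over a cluster of text-query embeddings. It is an illustration only: the function name, the cosine-similarity scoring, and the softmax temperature are our assumptions, not the paper's actual VTC-Att implementation.

```python
import numpy as np

def cluster_attention(video_emb, text_cluster, temperature=0.07):
    """Hypothetical sketch: weight each text query in a cluster by its
    similarity to the video, then aggregate into one cluster feature.

    video_emb    : (d,) video embedding
    text_cluster : (n, d) embeddings of the n queries in one cluster
    """
    # Normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb)
    t = text_cluster / np.linalg.norm(text_cluster, axis=1, keepdims=True)
    sims = t @ v  # (n,) similarity of each query to the video

    # Softmax over the cluster: queries closer to the video content
    # receive higher attention weight (temperature is an assumption).
    w = np.exp(sims / temperature)
    w /= w.sum()

    # Attention-weighted cluster representation for retrieval scoring.
    return w @ t
```

With a sharp temperature, the aggregated feature is dominated by the queries most relevant to the video, which is the behavior the VTC-Att mechanism is described as providing.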