The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction: because textual queries are sparse, the model struggles to discern key visual information from background noise. To address this issue, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics human macro-perception by selecting key frames to eliminate temporal redundancy. PFCM then simulates micro-perception by aggregating patch features into salient visual entities through an attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
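To make the coarse-to-fine idea concrete, the sketch below shows one plausible reading of the two stages: a text-conditioned top-k key-frame selection (FFSM-like) followed by text-conditioned attention pooling over patch features (PFCM-like). All function names, tensor shapes, the top-k rule, and the single-query attention aggregator are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal PyTorch sketch of a coarse-to-fine text-video alignment pipeline.
# Hypothetical design: not the HVD model itself.
import torch
import torch.nn.functional as F


def select_key_frames(text_emb, frame_embs, k=4):
    """Coarse stage (FFSM-like): keep the k frames most similar to the text.

    text_emb:   (D,)   pooled text query embedding
    frame_embs: (T, D) per-frame embeddings (e.g., CLIP [CLS] tokens)
    returns:    (k,)   indices of the selected key frames
    """
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    return sims.topk(k).indices


def compress_patches(text_emb, patch_embs):
    """Fine stage (PFCM-like): aggregate patches into one text-conditioned
    visual entity via attention pooling.

    text_emb:   (D,)   pooled text query embedding
    patch_embs: (P, D) patch embeddings of a single key frame
    returns:    (D,)   one aggregated entity vector
    """
    d = patch_embs.shape[-1]
    attn = F.softmax(patch_embs @ text_emb / d ** 0.5, dim=0)  # (P,) weights
    return attn @ patch_embs                                    # weighted sum


# Usage on random tensors: 12 frames, 49 patches per frame, 512-dim features.
text = torch.randn(512)
frames = torch.randn(12, 512)
patches = torch.randn(12, 49, 512)

idx = select_key_frames(text, frames, k=4)                      # coarse step
entities = torch.stack([compress_patches(text, patches[i]) for i in idx])
score = F.cosine_similarity(entities.mean(0), text, dim=0)      # fine matching
print(score)
```

Under this reading, the coarse stage discards temporally redundant frames before any patch-level computation, so the expensive fine-grained matching runs only on the k selected frames.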