Recent progress in text-video retrieval has been driven largely by contrastive learning. However, existing methods often overlook the modality gap, which forces anchor representations into in-place optimization (a phenomenon we term optimization tension) and limits their alignment capacity. Moreover, noisy hard negatives further distort anchor semantics. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ from a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across the batch, enabling structure-aware correction. Furthermore, we regularize $\Delta$ with a variational information bottleneck under relaxed compression, improving stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
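To make the core idea concrete, the sketch below shows one plausible way to fold a pair-specific increment into an InfoNCE objective. This is a minimal illustration, not the paper's implementation: the module name `GapIncrementModule`, the choice to condition $\Delta_{ij}$ on the gap $v_j - t_i$ via a small MLP, the decision to apply the increment on the video side, and the symmetric loss are all assumptions; the trust-region constraint and the information-bottleneck regularizer from the abstract are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapIncrementModule(nn.Module):
    """Hypothetical sketch: predicts a pair-specific increment Delta_ij
    from the semantic gap between each text anchor and video candidate."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, t: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # t: (B, D) text embeddings; v: (B, D) video embeddings.
        gap = v.unsqueeze(0) - t.unsqueeze(1)   # (B, B, D), gap[i, j] = v_j - t_i
        return self.mlp(gap)                    # (B, B, D) increments Delta_ij

def gap_aware_infonce(t, v, delta_module, tau=0.05):
    """Sketch of InfoNCE computed against gap-corrected video features
    v_j + Delta_ij (an assumption about where the increment is applied)."""
    t = F.normalize(t, dim=-1)
    v = F.normalize(v, dim=-1)
    delta = delta_module(t, v)                             # (B, B, D)
    v_corr = F.normalize(v.unsqueeze(0) + delta, dim=-1)   # (B, B, D)
    logits = (t.unsqueeze(1) * v_corr).sum(-1) / tau       # (B, B) similarities
    labels = torch.arange(t.size(0), device=t.device)
    # Symmetric text-to-video and video-to-text contrastive terms.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Because every $\Delta_{ij}$ in a row shares the same MLP and anchor $t_i$, gradients that would otherwise pull the anchor itself in conflicting directions are partly absorbed by the increments, which is the intuition behind relieving optimization tension.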