Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, existing strategies are often sub-optimal because they ignore the modality imbalance problem, i.e., the semantic richness inherent in a video far exceeds that of a single limited-length sentence. In pursuit of better alignment, a natural idea is therefore to enhance the video modality so that query-irrelevant semantics are filtered out, and to enhance the text modality so that it captures more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework that achieves more balanced alignment by enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions of the frame-level features that are associated with query words while suppressing irrelevant parts. The enhanced video therefore contains less redundant semantics and is better balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With this knowledge added to the query, the textual modality carries more meaningful semantics and is better balanced with the video modality. With both levels of MESM in place, the semantic information from the two modalities is better balanced for alignment, thereby bridging the modality gap. Experiments on three widely used benchmarks, including out-of-distribution settings, show that the proposed framework achieves new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains of R1@0.5 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.
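For readers unfamiliar with how VMR results are usually scored, the reported gains refer to a Recall@1 metric at a temporal IoU threshold: a query counts as correct when its top-ranked predicted segment overlaps the ground-truth segment with an IoU at or above the threshold. The following is a minimal sketch of that standard metric, not the authors' evaluation code; the function names, the `(start, end)` span representation in seconds, and the default threshold of 0.5 are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed representation: a temporal segment as (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    preds: Sequence[Span],
    gts: Sequence[Span],
    threshold: float = 0.5,  # assumed default IoU threshold
) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Two hypothetical queries: one exact match, one weak overlap.
    top1_predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truths = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_1(top1_predictions, ground_truths, threshold=0.5))
```

Raising the threshold makes the metric stricter, so results at different thresholds are not directly comparable.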
