Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval

Moment retrieval, the task of locating the temporal segment of a video that matches a natural language query, remains challenging. Existing methods typically build on pre-trained models that struggle with fine-grained information and rely on deterministic reasoning, which makes it difficult to align queries with complex or ambiguous moments. To address these limitations, we explore Deep Evidential Regression (DER) and construct a vanilla evidential baseline. This baseline exposes two major issues: it cannot effectively handle modality imbalance, and structural differences in DER's heuristic uncertainty regularizer adversely affect uncertainty estimation. As a result, high uncertainty is incorrectly assigned to accurate samples rather than to challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task that enhances text sensitivity, thereby reducing bias in uncertainty estimation. We additionally introduce a Geom-regularizer that refines uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive experiments on standard datasets and on the debiased datasets ActivityNet-CD and Charades-CD demonstrate significant improvements in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval. The code is publicly available at https://github.com/KaijingOfficial/DEMR.
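For readers unfamiliar with DER, the following is a minimal sketch of the standard formulation from the original Deep Evidential Regression work (Amini et al.), on which a vanilla evidential baseline would plausibly be built. It is not the DEMR implementation, and it does not show the Geom-regularizer, whose form is not given in this abstract. The function names are illustrative; the symbols `gamma`, `nu`, `alpha`, `beta` are the Normal-Inverse-Gamma parameters a DER head predicts for each regression target.

```python
import math


def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of target y under the NIG evidential prior.

    Standard DER loss term; assumes nu > 0, alpha > 1, beta > 0.
    """
    omega = 2.0 * beta * (1.0 + nu)
    return (
        0.5 * math.log(math.pi / nu)
        - alpha * math.log(omega)
        + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
        + math.lgamma(alpha)
        - math.lgamma(alpha + 0.5)
    )


def heuristic_regularizer(y, gamma, nu, alpha):
    """DER's heuristic uncertainty regularizer: error scaled by total evidence."""
    return abs(y - gamma) * (2.0 * nu + alpha)


def uncertainties(nu, alpha, beta):
    """Return (aleatoric, epistemic) uncertainty implied by the NIG parameters."""
    aleatoric = beta / (alpha - 1.0)          # E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))   # Var[mu]
    return aleatoric, epistemic
```

One property visible in this formula is that the heuristic regularizer vanishes whenever the prediction error is zero, so it places no direct constraint on the evidence assigned to accurate predictions. In a moment retrieval setting, `y` and `gamma` would stand for a ground-truth and a predicted boundary (for example a normalized start or end time); that mapping is an assumption made here for illustration.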
