Many studies on text-video retrieval focus on improving pretraining or developing new backbones. However, existing methods may suffer from learning and inference bias, as recent research on other text-video-related tasks suggests. For instance, spatial appearance features in action recognition or temporal object co-occurrences in video scene graph generation can induce spurious correlations. In this work, we present a systematic study of a temporal bias caused by the frame length discrepancy between the training and test sets of trimmed video clips; to the best of our knowledge, this is the first such attempt for a text-video retrieval task. We first hypothesise how the bias would affect the model and verify this with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model outperforms the baseline and the state of the art (SOTA) on nDCG, a semantic-relevancy-focused evaluation metric, which shows that the bias is mitigated, as well as on the other conventional metrics.