A recent trend in multimodal retrieval is to postprocess test-set results with the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that the entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state of the art on several standard text-video retrieval datasets, both with access to the entire test set and in the single-query setting.
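To make the two postprocessing ideas concrete, the following is a minimal NumPy sketch and not the implementation evaluated in this work. It assumes a query-by-item similarity matrix `sim`, one commonly used inference-time form of DSL (reweighting each score by a softmax taken over the query axis), and a generic log-domain Sinkhorn normalization that alternates row and column normalization. The function names, the `temperature` and `n_iters` parameters, and their default values are all hypothetical. The single-query setting is not covered here.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis, keeping dimensions.
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def dsl_rerank(sim, temperature=100.0):
    """Assumed inference-time DSL: scale scores by a softmax over queries.

    sim has shape (num_queries, num_items).
    """
    logits = sim * temperature
    prior = np.exp(logits - _logsumexp(logits, axis=0))  # each column sums to 1
    return sim * prior


def sinkhorn_rerank(sim, temperature=100.0, n_iters=10):
    """Generic Sinkhorn normalization of exp(sim * temperature) in log space.

    Returns log-scores; the final step normalizes columns, so every column
    of exp(result) sums to 1.
    """
    log_p = sim * temperature
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # normalize rows
        log_p = log_p - _logsumexp(log_p, axis=0)  # normalize columns
    return log_p


# Illustrative usage on random scores (not real retrieval data).
rng = np.random.default_rng(0)
example_sim = rng.uniform(-1.0, 1.0, size=(4, 4))
dsl_ranking = np.argsort(-dsl_rerank(example_sim), axis=1)
sinkhorn_ranking = np.argsort(-sinkhorn_rerank(example_sim), axis=1)
```

Both functions need the full score matrix because they normalize across queries, which is the requirement the abstract identifies as a limitation of DSL.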