With the exponential growth of multi-modal data, traditional uni-modal retrieval methods struggle to meet the needs of users who want access to data across modalities. Cross-modal retrieval has emerged to address this: it enables interaction across modalities, facilitates semantic matching, and leverages the complementarity and consistency between data of different modalities. Although prior literature has reviewed the cross-modal retrieval field, it has shortcomings in timeliness, taxonomy, and comprehensiveness. This paper presents a comprehensive review of the evolution of cross-modal retrieval, from shallow statistical analysis techniques to vision-language pre-training models. Starting from a taxonomy grounded in machine learning paradigms, mechanisms, and models, the paper examines the principles and architectures underlying existing cross-modal retrieval methods. It then gives an overview of widely used benchmarks, metrics, and reported performance. Finally, the paper discusses the prospects and challenges facing cross-modal retrieval and considers potential directions for further progress in the field. To facilitate research on cross-modal retrieval, we develop an open-source code repository at https://github.com/BMC-SDNU/Cross-Modal-Retrieval.