In the real world, where information is abundant and spans diverse modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, images, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large-model era, and several surveys on multimodal learning and transformer-based vision-language models have been published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, and it serves as a timely complement to existing reviews on multimodal fusion. To help readers quickly track this field, we have built a project page for this survey, available at https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.