MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality

We investigate the problem of multimodal search of target modality, in which a query in a specific target modality is enhanced with information from auxiliary modalities. The goal is to retrieve relevant objects whose contents in the target modality match the specified multimodal query. We first introduce two baseline approaches that integrate techniques from the Database, Information Retrieval, and Computer Vision communities. These baselines either merge the results of separate vector searches for each modality or perform a single-channel vector search by fusing all modalities. However, both baselines are limited in efficiency and accuracy because they fail to adequately consider the varying importance of each modality when fusing information across modalities. To overcome these limitations, we propose a novel framework called MUST. Our framework employs a hybrid fusion mechanism that combines different modalities at multiple stages. Notably, we leverage vector weight learning to determine the importance of each modality, thereby enhancing the accuracy of joint similarity measurement. Additionally, the framework uses a fused proximity graph index, enabling efficient joint search for multimodal queries. MUST offers several other advantageous properties, including a pluggable design that can integrate any advanced embedding technique, user flexibility to customize weight preferences, and modularized index construction. Extensive experiments on real-world datasets demonstrate the superiority of MUST over the baselines in both search accuracy and efficiency. Our framework searches more than 10x faster while attaining 93% higher accuracy on average. Furthermore, MUST scales to datasets containing more than 10 million data elements.
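To make the idea of a weighted joint similarity concrete, the following is a minimal sketch and not the implementation from the paper. It assumes that each object and each query is a list of unit-norm vectors, one per modality, and that the joint similarity is a weighted sum of per-modality inner products. The weights are hand-picked rather than learned, and a brute-force scan stands in for the fused proximity graph index. All function names (`joint_similarity`, `fuse`, `search`) are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def joint_similarity(query, obj, weights):
    """Weighted sum of per-modality inner products (assumed form)."""
    return sum(w * dot(q, o) for q, o, w in zip(query, obj, weights))


def fuse(vectors, weights):
    """Concatenate modalities after scaling each one by sqrt(weight).

    The inner product of two fused vectors then equals joint_similarity,
    so a single index over fused vectors could serve joint queries.
    """
    fused = []
    for v, w in zip(vectors, weights):
        fused.extend(math.sqrt(w) * x for x in v)
    return fused


def search(query, objects, weights, k):
    """Brute-force top-k by joint similarity (stand-in for a graph index)."""
    ranked = sorted(
        range(len(objects)),
        key=lambda i: joint_similarity(query, objects[i], weights),
        reverse=True,
    )
    return ranked[:k]


# Toy data: each object is [target-modality vector, auxiliary-modality vector].
objects = [
    [normalize([1.0, 0.0]), normalize([1.0, 0.0])],
    [normalize([1.0, 0.1]), normalize([0.0, 1.0])],
    [normalize([0.0, 1.0]), normalize([0.0, 1.0])],
]
query = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]

top_target_heavy = search(query, objects, [0.7, 0.3], k=2)
top_aux_heavy = search(query, objects, [0.2, 0.8], k=2)
```

In this toy data, changing the weights changes which object ranks second, which illustrates why the importance assigned to each modality affects accuracy. The `fuse` helper shows one way a per-modality weighting can be folded into a single vector space, which is the property that makes a single joint index plausible under these assumptions.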
