Multimodal large language model (MLLM) applications employ retrieval-augmented generation (RAG) to leverage user-specific data effectively. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching them against segmented images. These methods nevertheless remain sub-optimal in both accuracy and efficiency: they overlook the alignment between the query and varying image objects, and they match against redundant fine-grained image segments. In this work, we present MIRAGE, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm that employs multiple intermediate granularities for varying image objects to enhance alignment. Second, we reduce redundancy in retrieval by exploiting cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we automatically configure parameters for each dataset, making the framework practical across diverse scenarios. Our empirical study shows that MIRAGE not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times compared with the existing MVR system.
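The coarse-to-fine idea described above can be illustrated with a minimal Python sketch. It is not MIRAGE's actual algorithm: the function names, the MaxSim-style late-interaction score, and the fixed `keep_ratio` pruning rule are all assumptions made for illustration (the abstract states that parameters are configured automatically per dataset, which this sketch does not attempt). The sketch assumes that an image scoring poorly at a coarse granularity is unlikely to score well at a finer one, so it can be dropped before the expensive fine-grained matching.

```python
import numpy as np


def maxsim(query_vecs, segment_vecs):
    """Late-interaction score (assumed here): for each query vector,
    take its best-matching segment similarity, then sum over query vectors."""
    sims = query_vecs @ segment_vecs.T  # shape: (num_query_vecs, num_segments)
    return float(sims.max(axis=1).sum())


def coarse_to_fine_search(query_vecs, images, keep_ratio=0.5):
    """Hypothetical hierarchical retrieval with pruning.

    images: list of per-image hierarchies; each hierarchy is a list of arrays
            ordered coarse -> fine, each of shape (num_segments, dim).
    keep_ratio: fraction of candidates kept after each non-final level
                (an illustrative fixed value, not a tuned parameter).
    Returns (ranked image indices, number of query-segment dot products).
    """
    num_levels = len(images[0])
    candidates = list(range(len(images)))
    cost = 0
    for level in range(num_levels):
        scores = {}
        for i in candidates:
            segs = images[i][level]
            scores[i] = maxsim(query_vecs, segs)
            cost += query_vecs.shape[0] * segs.shape[0]
        # Rank the surviving candidates by their score at this granularity.
        candidates.sort(key=lambda i: scores[i], reverse=True)
        if level < num_levels - 1:
            # Prune before moving to the next, finer (and costlier) level.
            keep = max(1, int(np.ceil(len(candidates) * keep_ratio)))
            candidates = candidates[:keep]
    return candidates, cost
```

Calling `coarse_to_fine_search(query_vecs, images)` returns the surviving images ranked by their finest-level score, together with a count of dot products that can be compared against exhaustive fine-grained matching. The savings depend entirely on how well coarse scores predict fine scores for a given dataset.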