To effectively leverage user-specific data, retrieval-augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching them against segmented images. These approaches still fall short in both accuracy and efficiency, because they overlook the alignment between the query and varying image objects and the redundancy among fine-grained image segments. In this work, we present MIRAGE, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm that employs multiple intermediate granularities for varying image objects to enhance alignment. Second, we reduce redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically, making the framework practical across diverse scenarios. Our empirical study shows that MIRAGE not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times compared with the existing MVR system.