The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to accurately interpret user intent, employ diverse retrieval strategies, and effectively filter unintended or inappropriate responses, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Extensive experiments on real-world datasets and public benchmarks for knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results, advancing the capabilities of multimodal retrieval systems.
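The multi-stage pipeline named in the abstract can be sketched as a chain of stage functions. This is a minimal illustrative skeleton only: every class, function, and stub below is a hypothetical stand-in, not the paper's actual implementation or any real API.

```python
# Hypothetical sketch of the CUE-M multi-stage pipeline; all names and
# stub behaviors are illustrative assumptions, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class Query:
    image: bytes
    text: str
    context: dict = field(default_factory=dict)

def enrich_image_context(q: Query) -> Query:
    # Stage 1: image context enrichment (stub for MLLM-based captioning).
    q.context["caption"] = f"caption({len(q.image)} bytes)"
    return q

def refine_intent(q: Query) -> Query:
    # Stage 2: intent refinement (stub for an intent classifier).
    q.context["intent"] = "knowledge_lookup"
    return q

def generate_queries(q: Query) -> list[str]:
    # Stage 3: contextual query generation combining text and image context.
    return [f'{q.text} {q.context["caption"]}']

def call_external_apis(queries: list[str]) -> list[dict]:
    # Stage 4: external API integration (stub retrieval results).
    return [{"query": s, "result": f"doc for: {s}"} for s in queries]

def filter_responses(results: list[dict], policy) -> list[dict]:
    # Stage 5: relevance-based filtering; in CUE-M this combines image-,
    # text-, and multimodal classifiers under organizational policy.
    return [r for r in results if policy(r)]

def cue_m_pipeline(q: Query, policy=lambda r: True) -> list[dict]:
    q = refine_intent(enrich_image_context(q))
    return filter_responses(call_external_apis(generate_queries(q)), policy)
```

A permissive `policy` passes everything through; plugging in stricter instance- or category-specific predicates models the dynamically adapting filtering pipeline the abstract describes.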