The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to accurately interpret user intent, employ diverse retrieval strategies, and effectively filter unintended or inappropriate responses, which limits their practical utility. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline that combines image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety, advancing the capabilities of multimodal retrieval systems.
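The multi-stage pipeline described in the abstract can be illustrated with a minimal sketch. All function and field names here are hypothetical placeholders (the paper does not specify an implementation); each stub stands in for an MLLM call or external API that a real system would supply.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    text: str
    relevant: bool  # set by the relevance/safety filter


def enrich_image_context(image_desc: str) -> str:
    # Stage 1 (placeholder): enrich the raw image input with context,
    # e.g. captions, detected entities, or similar-image metadata.
    return f"context[{image_desc}]"


def refine_intent(user_query: str, image_context: str) -> str:
    # Stage 2 (placeholder): infer the user's intent from the query
    # plus the enriched image context.
    return f"intent[{user_query} | {image_context}]"


def generate_queries(intent: str) -> list[str]:
    # Stage 3 (placeholder): produce contextual search queries.
    return [f"q1[{intent}]", f"q2[{intent}]"]


def call_external_apis(queries: list[str]) -> list[SearchResult]:
    # Stage 4 (placeholder): fan queries out to external search/API
    # backends and collect candidate results.
    return [SearchResult(text=f"hit[{q}]", relevant=True) for q in queries]


def filter_results(results: list[SearchResult]) -> list[SearchResult]:
    # Stage 5 (placeholder): drop irrelevant or policy-violating results;
    # in CUE-M this combines image, text, and multimodal classifiers.
    return [r for r in results if r.relevant]


def cue_m_pipeline(user_query: str, image_desc: str) -> list[SearchResult]:
    ctx = enrich_image_context(image_desc)
    intent = refine_intent(user_query, ctx)
    queries = generate_queries(intent)
    candidates = call_external_apis(queries)
    return filter_results(candidates)
```

This sketch only shows the data flow between the five stages; the actual framework implements each stage with model inference and retrieval calls, and the filter is policy-configurable rather than a boolean flag.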