Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation

Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding multimodal large language models (MLLMs) in external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect," suffering from geometric distortions, quality degradation, or semantic ambiguity, which leads to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task in which MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability: visual imperfections severely degrade both retrieval recall and end-to-end MRAG performance; (2) Restoration Potential & Bottleneck: while oracle pre-processing recovers near-perfect performance, off-the-shelf MLLMs struggle with tool selection and parameter prediction without specialized training; and (3) Training Enhancement: supervised fine-tuning enables compact models to match or surpass larger proprietary models, demonstrating the benchmark's value for developing robust MRAG systems. The code is available at https://github.com/phycholosogy/VQQP_Bench
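
To make the agentic formulation concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. It assumes a V-QPP "plan" is an ordered list of (tool name, parameters) pairs that a model would emit after diagnosing the query image. The tool set (`rotate`, `hflip`, `brighten`), the `apply_plan` helper, and the toy list-of-lists image format are all hypothetical names chosen for illustration; the plan is hard-coded here in place of an MLLM prediction.

```python
from typing import Any, Callable, Dict, List, Tuple

# Toy grayscale image: a list of rows of pixel intensities (assumption for illustration).
Image = List[List[int]]


def rotate(img: Image, degrees: int) -> Image:
    """Rotate clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("this sketch only supports multiples of 90 degrees")
    out = img
    for _ in range((degrees // 90) % 4):
        # Reversing the rows and transposing is a 90-degree clockwise turn.
        out = [list(row) for row in zip(*out[::-1])]
    return out


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def brighten(img: Image, factor: float) -> Image:
    """Scale every pixel by `factor`, clipping at 255."""
    return [[min(255, round(p * factor)) for p in row] for row in img]


# Hypothetical tool registry; a real system would expose richer perceptual tools.
TOOLS: Dict[str, Callable[..., Image]] = {
    "rotate": rotate,
    "hflip": hflip,
    "brighten": brighten,
}

# A plan is what the agent must predict: which tools, in which order, with which parameters.
Plan = List[Tuple[str, Dict[str, Any]]]


def apply_plan(img: Image, plan: Plan) -> Image:
    """Apply each (tool, params) step in order and return the refined query image."""
    for name, params in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        img = TOOLS[name](img, **params)
    return img


# Usage: a clean query is degraded (rotated and darkened), then restored by a plan.
clean: Image = [[10, 20], [30, 40]]
degraded: Image = brighten(rotate(clean, 90), 0.5)
plan: Plan = [("rotate", {"degrees": 270}), ("brighten", {"factor": 2.0})]
restored: Image = apply_plan(degraded, plan)
```

In this sketch the two failure modes named in insight (2) correspond to choosing the wrong entry of `TOOLS` (tool selection) and choosing the wrong `degrees` or `factor` (parameter prediction); an oracle plan, as in the usage above, recovers the clean image exactly.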
