Visual Question Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing additional evidence on both the image and text sides, the standard procedure for answering VQA queries, especially knowledge-intensive ones, often relies on multi-stage mRAG pipelines with inherent inter-stage dependencies. To mitigate these inefficiencies while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent to dynamically decompose the mRAG pipeline when solving the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent reduces redundant computation, cutting search time by over 60\% compared to existing methods and decreasing costly tool calls. Moreover, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average across six diverse datasets. Code will be released.