Multimodal document question answering requires retrieving dispersed evidence from visually rich long documents and performing reliable reasoning over heterogeneous information. Existing multimodal RAG systems remain limited by two bottlenecks: static retrieval that ignores query complexity, and end-to-end Vision-Language Models (VLMs) that couple visual perception with logical reasoning, leading to inefficient computation and unstable answer generation. We propose AutoThinkRAG, a complexity-aware inference architecture for multimodal document QA. It has two components: (1) a Query Complexity Router that analyzes query difficulty and structure to adaptively select retrieval and reasoning paths; and (2) a Perception--Reasoning Decoupling architecture that uses a lightweight VLM as a high-fidelity visual interpreter to convert query-relevant visual cues into textual representations, which are then passed to an LLM for logical reasoning and answer synthesis. This design improves both efficiency and robustness, especially on long-document and unanswerable queries. Experiments on DocBench and MMLongBench show that AutoThinkRAG achieves 82.13\% and 51.29\% overall accuracy, respectively, while reducing per-query token consumption by 18.9\% and monetary cost by 18.2\%. Further analyses show that the gains are most pronounced on complex queries requiring adaptive retrieval and multi-step reasoning.
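As a rough illustration only (not the paper's implementation), the two components described above could be sketched as follows. All names, routing heuristics, and the string-based stand-ins for the VLM and LLM are hypothetical assumptions for exposition:

```python
# Hypothetical sketch of AutoThinkRAG's two components.
# The routing cues and the text-based "perception"/"reasoning" stages
# are illustrative stand-ins, not the authors' learned models.

def route_query(query: str) -> str:
    """Query Complexity Router: choose a retrieval/reasoning path.

    Here a shallow lexical heuristic stands in for the router that
    analyzes query difficulty and structure.
    """
    multi_hop_cues = ("compare", "difference", "trend", "why", "how many")
    lowered = query.lower()
    if any(cue in lowered for cue in multi_hop_cues) or len(query.split()) > 15:
        return "adaptive-multi-step"  # dispersed evidence, multi-step reasoning
    return "single-pass"              # direct retrieval suffices


def answer(query: str, pages: list[str]) -> str:
    """Perception-Reasoning Decoupling, as two sequential stages."""
    path = route_query(query)
    # Stage 1 (perception): a lightweight "VLM" converts query-relevant
    # visual content into textual representations (simulated here).
    descriptions = [f"[page {i}] {p}" for i, p in enumerate(pages)]
    # Stage 2 (reasoning): an "LLM" performs logical reasoning and
    # answer synthesis over text only (simulated here).
    context = "\n".join(descriptions)
    return f"({path}) answer derived from: {context}"
```

The point of the decoupling is that stage 2 never touches pixels: once the visual interpreter has emitted text, reasoning and answer synthesis are purely textual, which is what allows a smaller VLM to be paired with a stronger text-only LLM.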