Document Question Answering (DQA), a key task in document understanding, generates answers to a user's query from a document. Because the task requires interpreting visual layouts, recent studies have adopted multimodal Retrieval-Augmented Generation (RAG), which processes page images for answer generation. However, visual DQA under multimodal RAG struggles to use a large number of images effectively: the retrieval stage often retains only a few candidate pages (e.g., Top-4), so informative but less visually salient content is overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) that explicitly models the varying importance of the multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates the retrieval budget toward high-value aspects. Using the most informative pages and their correlations, MAB-DQA then generates the final answer. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code is available at https://github.com/ElephantOH/MAB-DQA.
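The budget reallocation described in the abstract can be illustrated with a minimal sketch. The abstract does not name the exploration-exploitation policy, so this example assumes the standard UCB1 rule as a stand-in. The function name `allocate_budget`, the `reward_fn` callback (standing in for the preliminary reasoning step), and the page identifiers are hypothetical and are not taken from the MAB-DQA code.

```python
import math
from typing import Callable, Dict, List, Optional


def allocate_budget(
    candidates: Dict[str, List[str]],
    reward_fn: Callable[[str, str], float],
    budget: int,
    c: float = 1.0,
) -> List[str]:
    """Select up to `budget` pages by treating each subquery as a bandit arm.

    candidates: subquery -> ranked page ids (assumed retrieval output).
    reward_fn:  hypothetical scorer for (subquery, page); higher is better.
    c:          exploration weight of the UCB1 bonus (assumed policy).
    """
    arms = list(candidates)
    pulls = {a: 0 for a in arms}        # times each arm was chosen
    total = {a: 0.0 for a in arms}      # summed reward per arm
    cursor = {a: 0 for a in arms}       # next unread rank per arm
    seen = set()                        # pages already selected by any arm
    selected: List[str] = []

    def next_page(arm: str) -> Optional[str]:
        # Skip pages another arm already consumed.
        pages = candidates[arm]
        while cursor[arm] < len(pages) and pages[cursor[arm]] in seen:
            cursor[arm] += 1
        return pages[cursor[arm]] if cursor[arm] < len(pages) else None

    for t in range(1, budget + 1):
        live = [a for a in arms if next_page(a) is not None]
        if not live:
            break  # every candidate list is exhausted

        def ucb(a: str) -> float:
            if pulls[a] == 0:
                return math.inf  # try every arm at least once
            mean = total[a] / pulls[a]
            return mean + c * math.sqrt(2.0 * math.log(t) / pulls[a])

        arm = max(live, key=ucb)
        page = next_page(arm)
        seen.add(page)
        total[arm] += reward_fn(arm, page)
        pulls[arm] += 1
        selected.append(page)

    return selected


if __name__ == "__main__":
    # Toy data: the "layout" aspect is rewarded, the "date" aspect is not.
    demo = {"layout": ["p1", "p2", "p3"], "date": ["p4", "p5", "p6"]}
    print(allocate_budget(demo, lambda q, p: 1.0 if q == "layout" else 0.0, 4))
```

In this sketch each arm is tried once, after which most of the remaining budget flows to the arm whose pages earned higher rewards. A real system would replace `reward_fn` with a model-based judgment of how useful a page is for answering the query.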