Long multimodal document question answering is limited by which evidence reaches the reader, not by how much is retrieved. In lengthy documents, the same finding often recurs across figures, captions, and introductory sentences, so the similarity-based retrievers used in modern multimodal retrieval-augmented generation (RAG) systems spend their retrieval budget on near-duplicates and overlook complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the manually tuned trade-offs required by diversity-aware selectors. Third, the selection reaches a global equilibrium via replicator dynamics, avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and requires no training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ points on MMLongBench-Doc.
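The selection step described above can be illustrated with a minimal sketch. It assumes the standard constrained dominant set formulation from the clustering literature, in which every non-query node receives a diagonal penalty `alpha` set just above the largest eigenvalue of the affinity submatrix over non-query nodes, and replicator dynamics are run on the shifted, non-negative payoff matrix. The function name `constrained_dominant_set`, the single query node, and the parameters `eps`, `max_iter`, and `tol` are hypothetical choices made for illustration; the abstract does not specify how the authors build the affinity graph or implement the solver.

```python
import numpy as np

def constrained_dominant_set(A, query_idx, eps=1e-6, max_iter=1000, tol=1e-8):
    """Return simplex weights over graph nodes (illustrative sketch only).

    A         : symmetric, non-negative affinity matrix that already
                includes the query as one of its nodes (assumption).
    query_idx : index of the query node, used as the constraint set.
    """
    A = np.asarray(A, dtype=float).copy()
    np.fill_diagonal(A, 0.0)  # no self-affinity
    n = A.shape[0]
    others = np.array([i for i in range(n) if i != query_idx])

    # Spectral bound: alpha must exceed the largest eigenvalue of the
    # affinity submatrix restricted to the non-query nodes.
    lam_max = np.linalg.eigvalsh(A[np.ix_(others, others)]).max()
    alpha = max(lam_max, 0.0) + eps

    # Penalise non-query nodes on the diagonal, then add alpha everywhere
    # so the payoff matrix is non-negative. The shift adds a constant to
    # the objective on the simplex, so its maximisers are unchanged.
    B = A.copy()
    B[others, others] -= alpha
    B += alpha

    # Discrete-time replicator dynamics started from the barycentre.
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Bx = B @ x
        x_new = x * Bx / (x @ Bx)
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    return x
```

Under these assumptions, the nodes that keep non-negligible weight in the returned vector form the selected evidence set, and the threshold used to read off that support is a further design choice that the abstract leaves open.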