Question answering over long multimodal documents is limited by which evidence reaches the reader, not by how much is retrieved. In lengthy documents, the same finding often recurs across figures, captions, and introductory sentences, so the similarity-based retrievers in modern multimodal retrieval-augmented generation (RAG) systems spend their retrieval budget on near-duplicates and overlook complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is set automatically by a spectral bound, eliminating the manually tuned trade-offs that diversity-aware selectors require. Third, the selection reaches a global equilibrium via replicator dynamics, avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and requires no training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ points on MMLongBench-Doc.
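The selection step described above can be illustrated with a minimal sketch. It assumes the standard constrained dominant set formulation: maximize $x^\top (A - \alpha \hat{I}) x$ over the simplex, where $\hat{I}$ is the identity restricted to non-anchor nodes and $\alpha$ exceeds the largest eigenvalue of the affinity submatrix on those nodes. It also assumes a single query node as the anchor. The function name `constrained_dominant_set`, its parameters, and the toy affinity values are hypothetical and chosen only for illustration; how this work builds its affinity graph is not specified here.

```python
import numpy as np


def constrained_dominant_set(affinity, anchor, margin=1e-3, n_iter=1000,
                             tol=1e-10, support_eps=1e-5):
    """Illustrative CDS extraction with replicator dynamics (hypothetical helper).

    affinity: symmetric, non-negative (n, n) array; the diagonal is ignored.
    anchor:   index of the query node that the selected set must contain.
    Returns (support indices, weight vector on the simplex).
    """
    A = np.array(affinity, dtype=float)
    np.fill_diagonal(A, 0.0)
    n = A.shape[0]

    # Mark every node except the anchor as unconstrained.
    free = np.ones(n, dtype=bool)
    free[anchor] = False

    # Spectral bound: alpha must exceed the largest eigenvalue of the
    # affinity submatrix restricted to the unconstrained nodes.
    lam_max = np.linalg.eigvalsh(A[np.ix_(free, free)]).max()
    alpha = lam_max + margin

    # Penalize the diagonal of unconstrained nodes, then add alpha to every
    # entry. On the simplex this shift only adds a constant to the objective,
    # and it keeps the matrix non-negative for the replicator update.
    B = A - alpha * np.diag(free.astype(float)) + alpha

    x = np.full(n, 1.0 / n)  # start from the barycenter of the simplex
    for _ in range(n_iter):
        Bx = B @ x
        x_new = x * Bx / (x @ Bx)  # discrete-time replicator update
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break

    return np.flatnonzero(x > support_eps), x


# Toy graph: node 0 is the query; nodes 1-2 are strongly tied to it, while
# nodes 3-5 form a tight cluster that is only weakly tied to the query.
toy = np.full((6, 6), 0.05)
toy[np.ix_([0, 1, 2], [0, 1, 2])] = 0.9
toy[np.ix_([3, 4, 5], [3, 4, 5])] = 0.95

support, weights = constrained_dominant_set(toy, anchor=0)
print(support, np.round(weights, 3))
```

In this toy graph the tight cluster of nodes 3-5 is the most internally coherent group, yet the constraint pulls the selection toward the nodes connected to the query. No relevance-diversity weight is tuned by hand; `alpha` comes from the eigenvalue bound alone.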