ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) remains unreliable in long-form settings, where retrieved evidence is noisy or contradictory, making it difficult for RAG pipelines to maintain factual consistency. Existing approaches focus on retrieval expansion or verification during generation, leaving conflict resolution entangled with generation. To address this limitation, we propose ArbGraph, a framework for pre-generation evidence arbitration in long-form RAG that explicitly resolves factual conflicts. ArbGraph decomposes retrieved documents into atomic claims and organizes them into a conflict-aware evidence graph with explicit support and contradiction relations. On top of this graph, we introduce an intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions, enabling the system to suppress unreliable and inconsistent claims before final generation. In this way, ArbGraph separates evidence validation from text generation and provides a coherent evidence foundation for downstream long-form generation. We evaluate ArbGraph on two widely used long-form RAG benchmarks, LongFact and RAGChecker, using multiple large language model backbones. Experimental results show that ArbGraph consistently improves factual recall and information density while reducing hallucinations and sensitivity to retrieval noise. Additional analyses show that these gains are especially evident under conflicting or ambiguous evidence, highlighting the effectiveness of evidence-level conflict resolution for improving the reliability of long-form RAG. The implementation is publicly available at https://github.com/1212Judy/ArbGraph.
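To make the idea of arbitrating over a conflict-aware evidence graph concrete, the following is a minimal toy sketch in Python. It is not the ArbGraph implementation: the `Edge` type, the `arbitrate` function, the damped additive update rule, the prior scores, and the 0.5 threshold are all illustrative assumptions, and the abstract does not specify the actual update equations. The sketch only shows how credibility could, in principle, be propagated along support and contradiction edges weighted by an intensity value, so that contradicted claims are filtered out before generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Directed relation between two atomic claims (hypothetical structure)."""
    src: int          # index of the claim exerting influence
    dst: int          # index of the claim being influenced
    relation: str     # "support" or "contradict"
    intensity: float  # assumed relation strength in [0, 1]


def arbitrate(prior, edges, rounds=20, damping=0.5, threshold=0.5):
    """Iteratively propagate credibility over the evidence graph.

    Assumed update rule (illustrative only): each claim moves toward
    prior + sum(sign * intensity * credibility of neighbor), clamped to [0, 1].
    """
    cred = list(prior)
    for _ in range(rounds):
        updated = []
        for i, p in enumerate(prior):
            signal = 0.0
            for e in edges:
                if e.dst == i:
                    sign = 1.0 if e.relation == "support" else -1.0
                    signal += sign * e.intensity * cred[e.src]
            target = min(1.0, max(0.0, p + signal))
            # Damped move toward the target keeps the iteration stable.
            updated.append((1.0 - damping) * cred[i] + damping * target)
        cred = updated
    kept = [i for i, c in enumerate(cred) if c >= threshold]
    return cred, kept


# Toy atomic claims (invented for illustration, not taken from any benchmark).
claims = [
    "The bridge opened in 1932.",
    "The bridge opened in 1932 after eight years of construction.",
    "The bridge opened in 1940.",
]
prior = [0.6, 0.6, 0.5]  # assumed initial credibility scores
edges = [
    Edge(0, 1, "support", 0.5), Edge(1, 0, "support", 0.5),
    Edge(0, 2, "contradict", 0.8), Edge(2, 0, "contradict", 0.8),
    Edge(1, 2, "contradict", 0.8), Edge(2, 1, "contradict", 0.8),
]

credibility, kept = arbitrate(prior, edges)
for i in kept:
    print(f"{credibility[i]:.2f}  {claims[i]}")
```

In this toy run, the two mutually supporting claims reinforce each other while the contradicted claim decays below the threshold, so only the first two claims would be passed on to the generator. A real system would also need safeguards that this sketch omits, for example against groups of mutually supporting but incorrect claims.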
