Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.
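To make the last two pipeline stages concrete, the following is a minimal sketch of what deterministic candidate grouping and conflict-aware aggregation could look like. It is not the X-MADAM-RAG implementation: the `Extraction` record, the normalization rule, and the output fields are all assumptions introduced for illustration, and the per-document extraction and visible-evidence repair stages are taken as already done.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    """Result of per-document candidate extraction (hypothetical structure)."""
    doc_id: str
    lang: str                 # e.g. "zh" or "en"
    candidate: Optional[str]  # None when the document yields no answer (noise)


def normalize(candidate: str) -> str:
    """Assumed grouping key: NFKC-fold, casefold, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", candidate).casefold()
    return " ".join(folded.split())


def group_candidates(extractions):
    """Group extractions by normalized candidate string, with no model calls."""
    groups = defaultdict(list)
    for ex in extractions:
        if ex.candidate is None or not ex.candidate.strip():
            continue  # documents without a candidate do not form a group
        groups[normalize(ex.candidate)].append(ex)
    return dict(groups)


def aggregate(extractions):
    """Report every supported candidate and flag conflict instead of
    silently picking one answer."""
    groups = group_candidates(extractions)
    if not groups:
        return {"status": "no_answer", "answers": [], "support": {}}
    status = "conflict" if len(groups) > 1 else "agreement"
    support = {key: [ex.doc_id for ex in members] for key, members in groups.items()}
    return {"status": status, "answers": sorted(groups), "support": support}
```

Given one Chinese and one English document that both yield `Paris` and a third that yields `Lyon`, `aggregate` returns status `conflict` with both candidates and the document IDs supporting each. String normalization of this kind cannot merge cross-lingual aliases such as a Chinese and an English name for the same entity, so a real system would need an additional alignment step.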