A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give different answers to the same question depending on which source it retrieves, a failure mode that the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, and release three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated five-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG. Measuring source-dependence is a responsibility for deployed multi-source NLP generally.