Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
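As an illustration of the supervision construction described above, the following is a minimal sketch of how support examples and controlled non-support examples might be assembled without manual evidence annotation. It is not the authors' implementation. The `Claim` and `Example` types, the `(finding, state)` evidence index, the `related` map, and the reading of a "wrong-state" negative as evidence that describes the same finding in a different state are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    finding: str  # hypothetical structured claim field, e.g. "pleural effusion"
    state: str    # hypothetical state field, e.g. "present" or "absent"


@dataclass(frozen=True)
class Example:
    case_context: str
    evidence: str
    claim: Claim
    label: int  # 1 = evidence supports the claim for this case, 0 = it does not
    kind: str   # "support", "wrong_state", or "topic_related"


# Assumed evidence index: (finding, state) -> evidence passage.
EvidenceIndex = Dict[Tuple[str, str], str]


def build_examples(
    case_context: str,
    claim: Claim,
    evidence: EvidenceIndex,
    related: Dict[str, List[str]],
) -> List[Example]:
    """Build one support example plus controlled negatives for a single claim."""
    examples: List[Example] = []

    # Support: evidence that matches both the finding and the state of the claim.
    key = (claim.finding, claim.state)
    if key in evidence:
        examples.append(Example(case_context, evidence[key], claim, 1, "support"))

    # Wrong-state negative (assumed reading): same finding, different state.
    for (finding, state), passage in evidence.items():
        if finding == claim.finding and state != claim.state:
            examples.append(Example(case_context, passage, claim, 0, "wrong_state"))

    # Topic-related negative: evidence about a related but different finding.
    for other in related.get(claim.finding, []):
        for (finding, _state), passage in evidence.items():
            if finding == other:
                examples.append(Example(case_context, passage, claim, 0, "topic_related"))

    return examples
```

Because every negative shares either the finding or the topic with the claim, a verifier trained on such examples cannot separate the classes by topic overlap alone, which is the property the evidence-removal and evidence-swap tests are meant to probe.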