Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce \textbf{EvidenceRL}, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5$\times$ and evidence-supported diagnoses increase from 31.8\% to 61.6\%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8\% to 67.6\% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.
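To make the training signal concrete, the following is a minimal sketch of how a grounding score and a correctness score could be merged into one reward and then converted into group-relative advantages, as GRPO does for a set of responses sampled from the same prompt. The weighted sum, the weight `w_grounding`, the function names, and the example scores are all illustrative assumptions; the abstract does not specify how EvidenceRL combines the two scores.

```python
from statistics import mean, pstdev


def combined_reward(grounding: float, correctness: float,
                    w_grounding: float = 0.5) -> float:
    """Hypothetical reward: a weighted sum of the two scores.

    The weighting scheme is an assumption made for illustration only.
    """
    return w_grounding * grounding + (1.0 - w_grounding) * correctness


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidate responses with made-up (grounding, correctness) scores.
scores = [(0.9, 1.0), (0.2, 1.0), (0.1, 0.0)]
rewards = [combined_reward(g, c) for g, c in scores]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response that is correct but weakly grounded (the second candidate) receives a smaller advantage than one that is both correct and well grounded (the first), which is the kind of pressure toward evidence adherence the abstract describes.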