Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination: they output incorrect content because they cannot accurately recognize their knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, which further exacerbates the hallucination problem. To address the high hallucination rate of slow-thinking models, we propose Knowledge-enhanced RL (KnowRL). KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
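To make the idea of adding a factuality reward to an outcome-oriented reward concrete, the following is a minimal sketch and not the KnowRL implementation. Every name in it (`KNOWLEDGE_BASE`, `split_claims`, `factuality_reward`, `outcome_reward`, `combined_reward`, `fact_weight`) is hypothetical, and the assumptions are strong: each sentence of the thinking trace is treated as one claim, knowledge verification is reduced to exact lookup in a toy set of facts, and the two rewards are combined by a weighted sum.

```python
# Illustrative sketch only; all names and design choices here are assumptions.

# Hypothetical toy knowledge base. A real system would verify claims
# against an external knowledge source rather than an in-memory set.
KNOWLEDGE_BASE = {
    "paris is the capital of france",
    "the seine flows through paris",
}


def split_claims(thinking: str) -> list[str]:
    """Naively treat each sentence of the reasoning trace as one claim."""
    return [s.strip().lower() for s in thinking.split(".") if s.strip()]


def factuality_reward(thinking: str, kb: set[str]) -> float:
    """Fraction of claims in the reasoning trace that the knowledge base supports."""
    claims = split_claims(thinking)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim in kb)
    return supported / len(claims)


def outcome_reward(answer: str, gold: str) -> float:
    """Outcome-only signal: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def combined_reward(thinking: str, answer: str, gold: str,
                    kb: set[str], fact_weight: float = 0.5) -> float:
    """Weighted sum of outcome and factuality rewards (assumed combination rule)."""
    return outcome_reward(answer, gold) + fact_weight * factuality_reward(thinking, kb)


grounded = "Paris is the capital of France. The Seine flows through Paris."
hallucinated = "Paris is the capital of France. The Thames flows through Paris."

# Both traces reach the correct answer, but only the grounded one earns
# the full factuality bonus.
print(combined_reward(grounded, "Paris", "paris", KNOWLEDGE_BASE))
print(combined_reward(hallucinated, "Paris", "paris", KNOWLEDGE_BASE))
```

Under these assumptions, two responses with the same correct final answer receive different rewards when one of them relies on an unsupported claim in its reasoning, which is the kind of signal an outcome-only reward cannot provide.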