As large language models trend toward smaller, more efficient variants, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer happens to be correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), which introduces step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated-resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and the final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
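To make the two ingredients of the abstract concrete, the sketch below illustrates (a) combining a per-step faithfulness signal, such as one produced by a process reward model, with an outcome-based reward, and (b) locating the longest faithful prefix from which truncated resampling would restart generation. The function names, weighting scheme, and threshold are illustrative assumptions, not the paper's actual formulation.

```python
def step_level_reward(step_faithfulness, outcome_correct,
                      alpha=0.5, beta=0.5):
    """Blend a mean per-step faithfulness score (assumed to come from a
    process reward model) with a binary outcome reward.

    The alpha/beta weights are hypothetical; the paper's exact reward
    shaping may differ.
    """
    process_term = sum(step_faithfulness) / len(step_faithfulness)
    outcome_term = 1.0 if outcome_correct else 0.0
    return alpha * process_term + beta * outcome_term


def faithful_prefix_length(step_faithfulness, threshold=0.5):
    """Length of the longest prefix of reasoning steps whose
    faithfulness stays at or above the threshold; truncated resampling
    would regenerate the chain from this point onward."""
    n = 0
    for score in step_faithfulness:
        if score < threshold:
            break
        n += 1
    return n
```

For example, a chain whose three steps score `[1.0, 0.5, 0.0]` with a correct final answer receives a blended reward of 0.75 under the default weights, and a chain scoring `[0.9, 0.8, 0.2, 0.9]` would be truncated after its first two steps before resampling.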