Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at https://github.com/aintdoin/FaithRL.
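To make the advantage-modulation idea concrete, here is a minimal Python sketch of step-level credit assignment. The function name, the modulation rule, and the `penalty` parameter are illustrative assumptions, not the paper's actual formulation: it simply down-weights positive credit for unsupported steps while preserving credit for well-supported partial derivations.

```python
def modulate_advantages(advantages, faithfulness, penalty=0.5):
    """Hypothetical faithfulness-aware advantage modulation (a sketch,
    not the paper's exact rule).

    advantages:   per-step advantage estimates
    faithfulness: per-step scores in [0, 1]; 1 = fully supported step
    penalty:      extra down-weighting applied to unsupported steps
    """
    out = []
    for adv, f in zip(advantages, faithfulness):
        if adv > 0:
            # Positive credit is kept only in proportion to support,
            # so unsupported steps cannot be reinforced.
            out.append(adv * f)
        else:
            # Negative credit is amplified for unsupported steps,
            # while valid partial derivations keep their signal.
            out.append(adv * (1.0 + penalty * (1.0 - f)))
    return out
```

Under this rule a fully supported correct step keeps its advantage, while an unsupported step earns no positive credit and is penalized more strongly when wrong.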