Although chain-of-thought (CoT) plays a crucial role in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling effort, and static RMs struggle to track evolving CoT distributions and are vulnerable to reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation and can evolve over the course of training. Inspired by recent self-evolving training methods, we propose \textbf{RLCER} (\textbf{R}einforcement \textbf{L}earning with \textbf{C}oT Supervision via Self-\textbf{E}volving \textbf{R}ubrics), which augments outcome-centric RLVR by rewarding CoTs with self-proposed, self-evolving rubrics. We show that these rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, the self-proposed rubrics further improve inference-time performance.
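As a minimal sketch of how rubric-based CoT supervision could combine with an outcome reward (the notation here is illustrative and not fixed by the abstract; $R_{\text{out}}$, $r_k$, $w_k$, and $\lambda$ are our assumptions), one can write the per-sample training reward as
\[
R(x, c, y) \;=\; R_{\text{out}}(x, y) \;+\; \lambda \sum_{k=1}^{K} w_k \, r_k(c),
\]
where $c$ is the CoT, $y$ the final answer, $r_k(c) \in [0,1]$ the score of $c$ against the $k$-th self-proposed rubric, and $w_k$ its weight; setting $R_{\text{out}} \equiv 0$ corresponds to the rubric-only setting in which RLCER is claimed to remain reliable.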