Step-level reward models (SRMs) can significantly enhance mathematical reasoning performance through process supervision or step-level preference alignment based on reinforcement learning. The performance of SRMs is pivotal, since they act as critical guides, ensuring that each step in the reasoning process aligns with the desired outcome. Recently, AlphaZero-like methods, in which Monte Carlo Tree Search (MCTS) is employed for automatic step-level preference annotation, have proven particularly effective. However, the precise mechanisms behind the success of SRMs remain largely unexplored. To address this gap, this study investigates the counterintuitive aspects of SRMs, focusing in particular on MCTS-based approaches. Our findings reveal that removing natural-language descriptions of thought processes has minimal impact on the efficacy of SRMs. Furthermore, we demonstrate that SRMs are adept at assessing the complex logical coherence expressed in mathematical language, yet struggle to do so in natural language. These insights provide a nuanced understanding of the core elements that drive effective step-level reward modeling in mathematical reasoning. By shedding light on these mechanisms, this study offers valuable guidance for developing more efficient and streamlined SRMs that focus supervision on the crucial components of mathematical reasoning.
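As context for the MCTS-based annotation scheme mentioned above, the following is a minimal sketch of how step-level preference pairs can be derived from rollout statistics. The functions `rollout_to_answer` and `is_correct` are hypothetical stand-ins for a policy model and an answer checker; they are illustrative assumptions, not components described in this paper.

```python
from typing import Callable, List, Tuple

def estimate_step_value(
    prefix: List[str],
    rollout_to_answer: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> float:
    """Monte Carlo value of a reasoning prefix: the fraction of
    sampled completions that reach a correct final answer."""
    wins = sum(
        is_correct(rollout_to_answer(prefix)) for _ in range(n_rollouts)
    )
    return wins / n_rollouts

def annotate_step_preferences(
    prefix: List[str],
    candidate_steps: List[str],
    rollout_to_answer: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Rank sibling candidate steps by estimated value and emit
    (preferred, dispreferred) pairs usable for SRM training."""
    scored = sorted(
        candidate_steps,
        key=lambda step: estimate_step_value(
            prefix + [step], rollout_to_answer, is_correct
        ),
        reverse=True,
    )
    # Pair the highest-value step against each lower-value sibling.
    return [(scored[0], worse) for worse in scored[1:]]
```

In a full MCTS pipeline these rollout-based value estimates would additionally be backed up through the search tree and balanced against an exploration term; the sketch keeps only the part relevant to preference annotation.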