Optimizing Large Language Models (LLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) can be conceptualized as progressively editing a query's "Reasoning Tree": the process explores nodes (tokens) and dynamically modifies the model's policy at each node. Combining this process with data scheduling yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rank queries with path-based metrics, overlooking the structure of each query's reasoning tree. In this paper, we introduce a novel metric, the Reasoning Score (r-score), which measures a query's learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, with gains of up to 3.2%. These results support our approach and indicate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
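The curriculum ordering described above, from high r-score to low r-score, can be illustrated with a minimal sketch. It assumes each query already has a precomputed r-score; the abstract does not give the formula, so the scores below are made-up values, and the function name `build_curriculum` and the equal-sized staging are hypothetical choices for illustration rather than the paper's actual procedure.

```python
from typing import Dict, List


def build_curriculum(r_scores: Dict[str, float], num_stages: int) -> List[List[str]]:
    """Order queries from high to low r-score and split them into stages."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Highest r-score (structurally simplest) first; ties broken by query id.
    ordered = sorted(r_scores, key=lambda q: (-r_scores[q], q))
    # Ceiling division so every query lands in some stage.
    stage_size = max(1, -(-len(ordered) // num_stages))
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Hypothetical r-scores; the paper's actual r-score definition is not reproduced here.
example_scores = {"q1": 0.2, "q2": 0.9, "q3": 0.5, "q4": 0.7}
stages = build_curriculum(example_scores, num_stages=2)
print(stages)
```

In this sketch, early stages hold the queries with the highest r-scores, so training would see structurally simple queries before complex ones.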