Group Relative Policy Optimization (GRPO) trains Chain-of-Thought reasoning with verifiable rewards, but estimating thought-level advantages without a value function often suffers from high variance. Tree-style branching is used in practice to reduce this variance, yet there is no theoretical account of why it works or of whether it is merely helpful or actually necessary. We study thought-level advantage estimation in GRPO from a variance perspective, in a minimal tree-style setting where multiple continuations are sampled for each thought. Using the multivariate delta method, we reveal an asymmetry between the two sampling dimensions: increasing the number of sampled thoughts ($K$) leaves a strictly positive floor on the estimation variance, whereas increasing the number of continuations per thought ($M$) drives the leading-order estimation variance to zero at rate $1/M$. It follows that, for the fixed-temperature, value-model-free GRPO-style estimator studied here, accurate thought-level advantage estimation cannot be achieved by scaling thought sampling alone, which makes continuation-level branching a principled and potentially necessary mechanism rather than a heuristic. Experiments provide empirical evidence for both its effectiveness and its potential necessity, showing improved optimization stability, training efficiency, and final performance on math and vision tasks and across model architectures and sizes.
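The $K$ versus $M$ asymmetry can be illustrated with a toy Monte Carlo sketch. This is not the paper's estimator or experimental setup; it rests on several assumptions made only for illustration: rewards are Bernoulli, each thought's true value is a success probability drawn uniformly from $[0.2, 0.8]$, a thought's value is estimated by the mean reward of its $M$ continuations, and the estimation error is measured against the group-normalized advantage computed from the true values of the same $K$ thoughts. The function names `advantage` and `estimation_mse` are hypothetical.

```python
import numpy as np


def advantage(values, eps=1e-8):
    """Group-relative advantage: center and scale within each group (last axis)."""
    mean = values.mean(axis=-1, keepdims=True)
    std = values.std(axis=-1, keepdims=True)
    return (values - mean) / (std + eps)


def estimation_mse(K, M, trials=2000, seed=0):
    """Mean squared error of estimated vs. true-value advantages.

    K: number of sampled thoughts per group.
    M: number of continuations sampled per thought.
    """
    rng = np.random.default_rng(seed)
    # Assumed true thought values: success probability of a continuation.
    p = rng.uniform(0.2, 0.8, size=(trials, K))
    # Value estimate per thought: mean of M Bernoulli rewards.
    v_hat = rng.binomial(M, p) / M
    err = advantage(v_hat) - advantage(p)
    return float(np.mean(err ** 2))


if __name__ == "__main__":
    # Scaling K with M fixed: the error is expected to plateau.
    for K in (8, 64, 512):
        print(f"K={K:4d}, M=1   -> MSE {estimation_mse(K, 1):.3f}")
    # Scaling M with K fixed: the error is expected to shrink roughly like 1/M.
    for M in (1, 16, 256):
        print(f"K=  64, M={M:<3d} -> MSE {estimation_mse(64, M):.3f}")
```

In this toy setting, the per-thought noise in the value estimate does not average out when only $K$ grows, because each thought still has only $M$ continuations behind its estimate; growing $M$ shrinks that noise directly.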