Group Relative Policy Optimization (GRPO) trains Chain-of-Thought reasoning with verifiable rewards, but estimating thought-level advantages without a value function often suffers from high variance. Although tree-style branching is used in practice to reduce this variance, there is no theoretical account of why it works or of whether it is merely helpful or in fact necessary. We study thought-level advantage estimation in GRPO from a variance perspective under a minimal tree-style setting in which multiple answers are sampled for each thought. Using the multivariate delta method, we reveal an asymmetry in how the two sampling dimensions affect variance: increasing the number of sampled thoughts ($K$) leaves a strictly positive variance floor, whereas increasing the number of answers per thought ($M$) decreases the variance monotonically, driving it asymptotically to zero. This implies that accurate thought-level advantage estimation cannot be achieved by scaling thought sampling alone, making branching a potentially necessary mechanism rather than a heuristic. Experiments provide further empirical evidence for both the effectiveness and the necessity of answer-level branching, demonstrating improved optimization stability, training efficiency, and final performance not only in math but also across a broad range of vision domains and under different model architectures and sizes.
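The $K$-vs-$M$ asymmetry can be checked with a small Monte Carlo sketch. The setup below is illustrative, not the paper's implementation: each thought is assigned a hypothetical success probability, binary rewards are drawn for $M$ answers per thought, and the group-relative advantage of the first thought is its mean reward minus the group mean. Scaling $K$ with $M=1$ leaves the per-thought Bernoulli noise in place, while scaling $M$ shrinks the estimator variance toward zero.

```python
import numpy as np

def advantage_variance(K, M, trials=20000, seed=0):
    """Monte Carlo variance of a group-relative advantage estimate
    for the first thought, with K sampled thoughts and M answers each.
    Per-thought success probabilities are hypothetical placeholders."""
    rng = np.random.default_rng(seed)
    p = np.full(K, 0.5)   # baseline thoughts
    p[0] = 0.7            # the thought whose advantage we estimate
    # mean verifiable reward per thought across M sampled answers
    rewards = rng.binomial(M, p, size=(trials, K)) / M
    # GRPO-style advantage: own mean reward minus group mean reward
    estimates = rewards[:, 0] - rewards.mean(axis=1)
    return estimates.var()

# Scaling K alone keeps the single-answer Bernoulli noise (variance floor);
# scaling M per thought drives the estimator variance toward zero.
v_big_K = advantage_variance(K=256, M=1)
v_big_M = advantage_variance(K=4, M=256)
```

Here `v_big_K` stays near the Bernoulli variance $p(1-p) \approx 0.21$ of a single sampled answer regardless of how large $K$ grows, while `v_big_M` falls roughly as $1/M$, matching the asymmetry the analysis derives.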