Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three approaches have been widely adopted: (i) Proximal policy optimization and advantage actor-critic rely on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. (ii) Group relative policy optimization (GRPO) avoids training a value network by approximating the value function using sample averages. However, GRPO needs a large number of reasoning traces per prompt for this approximation to be accurate, which makes it computationally expensive. (iii) REINFORCE-type algorithms sample only a single reasoning trace per prompt; this reduces computational cost, but the algorithms suffer from poor sample efficiency. In this work, we focus on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
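To make the contrast between a per-prompt sample average and a kernel-smoothed value estimate concrete, the following is a minimal sketch rather than the estimator used in this work. It assumes that each prompt comes with a fixed embedding vector and that a Gaussian kernel on those embeddings measures prompt similarity; the function names, the embedding representation, and the `bandwidth` parameter are all illustrative assumptions. The value of a prompt is estimated by a Nadaraya-Watson weighted average that borrows rewards from similar prompts, so that a few traces per prompt may still yield a usable baseline.

```python
import math


def group_mean_baseline(rewards):
    """GRPO-style baseline: average reward of the traces sampled for one prompt."""
    return sum(rewards) / len(rewards)


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel on the Euclidean distance between two embeddings (assumed similarity)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_smoothed_values(embeddings, rewards_per_prompt, bandwidth):
    """Nadaraya-Watson value estimate for every prompt.

    Each prompt's value is a kernel-weighted average of the rewards observed
    for all prompts, with weights given by embedding similarity.
    """
    values = []
    for x in embeddings:
        weighted_reward_sum = 0.0
        weight_total = 0.0
        for y, rewards in zip(embeddings, rewards_per_prompt):
            w = gaussian_kernel(x, y, bandwidth)
            weighted_reward_sum += w * sum(rewards)
            weight_total += w * len(rewards)
        # weight_total > 0 because each prompt has weight 1 with itself
        values.append(weighted_reward_sum / weight_total)
    return values


def advantages(rewards_per_prompt, values):
    """Reward minus estimated value, used to weight the policy-gradient terms."""
    return [[r - v for r in rewards] for rewards, v in zip(rewards_per_prompt, values)]


if __name__ == "__main__":
    # Hypothetical toy data: three prompts, two sampled traces each.
    embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
    rewards_per_prompt = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    values = kernel_smoothed_values(embeddings, rewards_per_prompt, bandwidth=0.5)
    print(values)
    print(advantages(rewards_per_prompt, values))
```

In this sketch the bandwidth interpolates between two familiar extremes: as it shrinks toward zero, each estimate reduces to that prompt's own sample average, as in GRPO, and as it grows, every prompt shares a single global average baseline.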