We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transform associated with rejection-sampling algorithms is the logarithm, and that associated with the GRPO algorithm is the arcsine of the square root.
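The two transforms named above induce different weightings of the gradient of p(correct | prompt): by the chain rule, ascending f(p) scales the gradient of p by f'(p). A minimal numerical sketch (not from the paper; the weight formulas are standard calculus) checks the two derivatives, 1/p for the logarithm and 1/(2√(p(1−p))) for the arcsine of the square root:

```python
import math

# Sketch: the two monotone transforms named in the abstract and their
# derivatives, which act as per-prompt weights on the gradient of
# p = p(correct | prompt). Formulas here are plain calculus, not quoted
# from the paper.

def log_transform(p):
    return math.log(p)

def arcsin_sqrt_transform(p):
    return math.asin(math.sqrt(p))

def log_weight(p):
    # d/dp log(p) = 1/p: near-solved prompts (p close to 1) are down-weighted.
    return 1.0 / p

def arcsin_sqrt_weight(p):
    # d/dp arcsin(sqrt(p)) = 1/(2*sqrt(p*(1-p))): the weight grows at both
    # extremes, i.e. the inverse of the binary-reward standard deviation.
    return 1.0 / (2.0 * math.sqrt(p * (1.0 - p)))

def numeric_deriv(f, p, h=1e-6):
    # Central finite difference for a sanity check of the closed forms.
    return (f(p + h) - f(p - h)) / (2.0 * h)

for p in (0.1, 0.5, 0.9):
    assert abs(numeric_deriv(log_transform, p) - log_weight(p)) < 1e-4
    assert abs(numeric_deriv(arcsin_sqrt_transform, p) - arcsin_sqrt_weight(p)) < 1e-4
```

The comparison at p = 0.5 is illustrative: the log transform weights the gradient by 2, while the arcsin-square-root transform weights it by 1; as p → 0 or p → 1 the latter's weight diverges symmetrically, whereas the log's diverges only as p → 0.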