DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. Although this is a "perfect" reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are provided continuously throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps, and it still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm that ReDit significantly mitigates gradient issues. Moreover, theoretical analyses are provided to further validate these advantages.
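To illustrate the idea of reward dithering described above, the following is a minimal sketch rather than the authors' implementation. It assumes zero-mean Gaussian noise with a hypothetical standard deviation `sigma`, and a GRPO-style advantage computed by normalizing rewards within a group of sampled responses; the abstract does not specify the noise distribution or its scale, and the function names `dither_rewards` and `group_advantages` are illustrative only.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to discrete rewards.

    Assumption: Gaussian noise and sigma=0.05 are illustrative choices,
    not values taken from the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    return rewards + rng.normal(0.0, sigma, size=rewards.shape)


rng = np.random.default_rng(0)

# A "flat" group: every sampled response receives the same discrete reward.
flat_group = [1.0, 1.0, 1.0, 1.0]

# Without dithering, all advantages are zero, so the group contributes
# no gradient signal.
plain_adv = group_advantages(flat_group)

# With dithering, the rewards differ slightly, so the normalized
# advantages are non-zero.
dithered_adv = group_advantages(dither_rewards(flat_group, sigma=0.05, rng=rng))

print("plain:   ", plain_adv)
print("dithered:", dithered_adv)
```

In this sketch, a group whose responses all receive the same discrete reward yields zero advantages, whereas the dithered group yields non-zero advantages. This is one way to picture how perturbed rewards could keep gradients flowing in flat reward regions. Because the advantages are divided by the group standard deviation, small noise is amplified in a flat group, so the choice of `sigma` matters in practice.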