Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, but it remains susceptible to instability due to its use of hard clipping. Soft Adaptive Policy Optimization (SAPO) addresses this limitation by replacing the clipping operation with a smooth sigmoid-based gate function, yielding more stable updates. We push this line of work further and investigate how the choice of gate function affects both training stability and final model performance. We formalize the key properties that admissible gates should satisfy and identify several families of such functions for empirical evaluation. We analyze our findings from experiments with the Qwen2.5-7B-Instruct model on mathematical reasoning tasks. These results provide practical guidance for designing smoother and more robust policy optimization objectives for large language model training.
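To make the contrast concrete, below is a minimal sketch of a hard-clipped surrogate next to one illustrative smooth gate. The product-of-sigmoids gate, the temperature `tau`, and the function names are assumptions for illustration only, not the exact SAPO objective.

```python
import torch

def hard_clip_surrogate(ratio, adv, eps=0.2):
    """GRPO/PPO-style surrogate: the gradient vanishes abruptly
    once the importance ratio leaves [1 - eps, 1 + eps]."""
    return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

def soft_gate_surrogate(ratio, adv, eps=0.2, tau=20.0):
    """Hypothetical sigmoid-gated surrogate: the update weight decays
    smoothly as the ratio drifts outside the trust window, instead of
    being cut off at a hard boundary."""
    gate = (torch.sigmoid(tau * (ratio - (1 - eps)))
            * torch.sigmoid(tau * ((1 + eps) - ratio)))
    return gate * ratio * adv

# Example: importance ratios straddling the clip boundary.
ratio = torch.tensor([0.7, 0.95, 1.0, 1.05, 1.4])
adv = torch.ones_like(ratio)
print(hard_clip_surrogate(ratio, adv))  # flat (zero-gradient) beyond the clip range
print(soft_gate_surrogate(ratio, adv))  # tapers smoothly; gradients decay but never cut off
```

The gate peaks near a ratio of 1 and saturates toward 0 outside the window, which is the smoothness property the paper's admissible-gate criteria are meant to capture.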