This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy update method for large language model fine-tuning. PROMA accumulates policy gradients across microbatches by projecting out sequence-wise gradient components before microbatch aggregation. The projection is applied layer-wise during the backward pass, enabling an efficient implementation that requires no additional forward or backward passes. Empirically, PROMA enforces tighter control over local KL divergence than GRPO, yielding more stable policy learning. Unlike PPO and GRPO, PROMA achieves proximal updates without inducing entropy collapse, and it relies on neither a reference policy nor likelihood-ratio clipping.
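To make the accumulation step concrete, here is a minimal PyTorch sketch of "project out a sequence-wise component per layer, then aggregate the microbatch." It is illustrative only, not the note's implementation: the abstract does not specify the projection target, so this sketch assumes (as a labeled placeholder) that the component removed from each sequence's per-layer gradient is the direction of that sequence's own log-likelihood gradient. It also materializes per-sequence gradients with `torch.func`, whereas the note describes a layer-wise application inside a single backward pass. The names `seq_loss`, `seq_logp`, and `project_out` are hypothetical.

```python
# Sketch of PROMA-style microbatch accumulation. Assumptions (not in the
# abstract): the projected-out direction is each sequence's log-likelihood
# gradient; projection is standard orthogonal projection per parameter tensor.
import torch
from torch.func import functional_call, grad, vmap

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)           # stand-in for one transformer layer
params = dict(model.named_parameters())

def seq_loss(p, x, adv):
    # Toy per-sequence policy-gradient surrogate: -advantage * log-prob.
    logits = functional_call(model, p, (x,))
    logp = torch.log_softmax(logits, dim=-1)[..., 0].sum()
    return -adv * logp

def seq_logp(p, x):
    # Per-sequence log-likelihood; its gradient is the assumed projection
    # direction (a placeholder choice, not confirmed by the abstract).
    logits = functional_call(model, p, (x,))
    return torch.log_softmax(logits, dim=-1)[..., 0].sum()

def project_out(g, d, eps=1e-12):
    # Remove from g its component along direction d, per parameter tensor.
    coef = (g * d).sum() / (d * d).sum().clamp_min(eps)
    return g - coef * d

x = torch.randn(8, 16)                   # 8 sequences in the microbatch
adv = torch.randn(8)                     # per-sequence advantages

# Per-sequence gradients via the standard torch.func vmap(grad(...)) recipe;
# this materializes them explicitly, unlike a hook-based backward.
per_seq_grads = vmap(grad(seq_loss), in_dims=(None, 0, 0))(params, x, adv)
per_seq_dirs = vmap(grad(seq_logp), in_dims=(None, 0))(params, x)

# Project out the sequence-wise component, then aggregate the microbatch.
accumulated = {}
for name in params:
    g = vmap(project_out)(per_seq_grads[name], per_seq_dirs[name])
    accumulated[name] = g.mean(dim=0)
```

Note that the projection is applied independently to each parameter tensor before the microbatch mean, mirroring the abstract's layer-wise-then-aggregate ordering; an efficient implementation as described in the note would presumably fold this projection into backward hooks rather than storing per-sequence gradients.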