While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
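The advantage-normalization pathology described above can be illustrated with a minimal numerical sketch. This is not the paper's implementation, only an assumed setup: a group of four rollouts for one prompt, with illustrative `rewards`, binary indicator `costs`, and a Lagrange multiplier `lam`. When each component is standardized separately, the per-component standard deviations rescale the terms, so the trade-off set by `lam` is lost; scalarizing first preserves it up to a single common scale.

```python
import numpy as np

# Hypothetical group of G=4 rollouts for one prompt (values are
# illustrative, not from the paper): task rewards with low spread,
# binary constraint-violation indicator costs with high spread.
rewards = np.array([1.0, 0.9, 1.1, 1.0])
costs   = np.array([1.0, 0.0, 1.0, 0.0])
lam = 0.5  # assumed Lagrange multiplier value

def standardize(x, eps=1e-8):
    """Group-relative standardization, as in GRPO-style advantages."""
    return (x - x.mean()) / (x.std() + eps)

# Naive multi-component treatment: standardize each term separately,
# then combine. Dividing by mismatched per-component stds rescales the
# terms, so the effective weight on the cost is no longer lam.
naive_adv = standardize(rewards) - lam * standardize(costs)

# Scalarized construction: combine the Lagrangian first, then
# standardize once, so the reward/cost trade-off survives.
scalar_adv = standardize(rewards - lam * costs)

# With these numbers, the naive advantage ranks a constraint-violating
# rollout highest, while the scalarized advantage favors a
# non-violating one.
print(int(np.argmax(naive_adv)), int(np.argmax(scalar_adv)))
```

Here the reward spread (std ≈ 0.07) is far smaller than the cost spread (std = 0.5), so component-wise normalization inflates the reward term roughly 14× relative to the cost term, which is the distortion of the Lagrangian signal the abstract refers to.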