Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO's critic-free architecture makes on-device training feasible, yet moving to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training with an online adaptive weighting mechanism based on hypergradient descent, which shifts optimization toward the primary reasoning objective as auxiliary objectives saturate. On the server side, a task- and accuracy-aware aggregation strategy prioritizes high-quality client updates. Experiments on mathematical reasoning and code generation benchmarks demonstrate that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.
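To make the adaptive-weighting idea concrete, the snippet below is a minimal, illustrative sketch of hypergradient-based weighting of auxiliary objectives, not FedMOA's actual implementation. The function name `update_objective_weights`, the scalarized loss L = L_primary + sum_i w_i * L_aux_i, and the hyperparameters `hyper_lr`, `w_min`, `w_max` are assumptions introduced here for illustration only.

```python
import numpy as np

def update_objective_weights(weights, grad_primary, grads_aux,
                             hyper_lr=0.01, w_min=0.0, w_max=1.0):
    """One hypergradient-style update of the auxiliary-objective weights.

    Illustrative assumption: the local loss is
        L = L_primary + sum_i w_i * L_aux_i.
    Treating each w_i as a hyperparameter, one SGD step on the model gives
        d L_primary / d w_i  ~  -lr * <grad_primary, grad_aux_i>,
    so descending this hypergradient raises w_i when an auxiliary gradient
    aligns with the primary one and lowers it when they conflict. As an
    auxiliary objective saturates, its gradient (and hence its influence)
    shrinks, and optimization drifts back toward the primary reasoning
    objective.
    """
    new_weights = []
    for w, g_aux in zip(weights, grads_aux):
        hypergrad = -float(np.dot(grad_primary, g_aux))  # approx. d L_primary / d w_i
        w = w - hyper_lr * hypergrad                      # hypergradient descent step
        new_weights.append(float(np.clip(w, w_min, w_max)))
    return new_weights

# Toy usage: one auxiliary objective aligned with the primary reasoning
# gradient, one conflicting with it.
g_primary = np.array([1.0, 0.0])
g_aux = [np.array([0.8, 0.1]), np.array([-0.5, 0.2])]
print(update_objective_weights([0.5, 0.5], g_primary, g_aux))
# -> the aligned objective's weight rises, the conflicting one's falls
```

The inner-product form is one standard way to approximate the hypergradient after a single inner optimization step; a full implementation would plug the resulting weights back into the GRPO advantage/loss computation on each client.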