We study a Federated Reinforcement Learning (FedRL) problem with constraint heterogeneity. In this setting, the goal is to solve a reinforcement learning problem with multiple constraints. $N$ training agents are located in $N$ different environments, each with limited access to the constraint signals, and they are expected to collaboratively learn a policy that satisfies all constraints. Such learning problems are prevalent in Large Language Model (LLM) fine-tuning and healthcare applications. To solve the problem, we propose federated primal-dual policy optimization methods based on traditional policy gradient methods. Specifically, we introduce $N$ local Lagrange functions with which the agents perform local policy updates, and the agents are scheduled to periodically communicate their local policies. Taking natural policy gradient (NPG) and proximal policy optimization (PPO) as the policy optimization methods, we focus on two instances of our algorithms, i.e., FedNPG and FedPPO. We show that FedNPG achieves global convergence at an $\tilde{O}(1/\sqrt{T})$ rate, and that FedPPO efficiently solves complicated learning tasks with the use of deep neural networks.
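
To make the structure of local Lagrangian updates with periodic communication concrete, the following is a minimal sketch on a toy single-state problem (a bandit), not the FedNPG or FedPPO algorithms themselves. Everything in it is an assumption made for illustration: the function names (`local_step`, `federated_primal_dual`), the softmax policy over logits, the form of the local Lagrangian (shared reward plus a multiplier times one agent-specific utility), the projected dual step, and the choice to communicate by averaging logits across agents.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected(pi, values):
    """Expectation of per-action values under policy pi."""
    return sum(p * v for p, v in zip(pi, values))


def local_step(theta, lam, reward, utility, threshold, lr_theta, lr_lam):
    """One primal-dual step on a single agent's local Lagrangian (assumed form).

    The agent sees the shared reward and only its own constraint
    E_pi[utility] >= threshold.
    """
    pi = softmax(theta)
    # Per-action value of the local Lagrangian: reward + lam * utility.
    q = [r + lam * g for r, g in zip(reward, utility)]
    baseline = expected(pi, q)
    # Primal step: for a softmax policy in a bandit, an NPG-style update
    # adds the step size times the advantage to each logit.
    new_theta = [t + lr_theta * (qa - baseline) for t, qa in zip(theta, q)]
    # Dual step: raise the multiplier when the constraint is violated,
    # then project back onto lam >= 0.
    violation = threshold - expected(pi, utility)
    new_lam = max(0.0, lam + lr_lam * violation)
    return new_theta, new_lam


def federated_primal_dual(reward, utilities, thresholds, rounds=200,
                          local_steps=5, lr_theta=0.5, lr_lam=0.5):
    """Run local updates on each agent, then average logits every round."""
    n_agents = len(utilities)
    global_theta = [0.0] * len(reward)
    lams = [0.0] * n_agents  # each multiplier stays local to its agent
    for _ in range(rounds):
        local_thetas = []
        for i in range(n_agents):
            theta = list(global_theta)  # start from the shared policy
            for _ in range(local_steps):
                theta, lams[i] = local_step(
                    theta, lams[i], reward, utilities[i], thresholds[i],
                    lr_theta, lr_lam)
            local_thetas.append(theta)
        # Communication round: average the agents' logits (an assumption).
        global_theta = [sum(col) / n_agents for col in zip(*local_thetas)]
    return softmax(global_theta), lams


if __name__ == "__main__":
    # Hypothetical toy data: 3 actions, 2 agents with one constraint each.
    policy, multipliers = federated_primal_dual(
        reward=[1.0, 0.0, 0.2],
        utilities=[[0.0, 1.0, 0.5], [0.2, 0.0, 1.0]],
        thresholds=[0.3, 0.3])
    print(policy, multipliers)
```

In this sketch each agent only ever evaluates its own constraint, so the shared policy can satisfy all constraints only through the communication step. This mirrors the constraint heterogeneity described above, but the toy carries no convergence guarantee.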