We address offline reinforcement learning with privacy guarantees, where the goal is to train a policy that is differentially private with respect to individual trajectories in the dataset. To achieve this, we introduce DP-MORL, a model-based reinforcement learning (MBRL) algorithm with differential privacy guarantees. A private model of the environment is first learned from offline data using DP-FedAvg, a training method for neural networks that provides differential privacy guarantees at the trajectory level. Then, we use model-based policy optimization to derive a policy from the (penalized) private model, without any further interaction with the system or access to the input data. We empirically show that DP-MORL enables the training of private RL agents from offline data, and we further outline the price of privacy in this setting.
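As a rough illustration of what trajectory-level private training involves, the sketch below shows one aggregation round in the spirit of DP-FedAvg: each trajectory contributes one model update, that update is clipped to a fixed L2 norm, and Gaussian noise scaled to the clipping norm is added before averaging. This is a simplified sketch, not the authors' implementation. The function names (`clip_update`, `dp_round`, `linear_model_update`), the linear next-state model, and the hyperparameters are all hypothetical. It also assumes every trajectory participates in each round and that the dataset size is public, and it omits the privacy accounting needed to state a formal guarantee.

```python
import numpy as np


def clip_update(delta, clip_norm):
    """Scale delta so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, clip_norm / (norm + 1e-12))


def linear_model_update(params, traj, lr=0.1):
    """Hypothetical local step: one gradient step for a linear next-state model.

    traj is a pair (inputs, targets) with shapes (T, d_in) and (T, d_out),
    where inputs stack state-action features and targets are next states.
    """
    inputs, targets = traj
    W = params.reshape(inputs.shape[1], targets.shape[1])
    # Gradient of the mean squared error over this trajectory's transitions.
    grad = 2.0 * inputs.T @ (inputs @ W - targets) / len(inputs)
    return (-lr * grad).ravel()


def dp_round(params, trajectories, local_update, clip_norm, noise_multiplier, rng):
    """One noisy aggregation round with each trajectory as the privacy unit."""
    # Clipping bounds how much any single trajectory can move the model.
    clipped = [
        clip_update(local_update(params, traj), clip_norm)
        for traj in trajectories
    ]
    total = np.sum(clipped, axis=0)
    # Gaussian noise calibrated to the per-trajectory sensitivity (clip_norm).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    return params + (total + noise) / len(trajectories)
```

Calling `dp_round` repeatedly on the offline trajectories would yield a noisy dynamics model; the policy is then optimized against that model only, which is why no further access to the raw data is needed at that stage.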