In this paper, we study (finite-horizon tabular) Markov decision processes (MDPs) with heavy-tailed rewards under differential privacy (DP) constraints. Previous work on private reinforcement learning (RL) typically assumes that rewards are drawn from bounded or sub-Gaussian distributions in order to ensure DP; in contrast, we consider the setting where the reward distributions have only finite $(1+v)$-th moments for some $v \in (0,1]$. Using robust mean estimators for the rewards, we first propose two frameworks for heavy-tailed MDPs, one for value iteration and the other for policy optimization. Under each framework, we consider both the joint differential privacy (JDP) and local differential privacy (LDP) models. Based on these frameworks, we provide regret upper bounds in both the JDP and LDP cases and show that the moment order of the reward distribution and the privacy budget both have a significant impact on the regret. Finally, we establish a regret lower bound for heavy-tailed MDPs in the JDP model by a reduction to the instance-independent lower bound for heavy-tailed multi-armed bandits in the DP model. We also give a lower bound for the problem in the LDP model using private minimax methods. Our results reveal fundamental differences between private RL with sub-Gaussian rewards and private RL with heavy-tailed rewards.
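As a rough illustration of what a private robust mean estimator for heavy-tailed rewards can look like, the Python sketch below combines truncation with the Laplace mechanism. This is one standard construction and is not claimed to be the estimator used in this paper; the function name `private_truncated_mean`, the truncation threshold, and the privacy parameter values are all hypothetical choices made for the example.

```python
import random


def private_truncated_mean(rewards, threshold, epsilon, rng):
    """Truncated empirical mean plus Laplace noise (illustrative sketch only).

    rewards   : list of observed rewards (possibly heavy-tailed)
    threshold : truncation level B > 0 (assumed to be given; how to tune it
                is outside the scope of this sketch)
    epsilon   : privacy parameter of the Laplace mechanism, epsilon > 0
    rng       : a random.Random instance used to draw the noise
    """
    n = len(rewards)
    if n == 0 or threshold <= 0 or epsilon <= 0:
        raise ValueError("need n > 0, threshold > 0 and epsilon > 0")

    # Discard (set to zero) every sample whose magnitude exceeds the threshold.
    truncated_sum = sum(r for r in rewards if abs(r) <= threshold)

    # Each truncated sample lies in [-B, B], so replacing one sample changes
    # the truncated mean by at most 2B / n.
    sensitivity = 2.0 * threshold / n
    scale = sensitivity / epsilon

    # The difference of two i.i.d. exponentials with mean `scale` is a
    # Laplace(0, scale) random variable.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return truncated_sum / n + noise


# Usage on synthetic heavy-tailed data: a Pareto distribution with shape 1.5
# has a finite mean but infinite variance.
rng = random.Random(0)
samples = [rng.paretovariate(1.5) for _ in range(1000)]
estimate = private_truncated_mean(samples, threshold=50.0, epsilon=1.0, rng=rng)
```

Truncation bounds the influence of any single reward, which both controls the heavy tail and gives the finite sensitivity that the Laplace mechanism needs. The price is a bias that depends on the threshold and on the moment order, which is consistent with the abstract's statement that both the moment and the privacy budget affect the regret.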