Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
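To see where the threshold on $\epsilon$ comes from, compare the two terms of the regret bound. This is a back-of-the-envelope check that ignores the constants and logarithmic factors hidden by $\widetilde{O}$. The privacy-dependent term is no larger than the non-private term exactly when
\[
\frac{S^2AH^3}{\epsilon} \le \sqrt{SAH^2T}
\quad\Longleftrightarrow\quad
\epsilon \ge \frac{S^2AH^3}{S^{0.5}A^{0.5}H\sqrt{T}} = \frac{S^{1.5}A^{0.5}H^2}{\sqrt{T}}.
\]
For fixed $S$, $A$, and $H$, this threshold vanishes as $T\rightarrow\infty$, so any fixed $\epsilon>0$ eventually falls in the regime where the bound reduces to $\widetilde{O}(\sqrt{SAH^2T})$.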