As reinforcement learning techniques are increasingly applied to real-world decision problems, attention has turned to how these algorithms use potentially sensitive information. We consider the task of training a policy that maximizes reward while minimizing the disclosure of certain sensitive state variables through its actions. We give examples of how this setting covers real-world privacy problems in sequential decision-making. We solve this problem in the policy-gradient framework by introducing a regularizer based on the mutual information (MI) between the sensitive state and the actions. We develop a model-based stochastic gradient estimator for optimizing privacy-constrained policies. We also discuss an alternative MI regularizer that upper-bounds our main MI regularizer and can be optimized in a model-free setting, as well as a powerful direct estimator that can be used in environments with differentiable dynamics. We contrast previous work on differentially private RL with our mutual-information formulation of information disclosure. Experimental results show that our training method yields policies that hide the sensitive state, even in challenging high-dimensional tasks.
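The MI-regularized objective described above can be sketched as follows. The notation is assumed for illustration and is not taken from the text: $u$ denotes the sensitive state variables, $a_t$ the action at time $t$, $\pi_\theta$ the policy, $\lambda \ge 0$ a trade-off weight, and the per-step sum over $I(a_t; u)$ is one plausible way to write "the MI between the sensitive state and the actions".

```latex
\max_{\theta} \;
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=1}^{T} r(s_t, a_t) \right]
\;-\; \lambda \sum_{t=1}^{T} I(a_t; u),
\qquad
I(a_t; u) \;=\; \mathbb{E}_{p(a_t, u)}\!\left[ \log \frac{p(a_t \mid u)}{p(a_t)} \right].
```

Under this sketch, setting $\lambda = 0$ recovers the ordinary policy-gradient objective, while larger $\lambda$ trades reward for less disclosure; $I(a_t; u) = 0$ exactly when the action is statistically independent of $u$.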