We introduce the framework of performative reinforcement learning, in which the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction~\cite{Perdomo et al., 2020}, we introduce the concept of a performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof uses the dual perspective of the reinforcement learning problem and may be of independent interest for analyzing the convergence of other algorithms in decision-dependent environments. We then extend our results to the setting where the learner performs only gradient ascent steps instead of fully optimizing the objective, and to the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate how convergence depends on various parameters, e.g., regularization, smoothness, and the number of samples.