We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams originating from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \: \mathbf{KL}\left(\mathbf{\theta}^{(\ell+1)} \,\|\, \mathbf{\theta}^{(\ell)}\right)}}\right)$, where $D$ is the largest diameter among the MDPs that define the piecewise stationary setting, $O$ is the finite number of states (constant across all changes), $A$ is the finite number of actions (constant across all changes), $K_T$ is the number of change points up to horizon $T$, and $\mathbf{\theta}^{(\ell)}$ is the transition kernel during the interval $[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over the set of states $\mathbb{O}$. Interestingly, the performance bound does not scale directly with the magnitude of variation in the MDP state transition distributions and rewards, i.e., the setting also covers abrupt changes. Empirically, R-BOCPD-UCRL2 outperforms state-of-the-art methods in a variety of scenarios in synthetic environments. We provide a detailed experimental setup along with a code repository (to be released upon publication) that can be used to easily reproduce our experiments.
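To make the shape of the regret bound concrete, the following minimal Python sketch evaluates the expression inside $O(\cdot)$ with all hidden constants set to 1. It is an illustration only, not the authors' code: the function names `kl_categorical` and `regret_bound_scale` are hypothetical, and each segment's transition kernel $\mathbf{\theta}^{(\ell)}$ is simplified to a single probability vector over states rather than one vector per state-action pair.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two probability vectors over the same finite set.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regret_bound_scale(D, O, A, T, delta, kernels):
    """Evaluate D * O * sqrt(A*T*K_T*log(T/delta) + K_T*log(K_T/delta) / min_KL).

    Hypothetical helper: hidden constants of the O(.) bound are set to 1.
    `kernels` lists theta^(0), ..., theta^(K_T), one probability vector per
    stationary segment (a simplification made for this sketch), so there are
    K_T = len(kernels) - 1 change points. Requires K_T >= 1 and consecutive
    kernels that actually differ (otherwise min_KL is 0).
    """
    K_T = len(kernels) - 1
    # Smallest divergence between consecutive segments: KL(theta^(l+1) || theta^(l)).
    min_kl = min(kl_categorical(kernels[l + 1], kernels[l]) for l in range(K_T))
    inside = A * T * K_T * math.log(T / delta) + K_T * math.log(K_T / delta) / min_kl
    return D * O * math.sqrt(inside)


if __name__ == "__main__":
    # One change point; the second scenario has a smaller shift that is harder to detect.
    large_shift = [[0.5, 0.5], [0.9, 0.1]]
    small_shift = [[0.5, 0.5], [0.6, 0.4]]
    for name, kernels in [("large shift", large_shift), ("small shift", small_shift)]:
        value = regret_bound_scale(D=4, O=2, A=3, T=10_000, delta=0.05, kernels=kernels)
        print(name, value)
```

In this sketch the magnitude of the change enters only through the smallest KL divergence between consecutive kernels, in the detection-delay term: a smaller shift gives a larger value, while the leading $\sqrt{A T K_T \log(T/\delta)}$ term is unaffected by how large each change is.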