We study learning in periodic Markov Decision Processes (MDPs), a special type of non-stationary MDP in which both the state transition probabilities and the reward functions vary periodically, under the average-reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 varies linearly with the period $N$ and as $\mathcal{O}(\sqrt{T\log T})$ with the horizon length $T$. Exploiting the sparsity of the transition matrix of the augmented MDP, we propose another algorithm, PUCRLB, which improves upon PUCRL2 both in regret ($\mathcal{O}(\sqrt{N})$ dependence on the period) and in empirical performance. Finally, we propose two further algorithms, U-PUCRL2 and U-PUCRLB, for the setting with extended uncertainty in which the period is unknown but a set of candidate periods is known. Numerical results demonstrate the efficacy of all the algorithms.
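The state augmentation described above can be illustrated with a minimal sketch. It assumes a tabular periodic MDP given as arrays `P` of shape `(N, S, A, S)` and `R` of shape `(N, S, A)`; these array layouts, the function name `augment_periodic_mdp`, and the indexing of the augmented state $(s, n)$ as `n * S + s` are illustrative assumptions, not the paper's notation or implementation. The sketch only builds the augmented stationary MDP and makes its block sparsity visible; it does not implement PUCRL2 or PUCRLB.

```python
import numpy as np


def augment_periodic_mdp(P, R):
    """Build a stationary MDP over augmented states (s, n).

    Assumed (hypothetical) layout:
      P[n, s, a, s2] = probability of moving s -> s2 under action a at period index n
      R[n, s, a]     = mean reward for (s, a) at period index n
    The augmented state (s, n) is stored at flat index n * S + s.
    """
    N, S, A, _ = P.shape
    P_aug = np.zeros((N * S, A, N * S))
    R_aug = np.zeros((N * S, A))
    for n in range(N):
        nxt = (n + 1) % N  # the period index advances deterministically
        # From period n, all probability mass lands in the block for period n + 1.
        P_aug[n * S:(n + 1) * S, :, nxt * S:(nxt + 1) * S] = P[n]
        R_aug[n * S:(n + 1) * S, :] = R[n]
    return P_aug, R_aug


# Small random instance (illustrative values only).
rng = np.random.default_rng(0)
N, S, A = 3, 4, 2
P = rng.dirichlet(np.ones(S), size=(N, S, A))  # shape (N, S, A, S)
R = rng.random((N, S, A))
P_aug, R_aug = augment_periodic_mdp(P, R)

# Each augmented row is a valid distribution with at most S nonzero entries
# out of N * S, which is the sparsity the abstract refers to.
print(P_aug.shape, np.count_nonzero(P_aug) / P_aug.size)
```

Because the period index evolves deterministically, each row of the augmented transition matrix has support on at most $S$ of the $NS$ augmented states, so confidence sets that ignore this structure are looser than necessary.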