In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs, where the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the \emph{diameter} $D$ of the MDP is $\Omega(S^S)$, where $S$ is the number of states. Therefore, the existing lower and upper bounds on the regret at time $T$, of order $O(\sqrt{DSAT})$ for MDPs with $S$ states and $A$ actions, may suggest that reinforcement learning is inefficient here. In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm {\sc Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$, where $E_2$ is related to the weighted second moment of the stationary measure of a reference policy. Importantly, $E_2$ is bounded independently of $S$. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits that the learning algorithm makes to the states of the MDP, which is highly non-uniform.
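To build intuition for why a moment of the stationary measure can remain bounded as $S$ grows, the following minimal Python sketch computes the stationary measure of a birth and death chain by detailed balance and evaluates its plain second moment for increasing $S$. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the constant arrival rate, the death rate `service + s * impatience` used to model impatient jobs, the numerical values, and the use of the unweighted second moment as a stand-in for $E_2$, whose exact definition is not given in this passage.

```python
def stationary_measure(num_states, arrival=1.0, service=1.0, impatience=0.5):
    """Stationary measure of a birth-death chain on {0, ..., num_states - 1}.

    Hypothetical model: births occur at rate `arrival`; deaths in state s
    occur at rate `service + s * impatience` (service plus abandonment).
    """
    # Detailed balance: w(s) = w(s - 1) * birth_rate / death_rate(s).
    weights = [1.0]
    for s in range(1, num_states):
        death = service + s * impatience
        weights.append(weights[-1] * arrival / death)
    total = sum(weights)
    return [w / total for w in weights]


def second_moment(pi):
    """Plain second moment sum_s s^2 * pi(s), an illustrative stand-in for E_2."""
    return sum(s * s * p for s, p in enumerate(pi))


if __name__ == "__main__":
    # The second moment stabilizes quickly as the number of states grows.
    for num_states in (5, 10, 20, 40, 80):
        pi = stationary_measure(num_states)
        print(num_states, second_moment(pi))
```

Under these assumed rates, the stationary mass decays faster than geometrically in the state index, so high states contribute almost nothing to the moment. This is also consistent with the highly non-uniform visit counts mentioned above.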