The expected regret of any reinforcement learning algorithm is lower-bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ is the size of the state space, $A$ is the size of the action space, and $T$ is the number of time steps. However, this lower bound holds over all MDPs: a smaller regret can be obtained by exploiting specific knowledge of the problem structure. In this article, we consider the problem of admission control to an $M/M/c/S$ queue with $m$ job classes and class-dependent rewards and holding costs. Queueing systems often have a diameter that is exponential in the buffer size $S$, which makes the general lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and exploit the structure of the problem to upper-bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite-server case. In the infinite-server case, we prove that the dependence of the regret on $S$ disappears.
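To make the model concrete, the following is a minimal simulation sketch of the admission-control MDP described above, obtained by uniformizing the $M/M/c/S$ queue and evaluating a per-class threshold policy (the structural form such problems typically admit). All parameter values, names, and the policy itself are illustrative assumptions for exposition, not the paper's algorithm or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative parameters (assumptions, not taken from the paper) ---
lam = np.array([1.0, 0.7, 0.4])   # arrival rates of the m = 3 job classes
R   = np.array([5.0, 3.0, 1.0])   # admission reward per class
c, S, mu = 2, 10, 1.0             # servers, buffer size, per-server service rate
gamma = 0.5                       # holding cost per job per unit time
Lam = lam.sum() + c * mu          # uniformization constant

def step(x, thresholds):
    """One uniformized transition from state x (= number of jobs in system).

    A threshold policy admits a class-i arrival iff x < thresholds[i].
    Returns (next state, reward collected on this transition).
    """
    u = rng.uniform(0.0, Lam)
    reward = -gamma * x / Lam                # holding cost over mean step length 1/Lam
    if u < lam.sum():                        # event: arrival of some class
        i = np.searchsorted(np.cumsum(lam), u, side="right")
        if x < min(S, thresholds[i]):        # admit only below the class threshold
            return x + 1, reward + R[i]
        return x, reward                     # reject
    if u < lam.sum() + min(x, c) * mu:       # event: service completion
        return x - 1, reward
    return x, reward                         # fictitious self-loop (uniformization)

# Evaluate a fixed threshold policy by simulation;
# more valuable classes get larger (more permissive) thresholds.
thresholds = np.array([10, 6, 3])
x, total, T = 0, 0.0, 100_000
for _ in range(T):
    x, r = step(x, thresholds)
    total += r
print(f"average reward per uniformized step: {total / T:.4f}")
```

A learning algorithm in the spirit of UCRL2 would wrap such a simulator in optimistic episodes, maintaining confidence intervals on the unknown rates and replaying the policy that is optimal for the most favorable model in the confidence set; the regret bound quoted above comes from exploiting the threshold structure rather than treating the queue as a generic MDP.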