We present an efficient reinforcement learning algorithm that learns the optimal admission control policy in a partially observable queueing network. Specifically, only the arrival times to and departure times from the network are observable, and optimality is measured by the average holding/rejection cost over an infinite horizon. While reinforcement learning in Partially Observable Markov Decision Processes (POMDPs) is prohibitively expensive in general, we show that the regret of our algorithm depends only sub-linearly on the maximal number of jobs in the network, $S$. In particular, in contrast with existing regret analyses, our regret bound does not depend on the diameter of the underlying Markov Decision Process (MDP), which in most queueing systems is at least exponential in $S$. The novelty of our approach is to combine Norton's equivalent theorem for closed product-form queueing networks with an efficient reinforcement learning algorithm for MDPs that have the structure of birth-and-death processes.
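To make the birth-and-death structure concrete, the sketch below evaluates the average holding/rejection cost of a threshold admission policy on a single queue with a state-dependent service rate, which is the kind of reduced model that Norton's equivalent yields. It is an illustration under stated assumptions, not the learning algorithm of the paper: the rates are taken as known, arrivals are assumed Poisson, the search is restricted to threshold policies, and all names (`stationary_distribution`, `average_cost`, `best_threshold`, `lam`, `mu`) are hypothetical.

```python
def stationary_distribution(lam, mu, threshold):
    """Stationary law of a birth-and-death chain on {0, ..., threshold}.

    lam: arrival rate (assumed Poisson); mu(n): service rate with n jobs present.
    Jobs are admitted only while fewer than `threshold` jobs are in the system.
    """
    weights = [1.0]
    for n in range(1, threshold + 1):
        # Detailed balance: pi(n) * mu(n) = pi(n - 1) * lam
        weights.append(weights[-1] * lam / mu(n))
    total = sum(weights)
    return [w / total for w in weights]


def average_cost(lam, mu, threshold, holding_cost, rejection_cost):
    """Long-run average cost per unit time of a threshold policy."""
    pi = stationary_distribution(lam, mu, threshold)
    mean_jobs = sum(n * p for n, p in enumerate(pi))
    # With Poisson arrivals, jobs are rejected at rate lam * pi(threshold).
    rejection_rate = lam * pi[threshold]
    return holding_cost * mean_jobs + rejection_cost * rejection_rate


def best_threshold(lam, mu, S, holding_cost, rejection_cost):
    """Enumerate thresholds 0..S and return (best threshold, its average cost)."""
    costs = [average_cost(lam, mu, t, holding_cost, rejection_cost)
             for t in range(S + 1)]
    t_star = min(range(S + 1), key=costs.__getitem__)
    return t_star, costs[t_star]
```

For instance, `best_threshold(1.0, lambda n: 1.0, 2, 1.0, 2.0)` compares the thresholds 0, 1 and 2 for a unit-rate queue. In the learning setting of the abstract the rates `lam` and `mu(n)` are unknown and must be estimated from observed arrivals and departures, so this evaluation step covers only the planning part of the problem.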