We consider the problem of efficiently routing jobs that arrive at a central queue to a system of heterogeneous servers. Unlike in homogeneous systems, for the one-fast-one-slow two-server system a threshold policy, which routes jobs to the slow server only when the queue length exceeds a certain threshold, is known to be optimal. However, an optimal policy for the general multi-server system is unknown and non-trivial to find. While Reinforcement Learning (RL) has been recognized as a promising approach for learning policies in such settings, our problem has an exponentially large state space, rendering standard RL inefficient. In this work, we propose ACHQ, an efficient policy-gradient-based algorithm with a low-dimensional soft threshold policy parameterization that leverages the underlying queueing structure. We provide stationary-point convergence guarantees for the general case and, despite the low-dimensional parameterization, prove that ACHQ converges to an approximate global optimum for the special case of two servers. Simulations demonstrate an improvement in expected response time of up to ~30% over the greedy policy that routes to the fastest available server.
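To make the soft threshold idea concrete, the following is a minimal sketch of one possible sigmoid-based soft threshold routing rule, assuming per-server thresholds and an inverse-temperature parameter `beta`; the function name and this specific parameterization are illustrative assumptions, not necessarily the exact form used by ACHQ.

```python
import numpy as np

def soft_threshold_routing_probs(q, thresholds, beta=2.0):
    """Illustrative soft threshold routing policy (a sketch, not ACHQ's exact form).

    q: current central queue length.
    thresholds: per-server thresholds theta_i. A fast server gets a low
        threshold so it is always favored; a slow server only attracts
        significant probability mass once q exceeds its threshold.
    beta: inverse temperature; larger values make the soft policy approach
        a hard threshold rule.
    """
    # Sigmoid "activation" per server: close to 1 once q > theta_i.
    logits = beta * (q - np.asarray(thresholds, dtype=float))
    weights = 1.0 / (1.0 + np.exp(-logits))
    # Normalize the activations into a routing distribution over servers.
    return weights / weights.sum()

# Example: one fast server (threshold 0) and two slow servers (thresholds 5, 10).
# As q grows past each threshold, the corresponding slow server gains probability.
for q in (1, 6, 12):
    print(q, soft_threshold_routing_probs(q, thresholds=[0.0, 5.0, 10.0]))
```

Because such a parameterization uses only one scalar per server, a policy gradient method over it operates in a space whose dimension grows with the number of servers rather than with the exponentially large queue state space.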