Motivated by the wide range of modern applications of the Erlang-B blocking model beyond communication networks and call centers, including sizing and pricing in design production systems, messaging systems, and app-based parking systems, we study admission control for such a system when the arrival and service rates are unknown. In our model, at every job arrival, a dispatcher decides whether to assign the job to an available server or to block it. Every served job yields a fixed reward for the dispatcher but also incurs a cost per unit time of service. Our goal is to design a dispatching policy that maximizes the dispatcher's long-term average reward while observing only the arrival times and the state of the system at each arrival, which reflects a realistic sampling of such systems. Critically, the dispatcher observes neither the service times nor the departure times, so standard reinforcement learning approaches that rely on reward signals do not apply. Hence, we formulate our learning-based dispatch scheme as a parametric learning problem à la self-tuning adaptive control. In our problem, certainty equivalent control switches between an always-admit-if-room policy (explore infinitely often) and a never-admit policy (immediately terminate learning), a feature that is distinct from the adaptive control literature. Our learning scheme therefore judiciously uses the always-admit-if-room policy so that learning does not stall. We prove that, for all service rates, the proposed policy asymptotically learns to take the optimal action, and we present finite-time regret guarantees. The extreme contrast between the two certainty equivalent optimal control policies leads to difficulties in learning that show up in our regret bounds for different parameter regimes: constant regret in one regime versus logarithmically growing regret in the other.
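To make the contrast between the two certainty equivalent policies concrete, the following is a minimal sketch, not the scheme developed in this work. It assumes Poisson arrivals at rate `lam`, a mean service time of `1 / mu`, `k` servers, a reward `R` per served job, and a cost `c` per unit time of service; all function and parameter names are hypothetical. Under these assumptions an admitted job earns `R - c / mu` in expectation, so with known rates the better of the two policies is always-admit-if-room when `R * mu > c` and never-admit otherwise.

```python
def erlang_b(k: int, offered_load: float) -> float:
    """Erlang-B blocking probability for k servers and offered load lam / mu.

    Uses the standard recursion B(0) = 1, B(j) = a * B(j-1) / (j + a * B(j-1)).
    """
    b = 1.0
    for j in range(1, k + 1):
        b = offered_load * b / (j + offered_load * b)
    return b


def always_admit_reward_rate(lam: float, mu: float, k: int, R: float, c: float) -> float:
    """Long-run average reward per unit time under always-admit-if-room.

    Jobs are accepted at rate lam * (1 - B); each accepted job earns R and
    costs c per unit time over an expected service time of 1 / mu.
    """
    accepted_rate = lam * (1.0 - erlang_b(k, lam / mu))
    return accepted_rate * (R - c / mu)


def known_rate_policy(mu: float, R: float, c: float) -> str:
    """Pick between the two policies when mu is known (illustrative only).

    Never-admit earns 0, so always-admit-if-room is preferred exactly when
    the expected net reward per job, R - c / mu, is positive.
    """
    return "always-admit-if-room" if R * mu > c else "never-admit"


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    print(erlang_b(2, 1.0))
    print(always_admit_reward_rate(lam=1.0, mu=1.0, k=1, R=2.0, c=1.0))
    print(known_rate_policy(mu=0.4, R=2.0, c=1.0))
```

Replacing `mu` with an estimate in `known_rate_policy` gives a naive certainty equivalent rule. Once that rule selects never-admit, no further jobs are served and the estimate cannot improve, which is the stalling behavior the abstract refers to.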