We consider an admission control problem in an M/M/1 queuing system with unknown service and arrival rates, where the objective is to maximize the long-term average profit. A fixed reward is collected upon each service completion, and a cost per unit of time is charged for customers waiting in the queue. Upon each arrival, a dispatcher decides whether to admit the arriving customer, based on the full history of observations of the queue length of the system. Naor (1969, Econometrica) showed that if all the parameters of the model are known, then it is optimal to use a static threshold policy: admit if the queue length is less than a predetermined threshold, and reject otherwise. We propose a learning-based dispatching algorithm and characterize its regret with respect to the optimal dispatch policies for the full-information model of Naor (1969). We show that the algorithm achieves $O(1)$ regret when all optimal thresholds with full information are non-zero, and achieves $O(\ln^{1+\epsilon}(N))$ regret for any specified $\epsilon>0$ when an optimal threshold with full information is $0$ (i.e., an optimal policy is to reject all arrivals), where $N$ is the number of arrivals.
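To make the full-information benchmark concrete, the following is a minimal sketch, not the learning algorithm proposed here, of how a static threshold could be found by brute force when the rates are known. It assumes the holding cost is charged on every customer in the system (including the one in service) and uses the stationary distribution of the M/M/1/K queue. The names `average_profit`, `best_thresholds`, and `max_threshold` are illustrative and do not come from the source.

```python
def average_profit(threshold, lam, mu, reward, cost):
    """Long-run average profit per unit time when an arrival is admitted
    only if fewer than `threshold` customers are present.

    Assumption: `cost` is charged per unit time for each customer in the
    system, including the one in service.
    """
    if threshold == 0:
        return 0.0  # reject all arrivals: no reward and no holding cost
    rho = lam / mu
    # Stationary distribution of the M/M/1/K queue: pi_n proportional to rho^n.
    weights = [rho ** n for n in range(threshold + 1)]
    total = sum(weights)
    pi = [w / total for w in weights]
    # Arrivals are blocked only when the system holds `threshold` customers.
    throughput = lam * (1.0 - pi[threshold])
    mean_in_system = sum(n * p for n, p in enumerate(pi))
    return reward * throughput - cost * mean_in_system


def best_thresholds(lam, mu, reward, cost, max_threshold=200):
    """Return (list of maximizing thresholds, best profit) over
    0..max_threshold. The search cap is an assumption of this sketch."""
    profits = [average_profit(k, lam, mu, reward, cost)
               for k in range(max_threshold + 1)]
    best = max(profits)
    return [k for k, p in enumerate(profits) if abs(p - best) <= 1e-12], best


if __name__ == "__main__":
    # A reward that is large relative to the cost gives a non-zero threshold.
    print(best_thresholds(lam=1.0, mu=2.0, reward=10.0, cost=1.0))
    # A reward that is small relative to the cost makes rejecting all optimal.
    print(best_thresholds(lam=1.0, mu=2.0, reward=0.1, cost=1.0))
```

The function returns a list because more than one threshold can be optimal, which is why the regret bounds above distinguish the case where all optimal thresholds are non-zero from the case where some optimal threshold is $0$.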