Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.
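The three-phase structure described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the phase boundaries `t_init` and `t_cutoff`, the exploration probability `eta`, and the precomputed `ucb_scores` (one index per job currently in the queue) are all hypothetical inputs introduced for illustration. The abstract does not specify how the cutoff round is chosen or how the UCB index is computed, so both are left to the caller.

```python
import random
from typing import Sequence


def phase(t: int, t_init: int, t_cutoff: int) -> str:
    """Return which of the three phases round t falls in.

    t_init and t_cutoff are hypothetical phase boundaries; the abstract
    does not state how they are chosen.
    """
    if t <= t_init:
        return "explore"  # (i) pure random exploration
    if t <= t_cutoff:
        return "mixed"    # (ii) eta-random exploration combined with UCB
    return "ucb"          # (iii) pure UCB after the exploration cutoff


def select_job(
    t: int,
    ucb_scores: Sequence[float],
    t_init: int,
    t_cutoff: int,
    eta: float,
    rng: random.Random,
) -> int:
    """Pick the index of the job to serve in round t.

    ucb_scores holds one assumed, precomputed UCB index per queued job.
    """
    if not ucb_scores:
        raise ValueError("queue is empty")
    n = len(ucb_scores)
    current = phase(t, t_init, t_cutoff)
    if current == "explore":
        return rng.randrange(n)
    if current == "mixed" and rng.random() < eta:
        return rng.randrange(n)
    # Greedy choice with respect to the UCB index.
    return max(range(n), key=lambda i: ucb_scores[i])
```

For example, `select_job(200, [0.1, 0.9, 0.5], t_init=10, t_cutoff=100, eta=0.2, rng=random.Random(0))` falls in the pure UCB phase and returns index 1, the job with the largest score. Keeping the phase rule separate from the index computation lets the cutoff be varied independently of the estimator.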
