We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB) and when the linear structure creates fundamentally new challenges. Specifically, (1) for \emph{loss-independent delays}, where the delay does not depend on the realized loss (but may depend on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving on prior state-of-the-art results; (2) for \emph{loss-dependent delays}, we show that linear bandits are substantially harder than MAB: unlike in MAB, the delay penalty in our matching (up to logarithmic factors) upper and lower bounds depends on the square root of the dimension; (3) for the \emph{delay-as-payoff model}, a special case of loss-dependent delays, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.
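To make the adversarial-delay quantity concrete, the following is a minimal sketch of how the maximum number of outstanding observations could be computed from a delay sequence. It assumes a convention that is not stated in the abstract: the loss of round t (0-indexed) with delay d is revealed at the end of round t + d, so it counts as outstanding during rounds t through t + d - 1. The function name `max_outstanding` is hypothetical, and the paper's exact definition may differ.

```python
from typing import Sequence


def max_outstanding(delays: Sequence[int]) -> int:
    """Largest number of pending observations at any round.

    Assumed convention (not taken from the paper): the loss of round t
    with delay d is pending during rounds t, ..., t + d - 1.
    """
    horizon = len(delays)
    # Difference array: +1 when an observation becomes pending,
    # -1 at the round where it is no longer pending.
    diff = [0] * (horizon + 1)
    for t, d in enumerate(delays):
        if d <= 0:
            continue  # feedback arrives immediately, never pending
        diff[t] += 1
        diff[min(t + d, horizon)] -= 1  # clip at the horizon

    best = pending = 0
    for t in range(horizon):
        pending += diff[t]
        best = max(best, pending)
    return best
```

Under this convention, one very long delay contributes only one outstanding observation, whereas many moderate delays can overlap. This is why a penalty in terms of outstanding observations can be much smaller than one in terms of the maximum delay.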