We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delay durations are revealed upfront each round) and/or ``preemptibility'' (i.e., the ability to stop tracking feedback from previously chosen rounds), we establish matching upper and lower bounds (up to logarithmic terms) on the achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of classical delayed online learning, which implicitly assumes unlimited capacity. Our algorithms achieve the minimax-optimal regret at every capacity level, with performance degrading gracefully when the capacity is below optimal. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When clairvoyance is replaced with preemptibility, we require a known bound $d_{\max}$ on the maximum delay, which adds $\smash{\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret in the bandit setting is $\Theta\bigl(\sqrt{TK(1+d/C)+Td\log(K)}\bigr)$ and the optimal capacity is $\Theta(\min\{K/\log(K),d\})$, while in the full-information setting the minimax regret is $\Theta\bigl(\sqrt{T(d+1)\log(K)}\bigr)$ and the optimal capacity is $\Theta(1)$. For both round-dependent and fixed delays, our upper bounds are achieved by novel scheduling policies based on Pareto-distributed proxy delays and batching techniques. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.
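As a sanity check (a direct instantiation of the bounds stated above, under the assumption that the classical unlimited-capacity delayed-bandit rate is $\Theta(\sqrt{TK + Td\log(K)})$): for fixed delays $d$, plugging the optimal capacity $C^{\star} = \Theta(\min\{K/\log(K), d\})$ into the bandit bound gives
\[
\sqrt{TK\left(1 + \tfrac{d}{C^{\star}}\right) + Td\log(K)}
=
\begin{cases}
\sqrt{2TK + Td\log(K)}, & d \le K/\log(K) \ \ (C^{\star} = d),\\[4pt]
\sqrt{TK + 2Td\log(K)}, & d > K/\log(K) \ \ (C^{\star} = K/\log(K)),
\end{cases}
\]
both of which are $\Theta\bigl(\sqrt{TK + Td\log(K)}\bigr)$, i.e., the classical delayed-bandit minimax rate is recovered at the stated optimal capacity.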