We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety--robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.