The Lipschitz bandit problem extends stochastic bandits to a continuous action set defined over a metric space, where the expected reward function satisfies a Lipschitz condition. In this work, we introduce the problem of Lipschitz bandits with stochastic delayed feedback, where rewards are not observed immediately but only after a random delay. We consider both bounded and unbounded stochastic delays and design algorithms that attain sublinear regret in each setting. For bounded delays, we propose a delay-aware zooming algorithm that matches the optimal regret of the delay-free setting up to an additive term that scales with the maximal delay $\tau_{\max}$. For unbounded delays, we propose a novel phased learning strategy that accumulates reliable feedback over carefully scheduled intervals, and we establish a regret lower bound showing that our method is optimal up to logarithmic factors. Finally, we present experimental results that demonstrate the efficiency of our algorithms under various delay scenarios.
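To make the delayed-feedback setting concrete, the following is a minimal simulation sketch and not the delay-aware zooming algorithm or the phased strategy described above. It swaps in a plain UCB rule over a fixed uniform grid of $[0, 1]$, and every specific in it is an assumption chosen for illustration: the function name `run_delayed_bandit`, the 1-Lipschitz mean reward $0.9 - |x - 0.7|$, the Gaussian noise, and delays drawn uniformly from $\{0, \dots, \tau_{\max}\}$.

```python
import heapq
import math
import random


def run_delayed_bandit(horizon=2000, n_arms=10, max_delay=20, seed=0):
    """Uniform-grid UCB on [0, 1] with bounded random feedback delays.

    Illustrative only: the reward function, noise level, and delay
    distribution are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)

    def mean(x):
        # Hypothetical 1-Lipschitz mean reward, maximized at x = 0.7.
        return 0.9 - abs(x - 0.7)

    arms = [(i + 0.5) / n_arms for i in range(n_arms)]  # grid midpoints
    counts = [0] * n_arms   # number of rewards observed so far, per arm
    sums = [0.0] * n_arms   # sum of observed rewards, per arm
    pending = []            # min-heap of (arrival_round, arm_index, reward)
    regret = 0.0

    for t in range(1, horizon + 1):
        # Reveal every reward whose arrival round is at most t.
        while pending and pending[0][0] <= t:
            _, i, r = heapq.heappop(pending)
            counts[i] += 1
            sums[i] += r

        def index(i):
            # UCB index computed from observed feedback only.
            if counts[i] == 0:
                return float("inf")
            bonus = math.sqrt(2.0 * math.log(t) / counts[i])
            return sums[i] / counts[i] + bonus

        i = max(range(n_arms), key=index)
        reward = mean(arms[i]) + rng.gauss(0.0, 0.1)
        delay = rng.randint(0, max_delay)  # bounded delay in {0, ..., max_delay}
        heapq.heappush(pending, (t + delay, i, reward))

        # Pseudo-regret against the true optimum mean(0.7) = 0.9.
        regret += 0.9 - mean(arms[i])

    return regret, sum(counts), len(pending)
```

Calling `run_delayed_bandit(max_delay=0)` and `run_delayed_bandit(max_delay=200)` gives a rough feel for how a larger maximal delay postpones learning, since the learner keeps acting on stale statistics while rewards are still in the queue.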