We investigate bandit convex optimization (BCO) with delayed feedback, where only the loss value of the chosen action is revealed, and only after an arbitrary delay. Previous studies have established a regret bound of $O(T^{3/4}+d^{1/3}T^{2/3})$ for this problem, where $d$ is the maximum delay, by simply feeding delayed loss values to the classical bandit gradient descent (BGD) algorithm. In this paper, we develop a novel algorithm that improves the regret by carefully exploiting the delayed bandit feedback via a blocking update mechanism. Our analysis shows that the proposed algorithm can decouple the joint effect of the delays and the bandit feedback on the regret, and improves the regret bound to $O(T^{3/4}+\sqrt{dT})$ for convex functions. Compared with the previous result, our bound matches the $O(T^{3/4})$ regret of BGD in the non-delayed setting for a larger range of delays, i.e., $d=O(\sqrt{T})$ instead of $d=O(T^{1/4})$. Furthermore, we consider the case of strongly convex functions, and prove that the proposed algorithm enjoys a better regret bound of $O(T^{2/3}\log^{1/3}T+d\log T)$. Finally, we show that in the special case of unconstrained action sets, the algorithm can be easily extended to achieve a regret bound of $O(\sqrt{T\log T}+d\log T)$ for strongly convex and smooth functions.
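The delay thresholds quoted above follow from comparing each delay-dependent term with the $O(T^{3/4})$ term. As a short worked check (ignoring constants, and assuming $d \geq 1$ and $T \geq 1$):

$$
d^{1/3}T^{2/3} \le T^{3/4} \iff d^{1/3} \le T^{1/12} \iff d \le T^{1/4},
\qquad
\sqrt{dT} \le T^{3/4} \iff d^{1/2} \le T^{1/4} \iff d \le T^{1/2}.
$$

So the previous bound reduces to $O(T^{3/4})$ only when $d=O(T^{1/4})$, whereas the new bound does so whenever $d=O(\sqrt{T})$.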