This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and, after a fixed number of iterations, zero constraint violation. This improves on the vanilla drift-plus-penalty method, which incurs $O(\sqrt{T})$ constraint violation. Unlike the vanilla method, our algorithm does not need to know the length of the time horizon $T$ in advance. Our analysis rests on a novel drift lemma that provides time-varying bounds on the virtual queue drift and, in turn, on the expected virtual queue length. We also extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. Adapting the algorithm to this setting still achieves $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon previous work for the case of identical constraint functions. Numerical experiments support our theoretical results.
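To make the virtual-queue mechanism concrete, the following is a minimal sketch of a generic drift-plus-penalty update with horizon-free, time-varying parameters. It is not the algorithm proposed in the paper: the schedules $V_t = \sqrt{t}$ and $\alpha_t = t$, the single constraint, the unit-ball feasible set, and all function names (`project_ball`, `drift_plus_penalty`) are illustrative assumptions chosen only so that no knowledge of $T$ is required.

```python
import numpy as np


def project_ball(x, radius=1.0):
    """Euclidean projection onto a ball (assumed feasible set, for illustration)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def drift_plus_penalty(grad_f, g, grad_g, x0, num_rounds, radius=1.0):
    """Generic drift-plus-penalty loop with one constraint and a virtual queue Q.

    grad_f(x, t), g(x, t), grad_g(x, t) return the loss gradient, the sampled
    constraint value, and the sampled constraint gradient at round t.
    """
    x = np.asarray(x0, dtype=float)
    Q = 0.0  # virtual queue: accumulates (linearized) constraint violation
    iterates, queue = [], []
    for t in range(1, num_rounds + 1):
        V_t = np.sqrt(t)    # assumed penalty weight, independent of the horizon
        alpha_t = float(t)  # assumed proximal weight, independent of the horizon
        iterates.append(x.copy())
        gf, gv, gg = grad_f(x, t), g(x, t), grad_g(x, t)
        # Primal step: trade off the loss (weight V_t) against the queue-weighted constraint.
        x_next = project_ball(x - (V_t * gf + Q * gg) / (2.0 * alpha_t), radius)
        # Queue step: grow with the linearized constraint value, never drop below zero.
        Q = max(Q + gv + gg @ (x_next - x), 0.0)
        queue.append(Q)
        x = x_next
    return np.array(iterates), np.array(queue)


# Toy problem (assumed): loss ||x - c||^2 and a noisy linear constraint a_t^T x <= 0.2.
rng = np.random.default_rng(0)
c = np.array([1.0, 0.0])
noisy_a = [np.array([1.0, 0.0]) + 0.1 * rng.standard_normal(2) for _ in range(200)]

xs, qs = drift_plus_penalty(
    grad_f=lambda x, t: 2.0 * (x - c),
    g=lambda x, t: noisy_a[t - 1] @ x - 0.2,
    grad_g=lambda x, t: noisy_a[t - 1],
    x0=np.zeros(2),
    num_rounds=200,
)
```

Here `xs` holds the played points and `qs` the queue length after each round. In a sketch like this, the queue length is the quantity that a drift lemma would bound, and a bound on the queue translates into a bound on cumulative constraint violation.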