In this study, we propose a new method for constructing UCB-type algorithms for stochastic multi-armed bandits, based on general convex optimization methods with an inexact oracle. We derive regret bounds that correspond to the convergence rates of the optimization methods. We propose a new algorithm, Clipped-SGD-UCB, and show, both theoretically and empirically, that when the noise in the reward is symmetric, we can achieve an $O(\log T\sqrt{KT\log T})$ regret bound instead of $O\left(T^{\frac{1}{1+\alpha}} K^{\frac{\alpha}{1+\alpha}}\right)$ when the reward distribution satisfies $\mathbb{E}_{X \in D}[|X|^{1+\alpha}] \leq \sigma^{1+\alpha}$ ($\alpha \in (0, 1]$), i.e., we perform better than the general lower bound for bandits with heavy tails suggests. Moreover, the same bound holds even when the reward distribution has no expectation, that is, when $\alpha<0$.
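To give some intuition for why symmetric noise helps, the following is a minimal sketch of a clipped-SGD location estimate on Cauchy rewards, which have no expectation. It is not the Clipped-SGD-UCB algorithm itself: the function names, the clipping level `clip`, the $1/\sqrt{t}$ step size, and the choice of Cauchy noise are all assumptions made for illustration. The estimate is obtained by running SGD on the quadratic loss $\frac{1}{2}(x - r)^2$ with the stochastic gradient clipped to $[-\text{clip}, \text{clip}]$; under symmetric noise the expected clipped gradient vanishes at the center of symmetry.

```python
import math
import random


def cauchy_rewards(center, n, seed=0):
    """Draw n rewards = center + standard Cauchy noise (symmetric, no mean).

    Hypothetical helper for illustration; uses inverse-CDF sampling.
    """
    rng = random.Random(seed)
    return [center + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


def clipped_sgd_mean(rewards, clip=1.0):
    """Estimate the center of symmetry with clipped SGD.

    Assumed setup (not taken from the paper): loss 0.5 * (x - r)^2,
    gradient clipped to [-clip, clip], step size 1 / sqrt(t).
    """
    x = 0.0
    for t, r in enumerate(rewards, start=1):
        grad = x - r                          # stochastic gradient of the loss
        grad = max(-clip, min(clip, grad))    # clipping bounds the effect of outliers
        x -= grad / math.sqrt(t)              # SGD step with decaying step size
    return x


if __name__ == "__main__":
    rewards = cauchy_rewards(center=2.0, n=20000)
    print("clipped SGD estimate:", clipped_sgd_mean(rewards))
    print("sample mean:", sum(rewards) / len(rewards))
```

In a UCB-type scheme, an estimate of this kind would stand in for the per-arm empirical mean, with a confidence width derived from the convergence rate of the optimizer; that part is not sketched here.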