This paper proposes a linear bandit algorithm that adapts to its environment at two levels of a hierarchy. At the higher level, the algorithm adapts to the type of environment. More precisely, it achieves best-of-three-worlds regret bounds: $O(\sqrt{T \log T})$ for adversarial environments and $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote the time horizon, the minimum sub-optimality gap, and the total amount of corruption, respectively (setting $C = 0$ gives the purely stochastic case). Polynomial factors in the dimensionality are omitted here. At the lower level, within each of the adversarial and stochastic regimes, the algorithm adapts to specific characteristics of the environment and thereby performs better. It has data-dependent regret bounds in terms of the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss vector sequence. In addition, for stochastic environments, it has a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$, where $\sigma^2$ denotes the maximum variance of the feedback loss. The algorithm is based on the SCRiBLe algorithm. Incorporating a new technique that we call scaled-up sampling yields the high-level adaptability, and incorporating optimistic online learning yields the low-level adaptability.
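To get a feel for how the stated rates compare, the following minimal sketch evaluates the expressions inside the $O(\cdot)$ bounds numerically. It is an illustration only: the function names are hypothetical, all constants and the polynomial factors in the dimensionality are dropped (as in the abstract), and the natural logarithm is assumed, so the outputs indicate scaling rather than actual regret values.

```python
import math


def adversarial_rate(T):
    """Expression inside O(sqrt(T log T)); constants and dimension factors omitted."""
    return math.sqrt(T * math.log(T))


def corrupted_stochastic_rate(T, gap_min, C=0.0):
    """Expression inside O(log T / gap_min + sqrt(C log T / gap_min)).

    C is the total amount of corruption; C = 0 is the purely stochastic case.
    """
    log_T = math.log(T)
    return log_T / gap_min + math.sqrt(C * log_T / gap_min)


def variance_adaptive_rate(T, gap_min, sigma_sq):
    """Expression inside O(sigma^2 log T / gap_min) for stochastic environments."""
    return sigma_sq * math.log(T) / gap_min


if __name__ == "__main__":
    # Illustrative values chosen for this sketch, not taken from the paper.
    T, gap_min = 10**6, 0.1
    print("adversarial:", round(adversarial_rate(T), 1))
    for C in (0.0, 100.0, 10000.0):
        print(f"stochastic, C={C}:", round(corrupted_stochastic_rate(T, gap_min, C), 1))
    print("variance-adaptive, sigma^2=0.01:",
          round(variance_adaptive_rate(T, gap_min, 0.01), 3))
```

With these illustrative values ($T = 10^6$, $\Delta_{\min} = 0.1$, $C = 0$), the stochastic expression is roughly 138 while the adversarial expression is roughly 3717, and the corruption term grows only with $\sqrt{C}$.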