This study considers the linear contextual bandit problem with independent and identically distributed (i.i.d.) contexts. For this problem, existing studies have proposed Best-of-Both-Worlds (BoBW) algorithms whose regret is $O(\log^2(T))$ in the number of rounds $T$ in a stochastic regime where the suboptimality gap is lower-bounded by a positive constant, and $O(\sqrt{T})$ in an adversarial regime. However, the dependence on $T$ has room for improvement, and the suboptimality-gap assumption can be relaxed. To address these issues, this study proposes an algorithm whose regret is $O(\log(T))$ when the suboptimality gap is lower-bounded by a positive constant. Furthermore, we introduce a margin condition, a milder assumption on the suboptimality gap, which characterizes the problem difficulty linked to the suboptimality gap via a parameter $\beta \in (0, \infty]$. We then show that the algorithm's regret is $O\left(\left\{\log(T)\right\}^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$. Here, $\beta = \infty$ corresponds to the case considered in existing studies, where the suboptimality gap is lower-bounded by a positive constant, and our regret bound becomes $O(\log(T))$ in that case. The proposed algorithm is based on Follow-The-Regularized-Leader (FTRL) with Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
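As a quick sanity check on how the stated bound interpolates between regimes, one can evaluate the rate $\{\log(T)\}^{\frac{1+\beta}{2+\beta}} T^{\frac{1}{2+\beta}}$ numerically: as $\beta$ decreases toward $0$ it approaches $\sqrt{T \log(T)}$, while as $\beta \to \infty$ the $T$-exponent $\frac{1}{2+\beta}$ vanishes and the rate tends to $\log(T)$. The helper `regret_rate` below is a hypothetical illustration, not part of the paper:

```python
import math

def regret_rate(T, beta):
    """Hypothetical helper: evaluates (log T)^{(1+beta)/(2+beta)} * T^{1/(2+beta)},
    i.e. the regret rate from the abstract up to constant factors."""
    return math.log(T) ** ((1 + beta) / (2 + beta)) * T ** (1 / (2 + beta))

# Larger beta (an easier instance under the margin condition) yields a smaller rate;
# for very large beta the value is close to log T ≈ 13.82 when T = 10^6.
print(regret_rate(10**6, 1.0))    # beta = 1: (log T)^{2/3} * T^{1/3}
print(regret_rate(10**6, 1e12))   # near-infinite beta: approximately log T
```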