The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach that offers a stronger theoretical guarantee of asymptotic optimality than the Kullback--Leibler Upper Confidence Bound (KL-UCB) algorithm for the multi-armed bandit problem. It has also been observed to outperform UCB-based algorithms and Thompson Sampling empirically. Despite its effectiveness, generalizing this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We show that LinIMED achieves a $\widetilde{O}(d\sqrt{T})$ regret upper bound, where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.