This paper is motivated by recent developments in the linear bandit literature that have revealed a gap between the strong empirical performance of algorithms such as Thompson sampling and Greedy and their pessimistic theoretical regret bounds. The gap arises because these algorithms can perform poorly on certain problem instances yet excel on typical ones. To address this, we propose a new data-driven technique that tracks the geometry of the uncertainty ellipsoid, which lets us establish an instance-dependent frequentist regret bound for a broad class of algorithms, including Greedy, OFUL, and Thompson sampling. This bound allows us to identify and ``course-correct'' the instances in which the base algorithms perform poorly. The course-corrected algorithms achieve the minimax optimal regret of order $\tilde{\mathcal{O}}(d\sqrt{T})$ while retaining most of the desirable properties of the base algorithms. We present simulation results that validate our findings and compare the performance of our algorithms with the baselines.
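To make the idea of tracking the geometry of the uncertainty ellipsoid concrete, the following is a minimal sketch and not the paper's algorithm. It runs a Greedy linear bandit with a ridge estimate and records, at every round, the smallest and largest eigenvalues of the regularized Gram matrix $V_t$. The abstract does not say which geometric quantity is tracked, so the choice of these eigenvalues is an assumption, as are the function name `run_greedy`, the Gaussian noise model, and the parameters `lam` and `noise_sd`. The only fact relied on is that the axes of a confidence ellipsoid defined through $V_t$ have lengths proportional to $1/\sqrt{\lambda_i(V_t)}$, so a small $\lambda_{\min}$ indicates a direction that has been explored very little.

```python
import numpy as np


def run_greedy(theta_star, arms, horizon, lam=1.0, noise_sd=0.1, seed=0):
    """Greedy linear bandit that also records the spectrum of the Gram matrix.

    Hypothetical illustration: the tracked statistic (extreme eigenvalues of
    V_t) is an assumption, not the quantity used in the paper.
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = lam * np.eye(d)            # regularized Gram matrix V_t
    b = np.zeros(d)                # running sum of reward-weighted actions
    best = np.max(arms @ theta_star)
    regret = 0.0
    history = []                   # (lambda_min, lambda_max) of V_t per round

    for _ in range(horizon):
        theta_hat = np.linalg.solve(V, b)        # ridge estimate of theta
        x = arms[np.argmax(arms @ theta_hat)]    # greedy action, no exploration
        reward = x @ theta_star + noise_sd * rng.standard_normal()

        V += np.outer(x, x)                      # rank-one update of V_t
        b += reward * x
        regret += best - x @ theta_star

        eig = np.linalg.eigvalsh(V)              # eigenvalues in ascending order
        history.append((eig[0], eig[-1]))

    return regret, np.array(history)
```

In a sketch like this, a run in which the first column of `history` stays close to `lam` while the second keeps growing is one where Greedy keeps pulling arms in nearly the same direction. That is the kind of instance a data-driven check could flag for correction, although the actual criterion used in the paper is not given in the abstract.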