This paper is motivated by recent research in the $d$-dimensional stochastic linear bandit literature, which has revealed an unsettling discrepancy: algorithms such as Thompson sampling and Greedy perform well empirically, yet their theoretical regret bounds are pessimistic. The discrepancy arises because these algorithms may perform poorly on certain problem instances while excelling on typical ones. To address this, we propose a new data-driven technique that tracks the geometric properties of the uncertainty ellipsoid around the main problem parameter. This technique yields an instance-dependent frequentist regret bound that incorporates the geometric information and holds for a broad class of base algorithms, including Greedy, OFUL, and Thompson sampling. The bound allows us to identify and ``course-correct'' problem instances in which the base algorithms perform poorly. The course-corrected algorithms achieve the minimax optimal regret of order $\tilde{\mathcal{O}}(d\sqrt{T})$ in a $T$-period decision-making scenario while retaining the desirable attributes of the base algorithms, including their empirical efficacy. We validate our findings with simulations on synthetic and real data.