Information-directed sampling (IDS) is a powerful framework for solving bandit problems that has shown strong results in both Bayesian and frequentist settings. However, frequentist IDS, like many other bandit algorithms, requires prior knowledge of a (relatively) tight upper bound on the norm of the true parameter vector governing the reward model in order to achieve good performance. Unfortunately, this requirement is rarely satisfied in practice. As we demonstrate, using a poorly calibrated bound can lead to significant regret accumulation. To address this issue, we introduce a novel frequentist IDS algorithm that iteratively refines a high-probability upper bound on the true parameter norm using accumulating data. We focus on the linear bandit setting with heteroskedastic subgaussian noise. Our method leverages a mixture of relevant information gain criteria to balance exploration aimed at tightening the estimated parameter norm bound against direct search for the optimal action. We establish regret bounds for our algorithm that do not depend on an initially assumed parameter norm bound, and we demonstrate that our method outperforms state-of-the-art IDS and UCB algorithms.