We study the problem of online learning in contextual bandit problems where the loss function is assumed to belong to a known parametric function class. We propose a new analytic framework for this setting that bridges the Bayesian theory of information-directed sampling due to Russo and Van Roy (2018) and the worst-case theory of Foster, Kakade, Qian, and Rakhlin (2021) based on the decision-estimation coefficient. Drawing from both lines of work, we propose an algorithmic template called Optimistic Information-Directed Sampling and show that it can achieve instance-dependent regret guarantees similar to those achievable by the classic Bayesian IDS method, but with the major advantage of not requiring any Bayesian assumptions. The key technical innovation of our analysis is the introduction of an optimistic surrogate model for the regret, which we use to define a frequentist version of the information ratio of Russo and Van Roy (2018) and a less conservative version of the decision-estimation coefficient of Foster et al. (2021). Keywords: contextual bandits, information-directed sampling, decision-estimation coefficient, first-order regret bounds.