We study online learning in contextual bandit problems where the loss function is assumed to belong to a known parametric function class. We propose a new analytic framework for this setting that bridges the Bayesian theory of information-directed sampling due to Russo and Van Roy (2018) and the worst-case theory of Foster, Kakade, Qian, and Rakhlin (2021) based on the decision-estimation coefficient. Drawing from both lines of work, we propose an algorithmic template called Optimistic Information-Directed Sampling and show that it can achieve instance-dependent regret guarantees similar to those achievable by the classic Bayesian IDS method, but with the major advantage of not requiring any Bayesian assumptions. The key technical innovation of our analysis is an optimistic surrogate model for the regret, which we use to define a frequentist version of the information ratio of Russo and Van Roy (2018) and a less conservative version of the decision-estimation coefficient of Foster et al. (2021).

Keywords: Contextual bandits, information-directed sampling, decision-estimation coefficient, first-order regret bounds.