We study the problem of stochastic contextual bandits in the agnostic setting, where the goal is to compete with the best policy in a given class without assuming realizability or imposing model restrictions on losses or rewards. In this work, we establish the first fast rate for regret relative to the best-in-class policy. Our proposed algorithm updates the policy at every round by minimizing a pessimistic objective, defined as a clipped inverse-propensity estimate of the policy value plus a variance penalty. By leveraging entropy assumptions on the policy class and a Hölderian error-bound condition (a generalization of the margin condition), we achieve fast best-in-class regret rates, including polylogarithmic rates in the parametric case. The analysis is driven by a sequential self-normalized maximal inequality for bounded martingale empirical processes, which yields uniform variance-adaptive confidence bounds and guarantees pessimism under adaptive data collection.
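To make the pessimistic objective concrete, the following is a minimal sketch for a finite policy class, not the paper's algorithm. The abstract only says the objective is a clipped inverse-propensity estimate plus a variance penalty; the specific clipping rule (a floor `clip` on the propensities), the penalty form (`penalty` times the estimated standard error), and all function and parameter names are assumptions made for illustration.

```python
import numpy as np


def pessimistic_objective(policy_actions, logged_actions, losses, propensities,
                          clip=0.05, penalty=1.0):
    """Clipped inverse-propensity loss estimate plus a variance penalty.

    Assumed form (not taken from the paper): propensities are floored at
    `clip`, and the penalty is `penalty` times the estimated standard error.
    """
    policy_actions = np.asarray(policy_actions)
    logged_actions = np.asarray(logged_actions)
    losses = np.asarray(losses, dtype=float)
    propensities = np.asarray(propensities, dtype=float)

    n = len(losses)
    # Only rounds where the policy agrees with the logged action contribute.
    match = (policy_actions == logged_actions).astype(float)
    # Clipping: floor the propensity so importance weights stay bounded.
    weights = match / np.maximum(propensities, clip)
    terms = weights * losses

    estimate = terms.mean()
    variance = terms.var(ddof=1) if n > 1 else 0.0
    # Pessimism for losses: add an upper-confidence-style variance term.
    return estimate + penalty * np.sqrt(variance / n)


def select_policy(policies, contexts, logged_actions, losses, propensities,
                  clip=0.05, penalty=1.0):
    """Return the index of the policy minimizing the pessimistic objective."""
    scores = []
    for policy in policies:
        actions = np.array([policy(x) for x in contexts])
        scores.append(pessimistic_objective(
            actions, logged_actions, losses, propensities, clip, penalty))
    return int(np.argmin(scores)), scores


# Toy data (hypothetical): two constant policies, uniform logging over 2 actions.
contexts = [0, 1, 2, 3]
logged_actions = [0, 1, 0, 1]
losses = [0.1, 0.9, 0.1, 0.9]        # action 0 incurs low loss, action 1 high
propensities = [0.5, 0.5, 0.5, 0.5]
policies = [lambda x: 0, lambda x: 1]

best, scores = select_policy(policies, contexts, logged_actions, losses, propensities)
```

In an online loop, a selection of this kind would be recomputed at every round on the data collected so far. The variance term penalizes policies whose estimates rest on few or heavily weighted samples.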