We study nonparametric contextual bandits under batch constraints, where the expected reward of each action is modeled as a smooth function of the covariates and the policy is updated only at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret up to logarithmic factors. In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that, for nonparametric contextual bandits, a nearly constant number of policy updates suffices to attain the optimal regret of the fully online setting.
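To make the binning idea concrete, here is a minimal Python sketch of a batched policy that refines equal-width bins at each batch boundary and eliminates arms bin by bin. It is not the algorithm analyzed in the paper: the function name `batched_binning_sketch`, the batch grid `batch_ends`, the split schedule `splits`, the confidence radius, the elimination threshold, and the one-dimensional 1-Lipschitz reward functions are all assumptions chosen for illustration.

```python
import math
import random


def batched_binning_sketch(T, batch_ends, splits, reward_fns, noise=0.1, seed=0):
    """Toy batched bandit on covariates x in [0, 1) with equal-width bins.

    batch_ends: increasing round counts at which the policy may update; last is T.
    splits: splits[i] = children per bin created after batch i (hypothetical schedule).
    reward_fns: mean-reward function per arm, assumed 1-Lipschitz (simulation only).
    Returns (cumulative regret, list of active-arm sets, one per final bin).
    """
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    active = [set(range(n_arms))]  # one bin covering [0, 1) at the start
    regret, t = 0.0, 0
    for i, end in enumerate(batch_ends):
        n_bins = len(active)
        # Per-bin, per-arm [reward sum, pull count], reset every batch.
        stats = [{a: [0.0, 0] for a in arms} for arms in active]
        while t < end:  # the policy is frozen inside a batch
            x = rng.random()
            j = min(int(x * n_bins), n_bins - 1)
            a = rng.choice(sorted(active[j]))  # uniform over surviving arms
            means = [f(x) for f in reward_fns]
            stats[j][a][0] += means[a] + rng.gauss(0.0, noise)
            stats[j][a][1] += 1
            regret += max(means) - means[a]
            t += 1
        if i == len(batch_ends) - 1:
            break  # no update after the final batch
        width = 1.0 / n_bins
        for j in range(n_bins):
            # Empirical mean and an assumed confidence radius per pulled arm.
            est = {a: (s / n, math.sqrt(2 * math.log(T) / n))
                   for a, (s, n) in stats[j].items() if n > 0}
            if not est:
                continue
            best_lcb = max(m - r for m, r in est.values())
            # Drop arms whose upper bound trails the best lower bound by more
            # than the bin width (allowance for variation inside the bin).
            for a, (m, r) in est.items():
                if best_lcb - (m + r) > width:
                    active[j].discard(a)
        # Refine: each bin is split into splits[i] children that inherit its arms.
        active = [set(active[j // splits[i]]) for j in range(n_bins * splits[i])]
    return regret, active


# Two arms whose mean rewards cross at x = 0.5; three batches, two policy updates.
regret, active = batched_binning_sketch(
    T=20000, batch_ends=[1000, 5000, 20000], splits=[4, 4],
    reward_fns=[lambda x: x, lambda x: 1.0 - x],
)
print(round(regret), active[0], active[-1])
```

In this toy run the policy changes only twice, yet bins far from the crossing point at x = 0.5 typically end up with a single surviving arm, while bins near it keep both. Splitting every bin uniformly is a simplification; the abstract only states that bin widths are aligned with the batch size, and the paper's actual schedule is not reproduced here.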