We study nonparametric contextual bandits under batch constraints, where the expected reward of each action is modeled as a smooth function of the covariates and the policy is updated only at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret up to logarithmic factors. In essence, our procedure dynamically splits the covariate space into progressively smaller bins, carefully aligning the bin widths with the batch size. Our theoretical results imply that, for nonparametric contextual bandits, a nearly constant number of policy updates suffices to attain the optimal regret of the fully online setting.
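The binning idea described above can be made concrete with a minimal sketch, assuming a one-dimensional covariate on [0, 1], a dyadic bin-halving schedule across batches, round-robin exploration within each batch, and a standard successive-elimination update at batch boundaries. The environment `pull`, the schedule, and the confidence radius are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): a batched
# contextual bandit on [0, 1] that refines a uniform binning of the
# covariate space at each of M policy updates and eliminates arms per
# bin between batches.

rng = np.random.default_rng(0)

T, M, K = 20_000, 4, 2                       # horizon, batches, arms
batch_ends = np.linspace(T / M, T, M).astype(int)

def pull(arm, x):
    """Stand-in environment with smooth (Lipschitz) mean rewards."""
    means = (0.5 * np.sin(2 * np.pi * x) + 0.5, x)
    return means[arm] + rng.normal(scale=0.1)

active = [list(range(K))]                    # active arms per bin (one bin)
t = 0
for m, end in enumerate(batch_ends):
    n_bins = len(active)
    width = 1.0 / n_bins                     # bin width tied to batch index
    sums = np.zeros((n_bins, K))
    cnts = np.zeros((n_bins, K))
    while t < end:
        x = rng.random()                     # covariate arrives
        b = min(int(x / width), n_bins - 1)
        # Round-robin over the bin's surviving arms; within a batch no
        # feedback is used, reflecting the batch constraint.
        arm = active[b][int(cnts[b, active[b]].sum()) % len(active[b])]
        sums[b, arm] += pull(arm, x)
        cnts[b, arm] += 1
        t += 1
    # Policy update at the batch boundary: drop clearly suboptimal arms.
    new_active = []
    for b in range(n_bins):
        mu = np.divide(sums[b], cnts[b], out=np.zeros(K), where=cnts[b] > 0)
        conf = np.sqrt(2 * np.log(T) / np.maximum(cnts[b], 1))
        keep = [a for a in active[b]
                if mu[a] + conf[a] >= max(mu[j] - conf[j] for j in active[b])]
        # Halve the bin: both children inherit the parent's survivors,
        # so bin width shrinks in lockstep with the batch schedule.
        new_active += [list(keep), list(keep)]
    active = new_active
```

The key design point the sketch mimics is that each policy update both refines the partition and prunes arms, so the within-bin estimation error stays balanced against the approximation error of treating the reward as constant on a bin.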