Personalized services are central to today's digital economy, and their sequential decisions are often modeled as contextual bandits. Modern applications pose two main challenges: high-dimensional covariates and the need for nonparametric models to capture complex reward-covariate relationships. We propose a contextual bandit algorithm based on a sparse additive reward model that addresses both challenges through (i) a doubly penalized estimator for nonparametric reward estimation and (ii) an epoch-based design with adaptive screening to balance exploration and exploitation. We prove a sublinear regret bound that grows only logarithmically in the covariate dimensionality; to our knowledge, this is the first such result for nonparametric contextual bandits with high-dimensional covariates. We also derive an information-theoretic lower bound, and the gap to the upper bound vanishes as the reward smoothness increases. Extensive experiments on synthetic data and real data from video recommendation and personalized medicine show strong performance in high-dimensional settings.
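To make the algorithmic recipe in the abstract concrete, here is a minimal toy sketch of the general pattern it describes: doubling epochs, refitting a per-arm sparse additive reward model at each epoch boundary, and decaying forced exploration in between. It is illustrative only, not the paper's algorithm: a plain scikit-learn Lasso over a coordinate-wise polynomial basis stands in for the doubly penalized (sparsity + smoothness) estimator and for adaptive screening, and all names (`basis`, `true_reward`) and constants are invented for the example.

```python
"""Toy epoch-based contextual bandit with sparse additive reward fits.

Illustrative sketch only: Lasso over an additive polynomial basis is a
stand-in for the paper's doubly penalized estimator, and epsilon-greedy
with doubling epochs stands in for its exploration schedule.
"""
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, K, T, degree = 30, 2, 2000, 3   # covariate dim, arms, horizon, basis size

def basis(X):
    # Additive feature map: each coordinate expanded into polynomial terms,
    # so a linear fit in this basis is an additive model in the covariates.
    X = np.atleast_2d(X)
    return np.hstack([X ** p for p in range(1, degree + 1)])

def true_reward(x, a):
    # Hypothetical sparse additive ground truth: each arm depends on
    # only 2 of the d covariates.
    return np.sin(3 * x[2 * a]) + 0.5 * x[2 * a + 1] ** 2

models = [None] * K
data = [([], []) for _ in range(K)]        # per-arm (contexts, rewards)
next_refit, cum_regret = 50, 0.0

for t in range(1, T + 1):
    x = rng.uniform(-1, 1, d)
    eps = min(1.0, 10 * K / t)             # decaying forced exploration
    if any(m is None for m in models) or rng.random() < eps:
        a = int(rng.integers(K))
    else:
        a = int(np.argmax([m.predict(basis(x))[0] for m in models]))
    r = true_reward(x, a) + 0.1 * rng.standard_normal()
    data[a][0].append(x); data[a][1].append(r)
    cum_regret += max(true_reward(x, b) for b in range(K)) - true_reward(x, a)

    if t == next_refit:                    # epoch boundary: refit every arm
        next_refit *= 2                    # doubling epoch lengths
        for a in range(K):
            Xs, rs = data[a]
            if len(rs) >= 20:
                Z = basis(np.array(Xs))
                # The L1 penalty zeroes out basis terms of irrelevant
                # coordinates -- a crude proxy for variable screening.
                models[a] = Lasso(alpha=0.05).fit(Z, np.array(rs))

print(f"cumulative regret after {T} rounds: {cum_regret:.1f}")
```

Because the fitted coefficient vector is sparse across coordinate groups, prediction and screening cost scale with the few selected covariates rather than the ambient dimension, which is the intuition behind a regret bound that depends on the dimensionality only logarithmically.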