Kernel Single-Index Bandits: Estimation, Inference, and Learning

We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.
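
As a rough illustration of the two estimation steps named above, the following sketch pairs an inverse-propensity-weighted Stein-type estimate of one arm's index direction with a weighted kernel ridge fit of its link function. It is not the paper's estimator. It assumes standard Gaussian covariates (so the first-order Stein identity makes $E[yx]$ proportional to the index parameter), an RBF kernel, a tanh link, and a fixed context-dependent logging policy with exploration rate `eps` standing in for the adaptive $\varepsilon$-greedy rule. All function names, parameters, and constants are hypothetical.

```python
import numpy as np

def stein_index_estimate(X, y, pulled, propensity):
    """IPW first-order Stein estimate of one arm's index direction.

    Assumption: covariates are standard Gaussian, so the score is S(x) = x
    and E[y * x] is proportional to the index parameter (Stein's lemma).
    """
    w = pulled / propensity                      # zero weight when the arm was not pulled
    v = (w[:, None] * y[:, None] * X).mean(axis=0)
    return v / np.linalg.norm(v)                 # only the direction is identified

def rbf_kernel(u, v, bandwidth=0.5):
    """Gaussian kernel matrix between two 1-D arrays of index values."""
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / (2 * bandwidth ** 2))

def ipw_kernel_ridge(z, y, w, lam=1e-3, bandwidth=0.5):
    """Weighted kernel ridge regression on the scalar index z.

    Minimizes sum_i w_i (y_i - f(z_i))^2 + lam * n * ||f||^2, whose
    representer coefficients solve (W K + lam * n * I) alpha = W y.
    """
    n = len(z)
    K = rbf_kernel(z, z, bandwidth)
    alpha = np.linalg.solve(w[:, None] * K + lam * n * np.eye(n), w * y)

    def predict(z_new):
        z_new = np.atleast_1d(np.asarray(z_new, dtype=float))
        return rbf_kernel(z_new, z, bandwidth) @ alpha

    return predict

# Simulated data for a single arm (all constants are illustrative assumptions).
rng = np.random.default_rng(0)
T, d, eps = 3000, 5, 0.2
theta = np.array([0.6, -0.4, 0.5, 0.3, -0.37])
theta /= np.linalg.norm(theta)
X = rng.standard_normal((T, d))

# Simplified logging policy: the arm is favored when x[0] > 0 and otherwise
# receives only the exploration probability eps / 2.
p0 = np.where(X[:, 0] > 0, 1 - eps / 2, eps / 2)
pulled = rng.random(T) < p0
y = np.where(pulled, np.tanh(X @ theta) + 0.1 * rng.standard_normal(T), 0.0)

# Step 1: estimate the index direction from all rounds with IPW weights.
theta_hat = stein_index_estimate(X, y, pulled, p0)

# Step 2: fit the link on the rounds where the arm was pulled.
z = X[pulled] @ theta_hat
link_hat = ipw_kernel_ridge(z, y[pulled], 1.0 / p0[pulled])
pred = float(link_hat(1.0)[0])
cosine = float(theta_hat @ theta)
```

Here `cosine` measures how well the estimated direction aligns with the simulated `theta`, and `pred` can be compared with `np.tanh(1.0)`. The inverse-propensity weights reach `2 / eps` on rarely explored contexts, which is the variance inflation the abstract refers to.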
