Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits

We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators of the reward model parameters, each obtained by maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of the feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results for randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. On the practical side, we remove the fixed time horizon assumption by developing anytime versions of our algorithms, which are suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
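To make the idea of "multiple estimators fit by maximum likelihood on randomly perturbed data" concrete, the following is a minimal illustrative sketch of ensemble sampling for a logistic (generalized linear) bandit. It is not the paper's \texttt{GLM-ES} algorithm: the perturbation scheme (Gaussian noise added to rewards plus a random prior center per member), the ensemble size `m`, the noise scale `noise_sd`, the regularizer `lam`, the Newton solver, and all function names are assumptions chosen for illustration, and none of the constants are tuned to match the stated regret bounds.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic link.
    return np.exp(-np.logaddexp(0.0, -z))

def fit_perturbed_mle(X, y_perturbed, prior_center, lam=1.0, steps=10):
    """Newton iterations on a logistic quasi-likelihood with perturbed targets,
    regularized toward a randomly drawn prior center (hypothetical scheme)."""
    theta = prior_center.copy()
    d = X.shape[1]
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y_perturbed) + lam * (theta - prior_center)
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + lam * np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)
    return theta

def run_glm_ensemble_sampling(T=100, d=3, K=5, m=5, noise_sd=0.5, lam=1.0, seed=0):
    """Simulate T rounds on a synthetic logistic bandit; return cumulative regret
    and the final ensemble parameters."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)

    # One random prior center per ensemble member (assumed perturbation).
    priors = rng.normal(scale=1.0 / np.sqrt(lam), size=(m, d))
    thetas = priors.copy()

    X_hist = np.empty((0, d))
    Y_hist = np.empty((m, 0))  # row i holds member i's perturbed rewards
    regret = np.zeros(T)

    for t in range(T):
        # Fresh unit-norm contexts each round.
        arms = rng.normal(size=(K, d))
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)

        # Pick one ensemble member uniformly and act greedily under it.
        j = rng.integers(m)
        a = int(np.argmax(arms @ thetas[j]))

        means = sigmoid(arms @ theta_star)
        reward = rng.binomial(1, means[a])
        regret[t] = means.max() - means[a]

        # Each member sees the same context but its own perturbed reward.
        X_hist = np.vstack([X_hist, arms[a]])
        Y_hist = np.hstack([Y_hist, reward + rng.normal(scale=noise_sd, size=(m, 1))])

        # Refit every member on its own perturbed data set.
        for i in range(m):
            thetas[i] = fit_perturbed_mle(X_hist, Y_hist[i], priors[i], lam)

    return np.cumsum(regret), thetas

if __name__ == "__main__":
    cum_regret, _ = run_glm_ensemble_sampling(T=200)
    print("cumulative regret after 200 rounds:", cum_regret[-1])
```

The exploration here comes only from disagreement among ensemble members, so no explicit confidence set or posterior sampling step is needed. Refitting every member from scratch each round keeps the sketch short; an incremental update would be the natural choice in a larger experiment.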
