We propose an algorithm for non-stationary kernel bandits that requires no prior knowledge of the degree of non-stationarity. The algorithm plays randomized strategies obtained by solving optimization problems that balance exploration and exploitation, and it adapts to non-stationarity by restarting whenever a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting with a linear kernel, it is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend the algorithm to use a neural network that dynamically adapts the feature mapping to observed data, and we prove a dynamic regret bound for the extension using neural tangent kernel theory. We demonstrate empirically that both the algorithm and its extension adapt to varying degrees of non-stationarity.
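To make the restart-on-detection idea concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes a finite set of arms instead of a kernelized action space, substitutes epsilon-greedy for the optimization-based randomized strategies, and uses a simple windowed mean comparison as a stand-in change detector. All names (`RestartingBandit`, `simulate`, `window`, `threshold`) are hypothetical.

```python
import random


class RestartingBandit:
    """Epsilon-greedy over finite arms that restarts when a reward shift is detected.

    Illustrative stand-in only: the exploration rule and the detection test
    are assumptions, not the method described in the abstract.
    """

    def __init__(self, n_arms, window=20, threshold=0.5, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.window = window          # number of recent rewards compared against older ones
        self.threshold = threshold    # mean shift that triggers a restart
        self.epsilon = epsilon        # exploration probability
        self.rng = random.Random(seed)
        self.restarts = 0
        self.reset()

    def reset(self):
        # Forget all past observations (this is the "restart").
        self.history = [[] for _ in range(self.n_arms)]

    def select(self):
        # Pull every arm once after each (re)start.
        for arm, rewards in enumerate(self.history):
            if not rewards:
                return arm
        # Explore uniformly with probability epsilon, otherwise act greedily.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        means = [sum(r) / len(r) for r in self.history]
        return max(range(self.n_arms), key=lambda a: means[a])

    def update(self, arm, reward):
        rewards = self.history[arm]
        rewards.append(reward)
        w = self.window
        # Compare the mean of the last w rewards with the mean of everything before.
        if len(rewards) >= 2 * w:
            old_mean = sum(rewards[:-w]) / (len(rewards) - w)
            new_mean = sum(rewards[-w:]) / w
            if abs(new_mean - old_mean) > self.threshold:
                self.restarts += 1
                self.reset()


def simulate(horizon=400, change_at=200):
    """Two arms with noiseless rewards; the best arm switches at `change_at`."""
    bandit = RestartingBandit(n_arms=2)
    choices = []
    for t in range(horizon):
        best_arm = 0 if t < change_at else 1
        arm = bandit.select()
        reward = 1.0 if arm == best_arm else 0.0
        bandit.update(arm, reward)
        choices.append(arm)
    return bandit.restarts, choices
```

Calling `simulate()` returns the number of restarts together with the sequence of chosen arms. In this toy setting the learner never needs to be told when or how often the reward function changes; it only reacts to evidence of a shift, which is the property the abstract emphasizes.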