In stochastic contextual bandits, an agent sequentially selects actions from a time-dependent action set, based on past experience, to minimize the cumulative regret. As with many other machine learning algorithms, the performance of bandit algorithms depends heavily on their hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, offline tuning methods such as cross-validation are infeasible in the bandit setting, because decisions must be made in real time. To address this challenge, we propose the first online continuous hyperparameter tuning framework for contextual bandits, which learns the optimal parameter configuration within a search space on the fly. Specifically, we use a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and formulate hyperparameter optimization as a non-stationary continuum-armed bandit, where each arm represents a combination of hyperparameters and the corresponding reward is the result the algorithm obtains under that configuration. For the top layer, we propose the Zooming TS algorithm, which uses Thompson Sampling (TS) for exploration and a restart technique to cope with the switching environment. The proposed CDT framework can be readily used to tune contextual bandit algorithms without any pre-specified candidate set of hyperparameters. We further show that it achieves sublinear regret in theory and consistently outperforms existing methods on both synthetic and real datasets in practice.
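The double-layer idea described above can be illustrated with a minimal sketch. It is not the CDT or Zooming TS algorithm itself: it assumes a fixed grid of candidate values in place of adaptive zooming over a continuous space, a Gaussian Thompson Sampling top layer with a periodic restart, and a simple non-contextual UCB rule as a stand-in for the contextual bandit in the bottom layer. All names (`RestartingGaussianTS`, `ucb_index`, `run`) and all numeric values are hypothetical.

```python
import math
import random


class RestartingGaussianTS:
    """Top layer: Thompson Sampling over a fixed grid of hyperparameter values.

    Assumption: the grid is fixed (no adaptive zooming), and all statistics
    are discarded every `restart_every` rounds to track a changing environment.
    """

    def __init__(self, candidates, restart_every, rng):
        self.candidates = list(candidates)
        self.restart_every = restart_every
        self.rng = rng
        self.t = 0
        self._reset()

    def _reset(self):
        self.counts = [0] * len(self.candidates)
        self.sums = [0.0] * len(self.candidates)

    def select(self):
        # Restart: forget old statistics at the start of each epoch.
        if self.t > 0 and self.t % self.restart_every == 0:
            self._reset()
        self.t += 1
        samples = []
        for n, s in zip(self.counts, self.sums):
            mean = s / (n + 1)            # posterior mean under a N(0, 1) prior
            std = 1.0 / math.sqrt(n + 1)  # posterior std with unit noise variance
            samples.append(self.rng.gauss(mean, std))
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, index, reward):
        self.counts[index] += 1
        self.sums[index] += reward


def ucb_index(total, pulls, t, alpha):
    """Bottom-layer score: empirical mean plus an alpha-scaled bonus."""
    if pulls == 0:
        return float("inf")  # try every arm at least once
    return total / pulls + alpha * math.sqrt(math.log(t) / pulls)


def run(horizon=2000, seed=0):
    rng = random.Random(seed)
    arm_means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli reward probabilities
    tuner = RestartingGaussianTS([0.01, 0.1, 1.0, 3.0], restart_every=500, rng=rng)
    pulls = [0] * len(arm_means)
    totals = [0.0] * len(arm_means)
    cumulative = 0.0
    for t in range(1, horizon + 1):
        h = tuner.select()           # top layer picks a hyperparameter
        alpha = tuner.candidates[h]
        arm = max(
            range(len(arm_means)),
            key=lambda a: ucb_index(totals[a], pulls[a], t, alpha),
        )                            # bottom layer picks an action
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        tuner.update(h, reward)      # the same reward feeds the top layer
        cumulative += reward
    return cumulative, tuner


if __name__ == "__main__":
    total_reward, _ = run()
    print("cumulative reward:", total_reward)
```

In this sketch the reward observed by the bottom layer is passed unchanged to the top layer, so each hyperparameter value is scored by how well the bandit performs while that value is in use. The restart simply discards the top layer's statistics at fixed intervals, which is one plain way to react to an environment that changes over time.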