In stochastic contextual bandit problems, an agent sequentially selects actions from a time-dependent action set, based on past experience, to minimize the cumulative regret. As with many other machine learning algorithms, the performance of bandit algorithms depends heavily on several hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, offline tuning methods such as cross-validation are infeasible in the bandit setting, because decisions must be made in real time. To address this challenge, we propose the first online continuous hyperparameter tuning framework for contextual bandits, which learns the optimal parameter configuration within a search space on the fly. Specifically, we use a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and formulate hyperparameter optimization as a non-stationary continuum-armed bandit, where each arm represents a combination of hyperparameters and the corresponding reward is the algorithm's resulting performance. For the top layer, we propose the Zooming TS algorithm, which uses Thompson Sampling (TS) for exploration and a restart technique to cope with the switching environment. The proposed CDT framework can be easily used to tune contextual bandit algorithms without any pre-specified candidate set of hyperparameters. We further show that it achieves sublinear regret in theory and consistently performs better on both synthetic and real datasets in practice.
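To make the top-layer idea more concrete, the following is a minimal sketch, not the paper's Zooming TS algorithm, of Thompson Sampling over adaptively activated arms on a continuous interval, with periodic restarts to cope with a changing environment. Several parts are assumptions made for illustration: the single hyperparameter normalized to [0, 1], the fixed epoch length, the form of the confidence radius, the Gaussian posterior draws, and the names `tune_online` and `shifting_reward`. The bottom-layer contextual bandit is replaced by a hypothetical noise-free reward function whose best hyperparameter value shifts halfway through the run.

```python
import numpy as np


def shifting_reward(x, t):
    """Hypothetical stand-in for the bottom-layer bandit's reward.

    The best hyperparameter value jumps from 0.2 to 0.8 at round 1500.
    """
    peak = 0.2 if t < 1500 else 0.8
    return 1.0 - abs(x - peak)


def tune_online(reward_fn, horizon=3000, epoch_len=1000, grid_size=101, seed=0):
    """Choose one hyperparameter value in [0, 1] per round; return all choices."""
    rng = np.random.default_rng(seed)
    # Candidate points used only to detect regions no active arm covers.
    grid = np.linspace(0.0, 1.0, grid_size)
    choices = np.empty(horizon)
    for t in range(horizon):
        if t % epoch_len == 0:
            # Restart: discard all statistics so stale rewards are forgotten.
            arms, counts, sums = [], [], []
        # Assumed confidence radius: shrinks as an arm is pulled more often.
        radii = [np.sqrt(2.0 * np.log(epoch_len) / (n + 1)) for n in counts]
        # Activation: open a new arm at the first grid point left uncovered.
        covered = np.zeros(grid_size, dtype=bool)
        for x, r in zip(arms, radii):
            covered |= np.abs(grid - x) <= r
        if not covered.all():
            arms.append(float(grid[np.argmin(covered)]))
            counts.append(0)
            sums.append(0.0)
        # Thompson Sampling: one Gaussian draw per active arm, play the largest.
        n = np.asarray(counts)
        means = np.asarray(sums) / np.maximum(n, 1)
        samples = rng.normal(means, 1.0 / np.sqrt(n + 1))
        i = int(np.argmax(samples))
        reward = reward_fn(arms[i], t)
        counts[i] += 1
        sums[i] += reward
        choices[t] = arms[i]
    return choices


if __name__ == "__main__":
    picked = tune_online(shifting_reward)
    print("mean choice in the last 200 rounds:", picked[-200:].mean())
```

In a full double-layer setup, `reward_fn` would instead run one round of the contextual bandit being tuned with the chosen hyperparameter value and return the reward it observes. The restart schedule and radius constants here are placeholders that would need to follow the analysis in the paper.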