We consider the problem of model selection for general stochastic contextual bandits under the realizability assumption. We propose a successive-refinement-based algorithm called Adaptive Contextual Bandit ({\ttfamily ACB}), which works in phases and successively eliminates model classes that are too simple to fit the given instance. We prove that this algorithm is adaptive, i.e., its regret rate matches, order-wise, that of any contextual bandit algorithm with provable guarantees (e.g., \cite{falcon}) that requires knowledge of the true model class. The price of not knowing the correct model class turns out to be only an additive term contributing to the second-order term in the regret bound. This cost possesses the intuitive property that it becomes smaller as the model class becomes easier to identify, and vice versa. We also show that a much simpler explore-then-commit (ETC) style algorithm obtains a similar regret bound, despite not knowing the true model class. However, as expected, the cost of model selection is higher in ETC than in {\ttfamily ACB}. Furthermore, for the special case of linear contextual bandits, we propose specialized algorithms that obtain sharper guarantees than in the generic setup.
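The phase-wise elimination idea described above can be illustrated with a small toy sketch. This is not the {\ttfamily ACB} algorithm itself: it omits action selection and regret entirely, and only shows how nested model classes that are too simple might be discarded as data accumulates over phases of doubling length. The nested linear classes, the held-out squared-loss test, the threshold value, and all function and parameter names (`fit_mse`, `select_model_class`, `threshold`, `base_len`) are assumptions made for illustration.

```python
import numpy as np


def fit_mse(X_train, y_train, X_val, y_val, d):
    """Least-squares fit on the first d features; return held-out MSE."""
    theta, *_ = np.linalg.lstsq(X_train[:, :d], y_train, rcond=None)
    resid = y_val - X_val[:, :d] @ theta
    return float(np.mean(resid ** 2))


def select_model_class(dims, true_theta, n_phases=6, base_len=50,
                       threshold=0.05, noise=0.1, seed=0):
    """Toy successive elimination over nested linear classes (hypothetical).

    Class d uses only the first d features. After each phase, a class is
    dropped if its held-out loss exceeds that of the largest active class
    by more than `threshold`.
    """
    rng = np.random.default_rng(seed)
    d_max = max(dims)
    active = sorted(dims)
    X_all = np.empty((0, d_max))
    y_all = np.empty(0)

    for phase in range(n_phases):
        n = base_len * 2 ** phase  # phase lengths double
        X = rng.normal(size=(n, d_max))
        y = X @ true_theta + noise * rng.normal(size=n)
        X_all = np.vstack([X_all, X])
        y_all = np.concatenate([y_all, y])

        # Split the accumulated data into a fitting half and a held-out half.
        half = len(y_all) // 2
        losses = {
            d: fit_mse(X_all[:half], y_all[:half],
                       X_all[half:], y_all[half:], d)
            for d in active
        }

        # The largest class is assumed to be realizable; use it as reference.
        ref = losses[active[-1]]
        active = [d for d in active if losses[d] <= ref + threshold]

    # The smallest surviving class is the selected one.
    return active[0], active


# Only the first four coordinates carry signal, so classes 1 and 2 misfit.
true_theta = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
best, survivors = select_model_class([1, 2, 4, 8], true_theta)
print(best, survivors)
```

In this sketch the misspecified classes incur a held-out loss that stays well above the reference, so they are removed early, while the smallest class that contains the true parameter survives. The larger the gap between a too-simple class and the true one, the fewer samples are needed to discard it, which mirrors the intuition that the cost of model selection shrinks when the model class is easier to identify.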