Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\text{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
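To make the $\log M$ dependence concrete, the following is a minimal sketch of the classical exponential-weights (Hedge) learner over $M$ experts under full-information feedback, which is the setting the abstract says is emulated for the online learner. It is not the ALEXP algorithm: there are no bandit actions and no Lasso-based loss estimates, the losses are synthetic, and the function names `hedge_weights` and `run_hedge` are hypothetical. For simplicity it also uses a fixed learning rate tuned to a known horizon $n$, which ALEXP is stated not to require. For losses in $[0,1]$ and $\eta = \sqrt{8 \ln M / n}$, the standard Hedge guarantee is a regret of at most $\sqrt{n \ln M / 2}$.

```python
import math
import random


def hedge_weights(cum_losses, eta):
    """Exponential-weights distribution over experts from cumulative losses."""
    base = min(cum_losses)  # shift by the minimum for numerical stability
    w = [math.exp(-eta * (c - base)) for c in cum_losses]
    total = sum(w)
    return [x / total for x in w]


def run_hedge(loss_rows, eta):
    """Run Hedge with full-information feedback.

    loss_rows[t][j] is the loss of expert j at round t, assumed to lie in [0, 1].
    Returns (expected cumulative loss of the learner, cumulative loss of the best expert).
    """
    num_experts = len(loss_rows[0])
    cum = [0.0] * num_experts
    learner_loss = 0.0
    for losses in loss_rows:
        p = hedge_weights(cum, eta)
        # Expected loss of sampling an expert from p this round.
        learner_loss += sum(pi * li for pi, li in zip(p, losses))
        # Full information: every expert's loss is observed and accumulated.
        cum = [c + l for c, l in zip(cum, losses)]
    return learner_loss, min(cum)


if __name__ == "__main__":
    rng = random.Random(0)
    n, M = 2000, 64  # horizon and number of experts (illustrative values)
    # Synthetic losses in [0, 1]; expert 0 is slightly better on average.
    rows = [
        [rng.random() * (0.8 if j == 0 else 1.0) for j in range(M)]
        for _ in range(n)
    ]
    eta = math.sqrt(8 * math.log(M) / n)
    learner, best = run_hedge(rows, eta)
    print("regret:", learner - best)
    print("bound sqrt(n ln M / 2):", math.sqrt(n * math.log(M) / 2))
```

The point of the sketch is only that, when every expert's loss is observed each round, the number of experts enters the regret bound through $\ln M$ rather than polynomially.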