We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context; this setting extends the canonical multi-armed bandit to the case where side information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, in which the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized the learnable context processes, namely those for which universal consistency is achievable, and further gave algorithms that ensure universal consistency whenever it is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, contrary to all previously studied settings in online learning, including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable processes for non-stationary rewards is still extremely general (larger than the i.i.d., stationary, or ergodic processes) but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.