We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and on a known context; this extends the canonical multi-armed bandit to settings where side information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, where the underlying reward mechanism is time-invariant, Blanchard et al. (2022) characterized the learnable context processes, i.e., those for which universal consistency is achievable, and further gave algorithms ensuring universal consistency whenever it is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly adversarially, and depending on the learner's actions. We show that optimistic universal learning for contextual bandits with adversarial rewards is impossible in general, contrary to all previously studied settings in online learning, including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various adversarial reward models, and an exact characterization for online rewards. In particular, the set of learnable processes for these reward models is still extremely general (larger than i.i.d., stationary, or ergodic processes) but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new adversarial phenomena.