The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) often struggle with general function classes and large context spaces. In this paper, we study general contextual bandits with an offline regression oracle and propose a simple, generic principle for designing optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). The key innovation of UCCB is to build confidence bounds in policy space, rather than in action space as is done in UCB. We demonstrate that the resulting algorithms are provably optimal and computationally efficient in handling general function classes and large context spaces. Furthermore, we show that the UCCB principle extends seamlessly to infinite-action general contextual bandits, providing the first solutions to these settings when employing an offline regression oracle.
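For contrast, the action-space confidence bounds that the abstract attributes to UCB can be illustrated with the classical UCB1 index for a context-free multi-armed bandit. The following is a minimal sketch of that baseline only, not of UCCB, whose construction the abstract does not specify. The function names `ucb1_index` and `run_ucb1`, the Bernoulli reward model, and the example arm means are illustrative assumptions.

```python
import math
import random


def ucb1_index(mean, pulls, t):
    """Upper confidence bound for a single action at round t.

    Empirical mean plus the standard UCB1 exploration bonus.
    An action that has never been pulled gets an infinite index.
    """
    if pulls == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def run_ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms (hypothetical test environment).

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k      # pull count per action
    means = [0.0] * k    # running empirical mean reward per action

    for t in range(1, horizon + 1):
        # Optimism: pick the action with the largest upper confidence bound.
        arm = max(range(k), key=lambda a: ucb1_index(means[a], pulls[a], t))
        # Simulated Bernoulli reward (an assumption of this sketch).
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / pulls[arm]

    return pulls


if __name__ == "__main__":
    print(run_ucb1([0.2, 0.5, 0.8], horizon=2000))
```

Note that the bound here is maintained per action, with one count and one mean for each arm. This bookkeeping is what becomes difficult once rewards depend on a context through a general function class, which is the setting the abstract targets.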