We consider the development of adaptive, instance-dependent algorithms for interactive decision making (bandits, reinforcement learning, and beyond) that, rather than only performing well in the worst case, adapt to favorable properties of real-world instances for improved performance. We aim for instance-optimality, a strong notion of adaptivity which asserts that, on any particular problem instance, the algorithm under consideration outperforms all consistent algorithms. Instance-optimality enjoys a rich asymptotic theory originating from the work of \citet{lai1985asymptotically,graves1997asymptotically}, but non-asymptotic guarantees have remained elusive outside of certain special cases. Even for problems as simple as tabular reinforcement learning, existing algorithms do not attain instance-optimal performance until the number of rounds of interaction is doubly exponential in the number of states. In this paper, we take the first step toward developing a non-asymptotic theory of instance-optimal decision making with general function approximation. We introduce a new complexity measure, the Allocation-Estimation Coefficient (AEC), and provide a new algorithm, $\mathsf{AE}^2$, which attains non-asymptotic instance-optimal performance at a rate controlled by the AEC. Our results recover the best known guarantees for well-studied problems such as finite-armed and linear bandits and, when specialized to tabular reinforcement learning, attain the first instance-optimal regret bounds with polynomial dependence on all problem parameters, improving over prior work exponentially. We complement these results with lower bounds that show that i) existing notions of statistical complexity are insufficient to derive non-asymptotic guarantees, and ii) under certain technical conditions, boundedness of the AEC is necessary to learn an instance-optimal allocation of decisions in finite time.