A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher complexity) that govern the statistical complexity of learning. However, characterizing the statistical complexity of interactive learning is substantially more challenging due to the adaptive nature of the problem. The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning. In particular, we provide: (1) a lower bound on the optimal regret for any interactive decision making problem, establishing the Decision-Estimation Coefficient as a fundamental limit; and (2) a unified algorithm design principle, Estimation-to-Decisions (E2D), which transforms any algorithm for supervised estimation into an online algorithm for decision making. E2D attains a regret bound that matches our lower bound up to dependence on a notion of estimation performance, thereby achieving optimal sample-efficient learning as characterized by the Decision-Estimation Coefficient. Taken together, these results constitute a theory of learnability for interactive decision making. When applied to reinforcement learning settings, the Decision-Estimation Coefficient recovers essentially all existing hardness results and lower bounds. More broadly, the approach can be viewed as a decision-theoretic analogue of the classical Le Cam theory of statistical estimation; it also unifies a number of existing approaches, both Bayesian and frequentist.