Modern Reinforcement Learning (RL) is more than just learning the optimal policy; alternative learning goals such as exploring the environment, estimating the underlying model, and learning from preference feedback are all of practical importance. While provably sample-efficient algorithms for each specific goal have been proposed, these algorithms often depend strongly on the particular learning goal and thus have correspondingly different structures. It is a pressing open question whether these learning goals can instead be tackled by a single unified algorithm. We make progress on this question by developing a unified algorithmic framework for a large class of learning goals, building on the Decision-Estimation Coefficient (DEC) framework. Our framework handles many learning goals such as no-regret RL, PAC RL, reward-free learning, model estimation, and preference-based learning, all by simply instantiating the same generic complexity measure, called the "Generalized DEC", and a corresponding generic algorithm. The generalized DEC also yields a sample complexity lower bound for each specific learning goal. As applications, we propose "decouplable representation" as a natural sufficient condition for bounding generalized DECs, and use it to obtain many new sample-efficient results (and recover existing results) for a wide range of learning goals and problem classes as direct corollaries. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy sample complexity bounds under structural conditions similar to those of the DEC.