We study sample-efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision processes (MDPs), partially observable Markov decision processes (POMDPs), and predictive state representations (PSRs) as special cases. Toward finding the minimal assumption that enables sample-efficient learning, we propose a novel complexity measure, the generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. Specifically, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, the bilinear class, low witness rank problems, the PO-bilinear class, and generalized regular PSRs; the last of these is a new tractable PSR class that we identify, and it includes nearly all known tractable POMDPs and PSRs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both model-free and model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values, and (ii) we set the log-likelihood function to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL.
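To make the two modifications to posterior sampling concrete, the following is a minimal sketch over a finite hypothesis class. The abstract does not specify the exact form of the prior or the likelihood, so the exponential weighting below, the negative sign on the loss, the scaling parameters `gamma` and `eta`, and all function names are illustrative assumptions and not the algorithm as stated in the paper.

```python
import math
import random


def optimistic_posterior(values, cumulative_losses, gamma=1.0, eta=1.0):
    """Sampling weights over a finite hypothesis class (illustrative only).

    Assumed form: weight(f) is proportional to
        exp(gamma * value(f) - eta * cumulative_loss(f)),
    where exp(gamma * value(f)) plays the role of an optimistic prior and
    -eta * cumulative_loss(f) plays the role of a loss-based log-likelihood.
    """
    scores = [gamma * v - eta * l for v, l in zip(values, cumulative_losses)]
    top = max(scores)  # subtract the max for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def sample_hypothesis(weights, rng):
    """Draw one hypothesis index according to the posterior weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical inputs: the value each hypothesis claims for its own optimal
# policy, and the empirical loss it has accumulated on historical data.
values = [1.0, 2.0, 3.0]
no_data = [0.0, 0.0, 0.0]
with_data = [0.0, 0.0, 5.0]

prior_weights = optimistic_posterior(values, no_data)
posterior_weights = optimistic_posterior(values, with_data)
chosen = sample_hypothesis(posterior_weights, random.Random(0))
```

Before any data arrives, the weights favor the hypothesis with the highest claimed value, which is what drives exploration in this sketch. Once that hypothesis accumulates a large loss on the historical data, its weight drops below those of hypotheses that fit the data better.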