In reinforcement learning, the classic objectives of maximizing discounted and finite-horizon cumulative rewards are PAC-learnable: there exist algorithms that learn a near-optimal policy with high probability using a finite number of samples and a finite amount of computation. In recent years, researchers have introduced objectives, and corresponding reinforcement-learning algorithms, that go beyond the classic cumulative rewards, such as objectives specified as linear temporal logic formulas. However, questions about the PAC-learnability of these new objectives have remained open. This work establishes sufficient conditions for the PAC-learnability of general reinforcement-learning objectives in two analysis settings. In particular, for the analysis that considers only sample complexity, we prove that if an objective given as an oracle is uniformly continuous, then it is PAC-learnable. Further, for the analysis that considers computational complexity, we prove that if an objective is computable, then it is PAC-learnable. In other words, if a procedure computes successive approximations of the objective's value, then the objective is PAC-learnable. We apply our condition to three objectives from the literature whose PAC-learnability was previously unknown and prove that each of them is PAC-learnable. Overall, our results help verify the PAC-learnability of existing objectives. Moreover, because some previously studied objectives that are not uniformly continuous have been shown not to be PAC-learnable, our results could guide the design of new PAC-learnable objectives.
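As a rough illustration of what "a procedure that computes successive approximations of the objective's value" can look like, consider the classic discounted objective. The sketch below is not taken from the paper; the function names and the assumption that rewards are bounded in absolute value by `r_max` are ours. It relies only on the standard fact that truncating a discounted sum after `horizon` steps changes its value by at most `gamma**horizon * r_max / (1 - gamma)`, so a longer prefix of the reward sequence yields a provably tighter approximation.

```python
def truncated_discounted_return(rewards, gamma, horizon):
    """Approximate the discounted return using only the first `horizon` rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards[:horizon]))


def truncation_error_bound(gamma, horizon, r_max):
    """Upper bound on |full return - truncated return|, assuming |r_t| <= r_max."""
    return (gamma ** horizon) * r_max / (1.0 - gamma)


def horizon_for_accuracy(gamma, r_max, epsilon):
    """Smallest horizon whose truncation error bound is at most epsilon.

    Assumes 0 <= gamma < 1 and epsilon > 0, so the loop terminates.
    """
    horizon = 0
    while truncation_error_bound(gamma, horizon, r_max) > epsilon:
        horizon += 1
    return horizon


if __name__ == "__main__":
    gamma, r_max, epsilon = 0.9, 1.0, 0.01
    h = horizon_for_accuracy(gamma, r_max, epsilon)
    rewards = [1.0] * 1000  # hypothetical reward sequence, for illustration only
    approx = truncated_discounted_return(rewards, gamma, h)
    print(h, approx, truncation_error_bound(gamma, h, r_max))
```

Intuitively, the value of this objective depends only weakly on rewards far in the future, which is the kind of behavior the uniform-continuity condition is meant to capture; the paper's formal definitions should be consulted for the precise statement.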