Recently, there has been remarkable progress in reinforcement learning (RL) with general function approximation. However, existing works provide only regret or sample complexity guarantees. It remains an open question whether one can achieve a stronger performance guarantee, namely the uniform probably approximately correct (Uniform-PAC) guarantee, which implies both a sub-linear regret bound and a polynomial sample complexity for any target learning accuracy. We study this problem by proposing algorithms for both nonlinear bandits and model-based episodic RL with a general function class of bounded eluder dimension. The key idea of the proposed algorithms is to assign actions to different levels according to their widths with respect to the confidence set. The resulting Uniform-PAC sample complexity is tight in the sense that it matches the state-of-the-art regret bounds or sample complexity guarantees when reduced to the linear case. To the best of our knowledge, this is the first work to establish Uniform-PAC guarantees for bandits and RL beyond the linear case.
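To make the level-assignment idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a finite function class stored as a NumPy array, takes the width of an action to be the largest disagreement between functions still in the confidence set, and uses a dyadic level rule (level l when the width lies in (2^-l, 2^-(l-1)]). The names `action_widths` and `assign_levels`, the `max_level` cap, and the dyadic thresholds are all illustrative assumptions.

```python
import math

import numpy as np


def action_widths(values, in_confidence_set):
    """Width of each action over the confidence set.

    values: array of shape (n_functions, n_actions), values[i, a] = f_i(a).
    in_confidence_set: boolean mask of shape (n_functions,).
    The width is taken to be max_f f(a) - min_f f(a) over the masked functions
    (an assumed, simplified definition for a finite class).
    """
    active = values[in_confidence_set]
    return active.max(axis=0) - active.min(axis=0)


def assign_levels(widths, max_level):
    """Assign each action a dyadic level based on its width.

    An action with width w gets level l such that 2**-l < w <= 2**-(l - 1),
    clipped to the range [1, max_level]. Zero-width actions get max_level.
    This thresholding rule is a hypothetical choice for illustration.
    """
    levels = np.empty(len(widths), dtype=int)
    for a, w in enumerate(widths):
        if w <= 0:
            levels[a] = max_level
        else:
            level = math.floor(-math.log2(w)) + 1
            levels[a] = min(max(level, 1), max_level)
    return levels


# Three candidate functions evaluated on three actions (made-up numbers).
values = np.array([
    [0.0, 0.5, 0.9],
    [1.0, 0.5, 0.8],
    [0.2, 0.5, 0.6],
])
mask = np.array([True, True, True])
levels = assign_levels(action_widths(values, mask), max_level=5)
```

In this toy example the first action, on which the functions disagree most, lands in the coarsest level, while the second action, on which they all agree, lands in the finest one. Intuitively, a larger width means the confidence set is less certain about that action's value.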