A crucial problem in reinforcement learning is learning the optimal policy. We study this problem in tabular infinite-horizon discounted Markov decision processes in the online setting. Existing algorithms either fail to achieve regret optimality or incur high memory and computational costs. In addition, existing optimal algorithms all require a long burn-in time to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless the sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.