Learning an optimal policy is a central problem in reinforcement learning. We study it in tabular infinite-horizon discounted Markov decision processes in the online setting. Existing algorithms either fail to achieve regret optimality or incur high memory and computational costs. Moreover, all existing optimal algorithms require a long burn-in time to reach optimal sample efficiency; that is, their optimality is guaranteed only once the sample size exceeds a high threshold. We address both open problems with a model-free algorithm that combines variance reduction with a novel technique for switching the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, and it has the additional benefit of a low burn-in time.
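To make the idea of a slowly switching execution policy concrete, the following is a minimal illustrative sketch and not the algorithm proposed here. It runs optimistic tabular Q-learning on a discounted MDP and changes the action executed in a state only when a visit count reaches a power of two. The doubling rule, the bonus form, the learning-rate schedule, and every name (`lazy_q_learning`, `bonus_scale`, the toy `P` and `R`) are assumptions chosen for illustration; variance reduction is omitted entirely.

```python
import math
import random


def lazy_q_learning(P, R, gamma=0.9, steps=5000, bonus_scale=0.5, seed=0):
    """Toy optimistic Q-learning with infrequent execution-policy switches.

    P[s][a] is a list of next-state probabilities; R[s][a] is a reward in [0, 1].
    Returns (Q, policy, switches).
    """
    rng = random.Random(seed)
    S, A = len(P), len(P[0])
    v_max = 1.0 / (1.0 - gamma)            # upper bound on any discounted value
    Q = [[v_max] * A for _ in range(S)]    # optimistic initialization
    N = [[0] * A for _ in range(S)]        # visit counts per (state, action)
    policy = [0] * S                       # execution policy, held fixed between switches
    switches = 0
    s = 0
    for _ in range(steps):
        a = policy[s]
        s_next = rng.choices(range(S), weights=P[s][a])[0]
        N[s][a] += 1
        n = N[s][a]
        lr = (v_max + 1.0) / (v_max + n)           # assumed step-size schedule
        bonus = bonus_scale * math.sqrt(v_max / n)  # assumed exploration bonus
        target = R[s][a] + bonus + gamma * max(Q[s_next])
        Q[s][a] = min(v_max, (1.0 - lr) * Q[s][a] + lr * target)
        # Stand-in switching rule: reconsider the executed action in state s
        # only when the visit count n is a power of two.
        if (n & (n - 1)) == 0:
            greedy = max(range(A), key=lambda b: Q[s][b])
            if greedy != policy[s]:
                policy[s] = greedy
                switches += 1
        s = s_next
    return Q, policy, switches


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP used only to exercise the sketch.
    P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
    R = [[0.0, 0.1], [0.5, 1.0]]
    Q, policy, switches = lazy_q_learning(P, R)
    print(policy, switches)
```

Because the executed action in a state can change only at power-of-two visit counts, the number of switches grows at most logarithmically in the number of steps for each state-action pair, which is the sense in which this toy policy is "slow" to switch. The sketch stores only the Q-table and the counts, with no transition model, which is what makes it model-free.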