Reinforcement learning algorithms are usually stated without theoretical guarantees on their performance. Recently, Jin, Yang, Wang, and Jordan (COLT 2020) gave a polynomial-time reinforcement learning algorithm (namely, LSVI-UCB) for the setting of linear Markov decision processes, together with theoretical guarantees on its running time and regret. In real-world scenarios, however, the space usage of this algorithm can be prohibitive due to the linear regression step it performs. We propose and analyze two modifications of LSVI-UCB, which alternate between periods of learning and non-learning, to reduce space and time usage while maintaining sublinear regret. We show experimentally, on synthetic data and real-world benchmarks, that our algorithms achieve low space usage and running time without significantly sacrificing regret.
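To make the phase-alternation idea concrete, below is a minimal sketch (not the authors' algorithm) of an episodic agent that updates ridge-regression summary statistics only during designated learning episodes and acts with frozen weights otherwise, so that its memory footprint is dominated by a d x d matrix rather than by stored trajectories. The feature map, horizon, phase schedule, and reward targets are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch, assuming a toy feature map and phase schedule; this is NOT
# LSVI-UCB or the paper's algorithm, only an illustration of alternating
# learning / non-learning phases to bound space usage.

d = 8             # feature dimension (assumed)
lam = 1.0         # ridge regularization parameter (assumed)
Lambda = lam * np.eye(d)   # d x d Gram matrix: O(d^2) space, independent of K
b = np.zeros(d)            # regression targets accumulator
w = np.zeros(d)            # frozen weights used for acting

def phi(state, action):
    """Hypothetical feature map, fixed per (state, action) pair."""
    rng = np.random.default_rng(hash((state, action)) % (2**32))
    return rng.standard_normal(d) / np.sqrt(d)

def in_learning_phase(k, period=10, learn_len=3):
    """Toy schedule: learn for `learn_len` episodes out of every `period`."""
    return (k % period) < learn_len

for k in range(100):                      # K episodes
    if in_learning_phase(k):
        # Learning episode: fold each transition into the summary statistics
        # (no trajectory storage), then refresh the weights by a ridge solve.
        for step in range(5):             # toy horizon
            s, a = step, step % 2                      # stand-in state/action
            target = np.random.rand()                  # stand-in regression target
            f = phi(s, a)
            Lambda += np.outer(f, f)
            b += target * f
        w = np.linalg.solve(Lambda, b)    # ridge regression solve
    else:
        # Non-learning episode: act greedily with frozen w; nothing is stored.
        for step in range(5):
            s = step
            a = max((0, 1), key=lambda act: phi(s, act) @ w)
```

The point of the sketch is the control flow, not the statistics: during non-learning episodes the agent performs no regression and retains no data, which is where the space and time savings of such a scheme would come from.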