This technical note introduces a new finite-time analysis of tabular temporal difference (TD) learning based on discrete-time stochastic linear system models. TD-learning is a fundamental reinforcement learning (RL) algorithm that evaluates a given policy by estimating the corresponding value function of a Markov decision process. Although TD-learning has a long history of successful theoretical analysis, guarantees on its statistical efficiency, in the form of finite-time error bounds, have been established only recently. In this note, we propose a control-theoretic finite-time analysis of tabular TD-learning that directly exploits discrete-time linear system models and standard notions from the control community. The proposed work provides new, simple templates and additional insights for the analysis of TD-learning and other RL algorithms.
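For readers who want a concrete picture of the algorithm being analyzed, the following is a minimal sketch of tabular TD(0) for policy evaluation. It is not taken from the note: the two-state chain `P`, the rewards `r`, the step size `alpha`, the uniform i.i.d. state sampling, and the helper names `td0_tabular` and `true_value` are all illustrative assumptions. Under this sampling model the expected update is affine in the value estimate, E[V_{k+1} | V_k] = (I - alpha D (I - gamma P)) V_k + alpha D r, where D is the diagonal matrix of state-sampling probabilities, which is one way to see why a discrete-time linear system viewpoint is natural. The exact system model used in the note may differ.

```python
import random


def td0_tabular(P, r, gamma, alpha, num_steps, seed=0):
    """Tabular TD(0) with a constant step size.

    Assumption (illustrative): states are sampled i.i.d. and uniformly,
    and the next state is drawn from row P[s] of the transition matrix
    induced by the fixed policy being evaluated.
    """
    rng = random.Random(seed)
    n = len(r)
    states = list(range(n))
    V = [0.0] * n
    for _ in range(num_steps):
        s = rng.randrange(n)                           # sample current state
        s_next = rng.choices(states, weights=P[s])[0]  # sample next state
        td_error = r[s] + gamma * V[s_next] - V[s]     # TD error
        V[s] += alpha * td_error                       # update one table entry
    return V


def true_value(P, r, gamma, sweeps=2000):
    """Solve V = r + gamma * P V by fixed-point iteration (reference value)."""
    n = len(r)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [r[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
             for i in range(n)]
    return V


# Hypothetical two-state Markov chain under a fixed policy.
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.5

V_td = td0_tabular(P, r, gamma, alpha=0.01, num_steps=50000)
V_true = true_value(P, r, gamma)
max_err = max(abs(a - b) for a, b in zip(V_td, V_true))
print("TD estimate:", V_td)
print("True value: ", V_true)
print("Max abs error:", max_err)
```

With a constant step size the iterate does not converge exactly. It fluctuates in a neighborhood of the true value function whose size shrinks with `alpha`, which is the kind of behavior a finite-time error bound is meant to quantify.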