We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
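For concreteness, the tabular QTD update discussed above can be sketched as follows. This is a minimal illustration, assuming midpoint quantile levels τ_i = (2i − 1)/(2m) and a single sampled transition (s, r, s′); all names (qtd_update, theta, and so on) are chosen here for exposition and are not taken from the paper. The sign-based indicator in the increment is what makes the update non-linear and non-contractive, as noted above.

```python
import numpy as np

def qtd_update(theta, s, r, s_next, gamma=0.99, alpha=0.01):
    """One tabular QTD update at state s, given reward r and next state s_next.

    theta: float array of shape (num_states, m) holding quantile estimates,
    where theta[s, i] tracks the tau_i-quantile of the return from state s,
    with tau_i = (2i - 1) / (2m) for i = 1, ..., m.
    """
    m = theta.shape[1]
    taus = (2 * np.arange(m) + 1) / (2 * m)  # midpoint quantile levels
    # Distributional TD targets: one per quantile estimate at the next state.
    targets = r + gamma * theta[s_next]       # shape (m,)
    for i in range(m):
        # The increment depends only on whether each target falls below
        # theta[s, i], i.e. on the sign of the TD error, not its magnitude.
        grad = np.mean(taus[i] - (targets < theta[s, i]).astype(float))
        theta[s, i] += alpha * grad
    return theta

# Tiny usage example: a 2-state problem with m = 5 quantiles per state.
theta = np.zeros((2, 5))
theta = qtd_update(theta, s=0, r=1.0, s_next=1)
```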