Many existing reinforcement learning (RL) methods rely on stochastic gradient iteration, whose stability hinges on the assumption that the data-generating process mixes exponentially fast, with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, so the prescribed step size cannot be computed in practice. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed specifically for infinite-horizon average-reward settings. It neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes exponentially fast mixing, and is therefore readily applicable to problems with slower mixing. Nonetheless, it achieves a convergence rate comparable to that of state-of-the-art AC algorithms. We show experimentally that relaxing the technical conditions required for stability translates to superior performance in practice on RL problems with sparse rewards.
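The abstract does not spell out the form of the multi-level Monte Carlo estimator, so the following is only a minimal sketch of one commonly used construction, under stated assumptions: a level J is drawn from a geometric distribution with parameter 1/2, and the single-sample estimate is corrected by the scaled difference between averages over 2^J and 2^(J-1) consecutive samples. The names `mlmc_estimate`, `sample_fn`, and `t_max` are hypothetical and not taken from the paper. In an actor-critic setting, the samples would be per-step gradient or temporal-difference terms collected along a single trajectory.

```python
import random


def mlmc_estimate(sample_fn, t_max, rng):
    """Sketch of a multi-level Monte Carlo estimate (assumed form, not the paper's code).

    sample_fn(n) is a hypothetical helper returning a list of n consecutive samples.
    t_max caps the number of samples used in a single estimate.
    """
    # Draw the level J ~ Geometric(1/2) on {1, 2, ...}, i.e. P(J = j) = 2^{-j}.
    level = 1
    while rng.random() < 0.5:
        level += 1

    n = 2 ** level
    if n > t_max:
        # Level exceeds the cap: fall back to the single-sample estimate.
        return sample_fn(1)[0]

    samples = sample_fn(n)
    g0 = samples[0]                                # single-sample estimate
    g_fine = sum(samples) / n                      # average over 2^J samples
    g_coarse = sum(samples[: n // 2]) / (n // 2)   # average over 2^(J-1) samples
    # The correction term is weighted by 1 / P(J = level) = 2^J.
    return g0 + n * (g_fine - g_coarse)
```

Under this construction, the expectation of the estimate equals that of an average over the largest admissible power-of-two number of samples, while the expected number of samples drawn per call grows only logarithmically in `t_max`. No mixing-time parameter appears in the estimator itself, which is the property the abstract emphasizes.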