Many existing reinforcement learning (RL) methods rely on stochastic gradient iteration, whose stability hinges on the assumption that the data-generating process mixes exponentially fast, with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, so the prescribed step size cannot be set in practice. In this work, we propose an RL methodology attuned to the mixing time: a multi-level Monte Carlo estimator is used for the critic, the actor, and the average reward within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed for infinite-horizon average-reward settings. It neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes exponential decay, and is therefore readily applicable to problems with slower mixing. Nonetheless, it achieves a convergence rate comparable to that of state-of-the-art AC algorithms. We show experimentally that relaxing the technical conditions required for stability translates to superior performance in practice on RL problems with sparse rewards.
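The abstract does not spell out the form of the multi-level Monte Carlo estimator, so the following is only a minimal sketch of one common randomized-level construction, which we assume here for illustration. It draws a level $J$ with $P(J = j) = 2^{-j}$ and corrects a single-sample estimate with the scaled difference between averages over $2^J$ and $2^{J-1}$ samples, truncated at a budget `t_max`. The names `mlmc_estimate`, `draw`, and `t_max` are hypothetical and not taken from the paper.

```python
import random


def mlmc_estimate(draw, t_max, rng):
    """Randomized multi-level Monte Carlo estimate of a mean.

    draw(n) is assumed to return a list of n consecutive scalar samples
    (hypothetical interface); t_max caps the number of samples per call.
    """
    # Sample the level J with P(J = j) = 2**-j for j = 1, 2, ...
    level = 1
    while rng.random() < 0.5:
        level += 1
    n = 2 ** level

    if n > t_max:
        # Level exceeds the budget: fall back to the single-sample estimate.
        return draw(1)[0]

    xs = draw(n)
    coarse = sum(xs[: n // 2]) / (n // 2)  # average over the first 2**(J-1) samples
    fine = sum(xs) / n                     # average over all 2**J samples
    # Base estimate plus the level correction, reweighted by 1 / P(J = level).
    return xs[0] + n * (fine - coarse)
```

In an actor-critic loop, `draw` would presumably return per-step quantities along the current trajectory (for example, policy-gradient or temporal-difference terms), and each of the critic, actor, and average-reward updates would call the estimator in place of a single-sample estimate. Note that no mixing-time parameter enters the call; only the budget `t_max` is chosen by the user.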