We establish an optimal sample complexity of $O(ε^{-2})$ for obtaining an $ε$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(ε^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce the variance of the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer containing a small fraction of recent samples and sample uniformly from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
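Since the abstract describes the two mechanisms at the level of an algorithm, a minimal sketch may help fix ideas. The following assumes a linear critic $V(s) = φ(s)^\top w$; all names (`CriticBuffer`, `storm_critic_step`, the step size `alpha`, the momentum weight `a_t`) are illustrative assumptions, not the paper's actual notation or interface.

```python
# Minimal sketch: STORM-style variance-reduced critic update drawing from a
# small FIFO buffer of recent transitions. Illustrative only; names and the
# linear-critic setup are assumptions, not the paper's implementation.
import collections
import random
import numpy as np

class CriticBuffer:
    """Fixed-size FIFO buffer holding a small fraction of recent transitions."""
    def __init__(self, capacity):
        self.data = collections.deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self):
        return random.choice(self.data)  # uniform over stored transitions

def td_semigradient(w, transition, gamma):
    """Semi-gradient of the squared TD error for a linear critic V(s) = phi(s)^T w."""
    phi_s, r, phi_next = transition
    td_error = r + gamma * phi_next @ w - phi_s @ w
    return -td_error * phi_s  # descent direction on the TD objective

def storm_critic_step(w, w_prev, d_prev, buffer, gamma, a_t, alpha):
    """One STORM update: the recursive momentum term re-evaluates the gradient
    at the previous parameters on the SAME sampled transition, so the
    correction (d_prev - g_prev) cancels much of the sampling noise."""
    transition = buffer.sample()
    g_now = td_semigradient(w, transition, gamma)
    g_prev = td_semigradient(w_prev, transition, gamma)
    d = g_now + (1.0 - a_t) * (d_prev - g_prev)
    return w - alpha * d, d
```

In this reading, the buffer's small capacity keeps its contents close to the current policy's occupancy measure, so the distribution the critic trains on drifts slowly, which is what lets the STORM correction term remain accurate as the actor evolves.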