In RL, memory models such as RNNs and transformers address Partially Observable Markov Decision Processes (POMDPs) by mapping trajectories to latent Markov states. Neither class of model scales particularly well to long sequences, especially compared to an emerging class of memory models sometimes called linear recurrent models. We discover that the recurrent update of these models forms a monoid, leading us to formally define a novel memory monoid framework. We revisit the traditional approach to batching in recurrent RL, highlighting both theoretical and empirical deficiencies. Leveraging the properties of memory monoids, we propose a new batching method that improves sample efficiency, increases the return, and simplifies the implementation of recurrent loss functions in RL.
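To make the monoid claim concrete, here is a minimal sketch of how a recurrent update can be written as a monoid. It assumes a scalar linear recurrence h_t = a * h_(t-1) + x_t, chosen purely for illustration; it is not one of the specific models or the batching method from this work. The names `Affine`, `combine`, `sequential_scan`, and `tree_reduce` are hypothetical. Each step is stored as a pair (a, b) standing for the map h -> a * h + b. Composing two such maps is associative and has an identity element, so the steps can be regrouped freely without changing the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affine:
    """Hypothetical monoid element (a, b) standing for the update h -> a * h + b."""
    a: float
    b: float


# Identity element: the map h -> h.
IDENTITY = Affine(1.0, 0.0)


def combine(x: Affine, y: Affine) -> Affine:
    """Apply x first, then y: y(x(h)) = y.a * (x.a * h + x.b) + y.b."""
    return Affine(x.a * y.a, y.a * x.b + y.b)


def sequential_scan(elems):
    """Left-to-right running composition, as an ordinary recurrent loop would do."""
    out, acc = [], IDENTITY
    for e in elems:
        acc = combine(acc, e)
        out.append(acc)
    return out


def tree_reduce(elems):
    """Reduce by recursive halving; valid only because combine is associative."""
    if not elems:
        return IDENTITY
    if len(elems) == 1:
        return elems[0]
    mid = len(elems) // 2
    return combine(tree_reduce(elems[:mid]), tree_reduce(elems[mid:]))


# Illustrative values (assumptions): decay 0.5 and four scalar inputs.
decay = 0.5
inputs = [1.0, 2.0, 3.0, 4.0]
elems = [Affine(decay, x) for x in inputs]

final_seq = sequential_scan(elems)[-1]
final_tree = tree_reduce(elems)

# Starting from h_0 = 0, the final state is the b component of the composed map.
print(final_seq.b, final_tree.b)
```

Because the grouping of `combine` calls does not matter, the two halves in `tree_reduce` could be evaluated independently. That property is what allows scan-style evaluation over long sequences in place of a strictly step-by-step loop.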