State construction from sensory observations is an important component of a reinforcement learning agent. One solution for state construction is to use recurrent neural networks. Back-propagation through time (BPTT) and real-time recurrent learning (RTRL) are two popular gradient-based methods for recurrent learning. BPTT requires the complete sequence of observations before it can compute gradients and is therefore unsuitable for online real-time updates. RTRL can perform online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that, either by decomposing the network into independent modules or by learning the network incrementally, we can make the cost of RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the network to achieve scalable learning. We demonstrate the effectiveness of our approach relative to Truncated-BPTT on a benchmark inspired by animal learning and on policy evaluation for pre-trained Rainbow-DQN agents in the Arcade Learning Environment (ALE).
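To make the first constraint concrete, the following is a minimal sketch of RTRL on a network decomposed into independent modules. It is an illustration under simplifying assumptions, not the architecture or implementation used in the paper: each module is assumed to be a single tanh unit with its own input weights, one self-recurrent weight, and a bias, and the names `init_columns`, `rtrl_step`, and `run` are hypothetical. Because no module reads another module's hidden state, the sensitivity of each hidden unit is nonzero only with respect to its own parameters, so the trace stores one number per parameter.

```python
import math

def init_columns(n_columns, n_inputs):
    """Create independent recurrent units (hypothetical toy initialization)."""
    return [{"w": [0.1 * (i + j + 1) for j in range(n_inputs)],  # input weights
             "u": 0.5,                                           # self-recurrent weight
             "b": 0.0}                                           # bias
            for i in range(n_columns)]

def rtrl_step(columns, h, traces, x):
    """Advance every column by one time step and update its RTRL trace.

    traces[i] holds dh_i/dtheta_i for theta_i = (w_i, u_i, b_i), i.e. one
    entry per parameter of column i. No cross-column terms are needed
    because columns do not read each other's hidden state.
    """
    new_h, new_traces = [], []
    for col, h_prev, trace in zip(columns, h, traces):
        pre = sum(w * xi for w, xi in zip(col["w"], x)) + col["u"] * h_prev + col["b"]
        h_new = math.tanh(pre)
        dtanh = 1.0 - h_new * h_new
        # Direct partial derivatives of the pre-activation w.r.t. (w, u, b),
        # treating h_prev as a constant.
        direct = list(x) + [h_prev, 1.0]
        # Chain rule: add the dependence of h_prev on the same parameters.
        new_traces.append([dtanh * (d + col["u"] * t) for d, t in zip(direct, trace)])
        new_h.append(h_new)
    return new_h, new_traces

def run(columns, xs):
    """Process a sequence online; return the final hidden state and traces."""
    h = [0.0] * len(columns)
    traces = [[0.0] * (len(xs[0]) + 2) for _ in columns]
    for x in xs:
        h, traces = rtrl_step(columns, h, traces, x)
    return h, traces
```

In this sketch, the gradient of any per-step loss with respect to the parameters of column i is the loss gradient with respect to h_i multiplied by `traces[i]`, so an update can be made at every step without storing past observations. A fully connected recurrent layer would instead need a trace entry for every (hidden unit, parameter) pair.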