Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) handles this poorly: it is sequential in time, which limits parallelism, and it suffers from vanishing or exploding gradients, which makes long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT obtains these memory labels by training a Transformer-based encoder on a predictive state objective, which retains only the information from the past that is needed to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$-length gradient path between any two tokens, without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks such as language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and to train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
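To make the reduction to one-step supervised learning concrete, the following is a minimal sketch, not the authors' implementation. It assumes the memory labels $m_0, \dots, m_T$ are already available; here they are fabricated by a small random "teacher" cell, purely as a stand-in for the output of the Transformer-based encoder. The tanh cell and the names `cell`, `make_pairs`, `loss_and_grads`, and `train` are all hypothetical and chosen for illustration. The point it shows is that each training pair $(m_t, x_{t+1}) \rightarrow m_{t+1}$ contributes to the loss independently, so the student cell is never unrolled through time.

```python
import math
import random


def cell(params, m, x):
    """Hypothetical nonlinear RNN cell: m_next = tanh(W m + U x + b)."""
    W, U, b = params
    return [math.tanh(sum(W[i][j] * m[j] for j in range(len(m)))
                      + sum(U[i][k] * x[k] for k in range(len(x)))
                      + b[i])
            for i in range(len(b))]


def make_pairs(memories, inputs):
    """Turn labels m_0..m_T and inputs x_0..x_T into ((m_t, x_{t+1}), m_{t+1}) pairs."""
    return [((memories[t], inputs[t + 1]), memories[t + 1])
            for t in range(len(memories) - 1)]


def loss_and_grads(params, pairs):
    """Mean squared one-step error and its gradients with respect to W, U, b."""
    W, U, b = params
    dm, dx, n = len(W[0]), len(U[0]), len(pairs)
    gW = [[0.0] * dm for _ in b]
    gU = [[0.0] * dx for _ in b]
    gb = [0.0] * len(b)
    loss = 0.0
    # Iterations do not depend on each other, so they could run in parallel.
    for (m, x), target in pairs:
        pred = cell(params, m, x)
        for i in range(len(b)):
            err = pred[i] - target[i]
            loss += err * err / n
            # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
            delta = 2.0 * err * (1.0 - pred[i] ** 2) / n
            for j in range(dm):
                gW[i][j] += delta * m[j]
            for k in range(dx):
                gU[i][k] += delta * x[k]
            gb[i] += delta
    return loss, (gW, gU, gb)


def train(pairs, dm, dx, steps=300, lr=0.1):
    """Full-batch gradient descent on the one-step loss; no unrolling of the student."""
    params = ([[0.0] * dm for _ in range(dm)],
              [[0.0] * dx for _ in range(dm)],
              [0.0] * dm)
    history = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, pairs)
        history.append(loss)
        for p, g in zip(params, grads):
            for i in range(dm):
                if isinstance(p[i], list):
                    for j in range(len(p[i])):
                        p[i][j] -= lr * g[i][j]
                else:
                    p[i] -= lr * g[i]
    return params, history


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dm, dx, T = 2, 2, 20

# ASSUMPTION: toy memory labels from a random teacher cell. In SMT these
# labels would instead come from the trained Transformer-based encoder.
teacher = (rand_matrix(dm, dm), rand_matrix(dm, dx), [0.0] * dm)
inputs = rand_matrix(T + 1, dx)
memories = [[0.0] * dm]
for t in range(T):
    memories.append(cell(teacher, memories[-1], inputs[t + 1]))

pairs = make_pairs(memories, inputs)
params, history = train(pairs, dm, dx)
print(f"one-step loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

Because the loss is a sum over independent pairs, the gradient for any weight passes through exactly one application of the cell, which is the sense in which the gradient path has constant length in this toy setting. The sequential loop that builds `memories` exists only to fabricate labels for the demonstration and is not part of the training procedure being illustrated.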