Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintaining constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
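To illustrate the claim that a linear attention mechanism can be written either in a parallel, Transformer-like form or as a recurrence, here is a minimal NumPy sketch. It uses a generic normalized causal linear attention, not the RWKV formulation itself. The positive feature map `elu(x) + 1`, the function names, and the tensor shapes are all assumptions made for this illustration.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumption of this sketch,
    # chosen so that all attention weights are positive).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(q, k, v):
    # q, k: (T, d); v: (T, d_v). All positions are computed at once.
    q, k = feature_map(q), feature_map(k)
    scores = np.tril(q @ k.T)  # (T, T); lower triangle enforces causality
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(q, k, v):
    # Same computation, one step at a time with a fixed-size state.
    q, k = feature_map(q), feature_map(k)
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))  # running sum of outer(k_s, v_s)
    norm = np.zeros(d)                 # running sum of k_s
    out = np.empty_like(v)
    for t in range(T):
        state += np.outer(k[t], v[t])
        norm += k[t]
        out[t] = (q[t] @ state) / (q[t] @ norm)
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))  # three (T=6, d=4) arrays
parallel_out = linear_attention_parallel(q, k, v)
recurrent_out = linear_attention_recurrent(q, k, v)
print(np.allclose(parallel_out, recurrent_out))
```

The parallel form materializes a T×T score matrix here only for clarity. The recurrent form carries a state whose size does not depend on the sequence length, which is the property the abstract refers to as constant computational and memory complexity during inference.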