In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. We then propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput and reduces latency and GPU memory usage without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity: each chunk is encoded in parallel, while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These properties make RetNet a strong successor to the Transformer for large language models. Code will be available at https://aka.ms/retnet.
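The three computation paradigms named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption: a simplified single-head retention with a scalar decay `gamma`, in which the output at position n is the sum over m <= n of gamma^(n-m) (q_n . k_m) v_m, with no positional rotation, normalization, or gating. The function names `retention_parallel`, `retention_recurrent`, and `retention_chunkwise` are hypothetical and are not taken from the released code. The point is only that, under these assumptions, the three forms produce the same outputs while trading off parallelism against state size.

```python
# Simplified single-head retention sketch in pure Python (hypothetical names).
# Assumptions: scalar decay `gamma`; no rotation, normalization, or gating.
# Q and K are lists of d_k-dim vectors, V is a list of d_v-dim vectors.


def retention_parallel(Q, K, V, gamma):
    """All positions at once: O[n] = sum_{m<=n} gamma**(n-m) * (Q[n].K[m]) * V[m]."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal mask: only positions m <= n contribute
            score = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Token by token, carrying a d_k x d_v state whose size does not grow with length."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # State update: S <- gamma * S + k^T v
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        # Output: o = q S
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Parallel form inside each chunk, recurrent state carried across chunks."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # summary of all previous chunks
    out = []
    for start in range(0, len(Q), chunk_size):
        Qc = Q[start:start + chunk_size]
        Kc = K[start:start + chunk_size]
        Vc = V[start:start + chunk_size]
        B = len(Qc)  # the last chunk may be shorter than chunk_size
        inner = retention_parallel(Qc, Kc, Vc, gamma)
        for t in range(B):
            # Contribution of earlier chunks, decayed by gamma**(t+1)
            cross = [
                gamma ** (t + 1) * sum(Qc[t][i] * S[i][j] for i in range(d_k))
                for j in range(d_v)
            ]
            out.append([a + b for a, b in zip(inner[t], cross)])
        # Fold this chunk into the state: S <- gamma**B * S + sum_t gamma**(B-1-t) k_t^T v_t
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma ** B * S[i][j] + sum(
                    gamma ** (B - 1 - t) * Kc[t][i] * Vc[t][j] for t in range(B)
                )
    return out


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))
```

Calling the three functions on the same `Q`, `K`, `V`, and `gamma` and comparing the results with `max_abs_diff` should give differences at the level of floating-point rounding. In this sketch the recurrent form keeps only the fixed-size state `S` per step, which is the property behind the $O(1)$ inference claim, while the chunkwise form mixes both: parallel work within a chunk and one state update per chunk.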