In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. We then propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity: each chunk is encoded in parallel, while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. These properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.
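The abstract does not spell out the retention formula, so the following is only a minimal sketch of how the three computation paradigms can produce identical outputs. It assumes a simplified single-head retention with one scalar decay `gamma`, where the output at position n is the sum over m <= n of gamma^(n-m) (Q_n · K_m) V_m. It leaves out components a full model would need, such as position-dependent rotation, normalization, gating, and multiple heads. The function names, `gamma`, and `chunk_size` are illustrative choices, not the authors' API.

```python
import random


def retention_parallel(Q, K, V, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma**(n-m) * (Q[n].K[m]) * V[m]."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal mask: only positions m <= n contribute
            score = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for c in range(d_v):
                row[c] += score * V[m][c]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, out[n] = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for c in range(d_v):
                S[i][c] = gamma * S[i][c] + k[i] * v[c]
        out.append([sum(q[i] * S[i][c] for i in range(d_k)) for c in range(d_v)])
    return out


def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Chunkwise form: parallel inside each chunk, recurrent state across chunks."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # summary of all previous chunks
    out = []
    for start in range(0, len(Q), chunk_size):
        q_c, k_c, v_c = (X[start:start + chunk_size] for X in (Q, K, V))
        B = len(q_c)  # the last chunk may be shorter than chunk_size
        inner = retention_parallel(q_c, k_c, v_c, gamma)
        for j in range(B):
            decay = gamma ** (j + 1)  # distance from the end of the previous chunk
            cross = [decay * sum(q_c[j][i] * S[i][c] for i in range(d_k))
                     for c in range(d_v)]
            out.append([a + b for a, b in zip(inner[j], cross)])
        # Fold this chunk into the state: decay the old state, add the new tokens.
        for i in range(d_k):
            for c in range(d_v):
                S[i][c] = gamma ** B * S[i][c] + sum(
                    gamma ** (B - 1 - j) * k_c[j][i] * v_c[j][c] for j in range(B))
    return out


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two nested lists."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(8, 4, rng) for _ in range(3))
    par = retention_parallel(Q, K, V, gamma=0.9)
    rec = retention_recurrent(Q, K, V, gamma=0.9)
    chk = retention_chunkwise(Q, K, V, gamma=0.9, chunk_size=3)
    print("parallel vs recurrent:", max_abs_diff(par, rec))
    print("parallel vs chunkwise:", max_abs_diff(par, chk))
```

Under these assumptions, the recurrent form keeps only a `d_k x d_v` state, so the work per decoded token does not grow with sequence length, which is one way to read the $O(1)$ inference claim. The chunkwise form does quadratic work only within each chunk and carries the same fixed-size state between chunks, so total cost grows linearly with the number of chunks.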