Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, compared against base models with up to 8B parameters.
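The trainable key-value lookup described above can be illustrated with a minimal sketch: a query scores every memory key, only the top-k slots are activated, and their values are combined with softmax weights. This is a simplified NumPy illustration of the general sparse-memory idea, not the paper's implementation; the function name, dimensions, and use of a flat key table (rather than efficient product-key factorization) are assumptions for clarity.

```python
import numpy as np

def memory_layer(x, keys, values, k=4):
    """Sparsely activated memory lookup (illustrative sketch).

    Scores the query against all keys, keeps only the top-k slots,
    and returns a softmax-weighted sum of their values. Because only
    k of the n memory slots contribute, parameters grow with n while
    per-token compute stays roughly constant.
    """
    scores = keys @ x                         # similarity to every key, shape (n,)
    topk = np.argpartition(scores, -k)[-k:]   # indices of the k best-matching slots
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                              # softmax over the selected scores only
    return w @ values[topk]                   # weighted value readout, shape (d,)

rng = np.random.default_rng(0)
d, n = 16, 1024                               # query dim, number of memory slots
keys = rng.standard_normal((n, d))            # trainable keys (fixed here for demo)
values = rng.standard_normal((n, d))          # trainable values
x = rng.standard_normal(d)                    # a query vector (e.g., a token state)
out = memory_layer(x, keys, values)
print(out.shape)  # (16,)
```

In a real model the key table is far larger and the lookup is factorized (e.g., product keys) so the top-k search avoids scoring all n keys; the sketch keeps the brute-force scan for readability.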