Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-experts models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
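The trainable key-value lookup described above can be sketched minimally: a query is scored against a table of trainable keys, only the top-k keys are activated, and the output is a score-weighted sum of the corresponding values, so compute stays proportional to k rather than to the table size. The NumPy sketch below is a simplified illustration under assumed toy dimensions; the actual at-scale implementation differs (e.g. it uses structured keys and a parallelized aggregation), and all names here are hypothetical.

```python
import numpy as np

def memory_layer(query, keys, values, k=4):
    """Sparse key-value memory lookup (illustrative sketch):
    score the query against all keys, keep only the top-k,
    softmax the surviving scores, and return the
    score-weighted sum of the selected values."""
    scores = keys @ query                    # (num_keys,) similarity scores
    top = np.argpartition(scores, -k)[-k:]   # indices of the k best keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over top-k scores only
    return w @ values[top]                   # (value_dim,) sparse readout

# Toy sizes: the memory table holds num_keys extra parameters,
# but each lookup touches only k of them.
rng = np.random.default_rng(0)
num_keys, key_dim, value_dim = 1024, 16, 32
keys = rng.standard_normal((num_keys, key_dim))
values = rng.standard_normal((num_keys, value_dim))
query = rng.standard_normal(key_dim)

out = memory_layer(query, keys, values, k=4)
print(out.shape)  # (32,)
```

Because only k rows of `values` participate in each forward pass, enlarging `num_keys` adds parameters (capacity) while the FLOPs of the readout stay essentially flat, which is the trade-off the abstract describes.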