A major limitation on the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate recurrent memory augmentation of pre-trained transformer models to extend input context length while scaling compute linearly. Our approach can store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments on language modeling tasks show that perplexity improves as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks and to enable large-scale context processing for memory-intensive applications.
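As a rough illustration of the mechanism the abstract describes, the sketch below processes a long input segment by segment, carrying a small set of memory tokens between segments so that compute grows linearly with the number of segments rather than quadratically with total length. This is a minimal sketch under stated assumptions, not the paper's implementation: the plain PyTorch nn.TransformerEncoder backbone, the prepend-only memory layout, and all names and sizes (process_long_input, n_mem, seg_len) are hypothetical choices made for illustration.

```python
# Minimal sketch (not the authors' code) of segment-level recurrence with
# memory tokens. The backbone, memory layout, and all sizes are assumptions.
import torch
import torch.nn as nn

d_model, n_mem, seg_len = 64, 4, 16  # assumed toy dimensions

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)

# Learnable memory tokens initialize the recurrent state.
memory = nn.Parameter(torch.randn(1, n_mem, d_model))

def process_long_input(segments, mem):
    """Process segments left to right, carrying memory between them.

    Each step attends only within one segment plus the memory tokens,
    so per-step cost is roughly O((n_mem + seg_len)^2) and total compute
    grows linearly with the number of segments.
    """
    outputs = []
    for seg in segments:                     # seg: (batch, seg_len, d_model)
        x = torch.cat([mem, seg], dim=1)     # prepend memory tokens
        y = backbone(x)                      # full attention within the segment
        mem = y[:, :n_mem, :]                # updated memory carried forward
        outputs.append(y[:, n_mem:, :])      # segment representations
    return torch.cat(outputs, dim=1), mem

# Usage: a long input split into fixed-size segments.
long_input = torch.randn(1, 8 * seg_len, d_model)
segments = long_input.split(seg_len, dim=1)
out, final_mem = process_long_input(segments, memory.expand(1, -1, -1))
print(out.shape, final_mem.shape)  # (1, 128, 64) (1, 4, 64)
```

In a trainable version of this sketch, gradients would flow through the carried memory across segments (backpropagation through time), which is how such a recurrence could learn to retain information over long ranges.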