Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention, as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts down to a fraction of their original sizes. We show that the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures, even across input modalities, with their benefits carrying over to vision and reinforcement learning.
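To make the general idea concrete, the sketch below illustrates one way a learned memory-management network could condition on attention values to decide which cached tokens to keep. This is a minimal illustration under stated assumptions, not the paper's method: the per-token feature construction, the small MLP scorer, the `keep_ratio` parameter, and the names `AttentionMemoryScorer` and `prune_kv_cache` are all hypothetical, and NAMMs are trained with evolutionary optimization rather than the gradient-based setup this module would typically suggest.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMemoryScorer(nn.Module):
    """Hypothetical sketch of an attention-conditioned memory scorer.

    For each cached token it consumes a feature vector derived from the
    attention that token has received and emits a keep/evict score.
    The MLP architecture and feature size are illustrative assumptions.
    """

    def __init__(self, feature_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, attn_features: torch.Tensor) -> torch.Tensor:
        # attn_features: (num_cached_tokens, feature_dim)
        return self.net(attn_features).squeeze(-1)  # (num_cached_tokens,)


def prune_kv_cache(keys, values, attn_matrix, scorer, keep_ratio=0.5):
    """Keep only the highest-scoring cached tokens (hypothetical helper).

    keys, values: (num_cached_tokens, head_dim) cache for one attention head.
    attn_matrix:  (num_queries, num_cached_tokens) attention weights.
    """
    # Crude per-token summary of the attention each cached token received;
    # padded to the scorer's input size. A placeholder for richer features.
    mean_attn = attn_matrix.mean(dim=0, keepdim=True)          # (1, T)
    max_attn = attn_matrix.max(dim=0).values.unsqueeze(0)       # (1, T)
    feats = torch.cat([mean_attn, max_attn], dim=0).T           # (T, 2)
    feats = F.pad(feats, (0, scorer.net[0].in_features - feats.shape[1]))

    scores = scorer(feats)
    num_keep = max(1, int(keep_ratio * keys.shape[0]))
    keep_idx = scores.topk(num_keep).indices.sort().values      # preserve order
    return keys[keep_idx], values[keep_idx]
```

In this sketch a separate scorer would be applied per layer and per attention head, so each head retains its own reduced latent context; how the score features are actually built and thresholded in NAMMs is not specified here.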