The per-token cost of transformer inference grows with context length, which prevents its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even for unbounded context lengths. While this makes it a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial for efficiently managing long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's accuracy from 0.6% to 97.4%. This is achieved with a 4.6x smaller cache than Llama-3.1 8B at a 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.
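As a rough illustration of the three memory systems described above, the following minimal sketch routes key-value pairs through a sliding window, a sparse cache, and a linear-attention hidden state. It is not the paper's implementation. The class `ThreeTierMemory`, the feature map `phi`, the cache sizes, and the exact form of the recall error (here, the distance between a stored value and the value the hidden state returns for its own key) are all assumptions made for illustration, and only the write path is shown.

```python
import math
from collections import deque


def phi(x):
    """Hypothetical positive feature map (ELU + 1); an assumption for this sketch."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


class ThreeTierMemory:
    """Illustrative three-tier key-value memory (all names are hypothetical)."""

    def __init__(self, dim_k, dim_v, window_size=2, sparse_size=2):
        self.window_size = window_size
        self.sparse_size = sparse_size
        self.window = deque()   # (i) recent pairs
        self.sparse = []        # (ii) pairs the hidden state recalls poorly
        # (iii) recurrent hidden state: H accumulates phi(k) v^T, s accumulates phi(k)
        self.H = [[0.0] * dim_v for _ in range(dim_k)]
        self.s = [0.0] * dim_k
        self.absorbed = 0       # number of pairs folded into the hidden state

    def recall(self, k):
        """Value the hidden state alone returns when queried with key k."""
        f = phi(k)
        denom = sum(fi * si for fi, si in zip(f, self.s))
        dim_v = len(self.H[0])
        if denom == 0.0:
            return [0.0] * dim_v
        return [sum(f[i] * self.H[i][j] for i in range(len(f))) / denom
                for j in range(dim_v)]

    def recall_error(self, k, v):
        """Assumed error score: Euclidean distance between v and recall(k)."""
        pred = self.recall(k)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, pred)))

    def _absorb(self, k, v):
        """Fold one pair into the linear-attention hidden state."""
        f = phi(k)
        for i, fi in enumerate(f):
            self.s[i] += fi
            for j, vj in enumerate(v):
                self.H[i][j] += fi * vj
        self.absorbed += 1

    def write(self, k, v):
        """Insert a new pair and cascade overflow from tier (i) to (ii) to (iii)."""
        self.window.append((k, v))
        if len(self.window) > self.window_size:
            # The oldest recent pair becomes a candidate for the sparse cache.
            self.sparse.append(self.window.popleft())
        if len(self.sparse) > self.sparse_size:
            # Keep the hardest pairs cached; absorb the one with the lowest error.
            idx = min(range(len(self.sparse)),
                      key=lambda i: self.recall_error(*self.sparse[i]))
            self._absorb(*self.sparse.pop(idx))


mem = ThreeTierMemory(dim_k=2, dim_v=2)
for t in range(6):
    mem.write([float(t), 1.0], [1.0, float(t % 2)])
print(len(mem.window), len(mem.sparse), mem.absorbed)
```

In this sketch both caches stay at a fixed size no matter how many pairs are written, so the total memory footprint remains constant, while the pairs the hidden state reproduces worst are the ones kept in exact form.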