In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their strong performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially on long sequences. We propose Palimpsa, a self-attention model that treats ICL as a continual learning problem facing a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, in which the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We show that various gated linear attention models emerge from Palimpsa under specific architectural choices and posterior approximations, and that Mamba2 is the special case in which forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.