Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While softmax attention offers unbounded storage at prohibitive quadratic cost, linear variants are more efficient but suffer from limited, fixed-size storage. We introduce Fast-weight Product Key Memory (FwPKM), a sparse fast-weight memory layer that resolves this tension. FwPKM updates sparsely activated parameters at both training and inference time using chunk-level gradient descent on a local memory-rewrite objective. This performs Test-Time Training (TTT)-style gradient updates on activated slots in a sparse memory, enabling rapid memorization and retrieval of many new key-value associations while keeping per-token compute low and fixed. Experiments show that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
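The core mechanism described above — sparse slot activation followed by chunk-level gradient descent on a local memory-rewrite objective — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: it uses a flat key table rather than the product-key factorization, mean-squared-error as the rewrite objective, and illustrative names and hyperparameters (`n_slots`, `top_k`, `write_chunk`) throughout.

```python
# Toy sketch of a sparse fast-weight memory with TTT-style updates.
# NOT the FwPKM implementation: flat keys instead of product keys,
# MSE as a stand-in for the local memory-rewrite objective.
import numpy as np

rng = np.random.default_rng(0)

n_slots, d_key, d_val, top_k = 64, 16, 8, 4
slot_keys = rng.normal(size=(n_slots, d_key))   # fixed addressing keys
slot_vals = np.zeros((n_slots, d_val))          # fast weights, rewritten at test time

def retrieve(q):
    """Read from the memory: softmax-weighted sum over the top-k activated slots."""
    scores = slot_keys @ q
    idx = np.argsort(scores)[-top_k:]           # sparse activation: only k of n_slots
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ slot_vals[idx], idx, w

def write_chunk(queries, targets, lr=0.3, steps=20):
    """Chunk-level gradient descent on the local MSE rewrite objective.
    Each step touches only the top-k activated slots, so per-token
    compute stays fixed regardless of total memory size."""
    for _ in range(steps):
        for q, t in zip(queries, targets):
            pred, idx, w = retrieve(q)
            err = pred - t                      # grad of 0.5 * ||pred - t||^2 w.r.t. pred
            slot_vals[idx] -= lr * np.outer(w, err)  # update activated slots only

# Memorize a small chunk of new key-value associations, then read one back.
qs = rng.normal(size=(8, d_key))
ts = rng.normal(size=(8, d_val))
write_chunk(qs, ts)
recovered, _, _ = retrieve(qs[0])
```

Because the gradient step only touches the `top_k` activated rows of `slot_vals`, write cost is independent of memory size — the property that lets the memory grow large while per-token compute stays low and fixed.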