Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
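As a rough illustration of the mechanism the abstract describes, the following is a minimal sketch of a fast-weight key-value memory whose parameters are updated by one local gradient step per input chunk. It deliberately omits the product-key factorization of PKM, and every name in it (FastWeightMemorySketch, inner_lr, top_k, the chunk size) is an illustrative assumption rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F


class FastWeightMemorySketch(torch.nn.Module):
    """Illustrative fast-weight key-value memory updated by local,
    chunk-level gradient descent (simplified: no product-key factorization)."""

    def __init__(self, dim, num_slots, inner_lr=0.1, top_k=4):
        super().__init__()
        # Fast weights: state tied to the current sequence, updated per chunk.
        self.register_buffer("keys", torch.randn(num_slots, dim) * 0.02)
        self.register_buffer("values", torch.zeros(num_slots, dim))
        self.inner_lr = inner_lr
        self.top_k = top_k

    def read(self, queries):
        # Sparse retrieval: mix the values of the top-k memory slots per query.
        scores = queries @ self.keys.t()                        # (T, num_slots)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)   # (T, k)
        weights = F.softmax(top_scores, dim=-1)                 # (T, k)
        gathered = self.values[top_idx]                         # (T, k, dim)
        return (weights.unsqueeze(-1) * gathered).sum(dim=1)    # (T, dim)

    def write_chunk(self, chunk_keys, chunk_values):
        # Local gradient step: nudge the fast weights so that reading with
        # chunk_keys reproduces chunk_values (one SGD step per chunk).
        keys = self.keys.detach().requires_grad_(True)
        values = self.values.detach().requires_grad_(True)
        scores = chunk_keys @ keys.t()
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)
        pred = (weights.unsqueeze(-1) * values[top_idx]).sum(dim=1)
        loss = F.mse_loss(pred, chunk_values)
        g_keys, g_values = torch.autograd.grad(loss, (keys, values))
        with torch.no_grad():
            self.keys -= self.inner_lr * g_keys
            self.values -= self.inner_lr * g_values


# Hypothetical usage: read (episodic recall) then write, chunk by chunk.
mem = FastWeightMemorySketch(dim=64, num_slots=1024)
x = torch.randn(4096, 64)            # token features for one sequence
for chunk in x.split(128):           # 128-token chunks (illustrative size)
    retrieved = mem.read(chunk)      # retrieve before memorizing the chunk
    mem.write_chunk(chunk, chunk)    # keys == values here purely for illustration
```

The sketch only conveys the read/write pattern; how the actual FwPKM layer forms its keys, values, and queries, and how the inner update interacts with the outer training loop, is specified in the paper itself.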