Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
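The selection-and-reward loop the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper names (`token_entropy`, `select_positions`, `grpo_advantages`) are our own, and the GRPO step shows only the standard group-relative advantage (reward minus group mean, normalized by group standard deviation), omitting the clipped policy-gradient update itself.

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy of the next-token distribution at each position;
    # high entropy = model is uncertain = informative rollout position.
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def select_positions(probs, k):
    # Pick the k highest-entropy positions as rollout starting points.
    ent = token_entropy(probs)
    return np.argsort(ent)[-k:]

def grpo_advantages(rewards):
    # Group-relative advantage: each rollout's sequence-level reward is
    # normalized against the other rollouts in its group (no critic needed).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy usage: 6 prefix positions over a 4-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

pos = select_positions(probs, k=2)          # 2 high-entropy positions
adv = grpo_advantages([0.2, 0.9, 0.5, 0.4]) # 4 rollouts in one group
```

In the full framework, each selected position would seed a group of multi-token rollouts, a self-supervised sequence-level reward would replace the placeholder values above, and the resulting advantages would weight the policy-gradient update of the fast weight model.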