Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce \textsc{AllMem}, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. \textsc{AllMem} enables models to scale effectively to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representational constraints typical of linear memory models but also substantially reduces the computational and memory footprint of long-sequence inference. Furthermore, we implement a memory-efficient fine-tuning strategy that replaces the standard attention layers of a pre-trained model with memory-augmented sliding window layers, allowing any off-the-shelf pre-trained LLM to be converted efficiently into an \textsc{AllMem}-based architecture. Empirical evaluations confirm that our 4k-window model achieves near-lossless performance on LongBench at 37k context, with a marginal 0.83-point drop relative to full attention. Moreover, on InfiniteBench at 128k context, our 8k-window variant outperforms full attention, validating the effectiveness of our parameterized memory in suppressing noise and maintaining robust long-range modeling without the prohibitive cost of global attention.
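The hybrid design described above can be illustrated with a toy sketch: causal attention restricted to a fixed window, paired with a small non-linear memory (a one-hidden-layer MLP) that is updated by gradient steps at inference time, in the spirit of TTT. All names, sizes, and the learning rate here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal attention in which each query attends only to the
    last `window` keys (illustrative, unbatched, single head)."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ v[lo:t + 1]
    return out

class TTTMemory:
    """Toy non-linear memory: a one-hidden-layer MLP whose weights are
    the memory state, updated by one SGD step per (key, value) write --
    a stand-in for the TTT memory network, not the paper's design."""
    def __init__(self, d, hidden=32, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(d, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, d))
        self.lr = lr

    def read(self, key):
        # Recall: forward pass through the MLP.
        return np.tanh(key @ self.W1) @ self.W2

    def write(self, key, value):
        # One gradient step on the reconstruction loss 0.5*||pred - value||^2.
        h = np.tanh(key @ self.W1)
        err = h @ self.W2 - value
        dW2 = np.outer(h, err)
        dW1 = np.outer(key, (err @ self.W2.T) * (1 - h ** 2))
        self.W2 -= self.lr * dW2
        self.W1 -= self.lr * dW1
```

Information older than the window survives only inside the memory weights, which is what lets the local attention stay cheap while the model still recalls distant context.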