Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, yielding high inference efficiency. However, this approach can cause information interference, where information from different tokens conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to "forget" earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short relative to the state size, allowing the model to perform well without ever needing to learn how to forget. We then show that the minimum training length required for the model to learn forgetting scales linearly with the state size, while the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation of current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance on long-context tasks.
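The passkey-retrieval evaluation mentioned above can be sketched as follows. This is a minimal, hypothetical prompt constructor (the paper's exact prompt format is not specified here): it pads a context with filler sentences to a target word count, hides a 5-digit passkey at a random position, and returns the prompt together with the expected answer.

```python
import random

def make_passkey_prompt(context_len_words: int, passkey: str, seed: int = 0):
    """Build a needle-in-a-haystack passkey prompt of roughly the requested
    word count, with the passkey sentence inserted at a random position.

    Hypothetical sketch: filler text and prompt wording are illustrative,
    not the paper's actual evaluation format.
    """
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle = f"The pass key is {passkey}. Remember it."

    # Pad with filler sentences until the target length is reached.
    sentences = []
    while sum(len(s.split()) for s in sentences) < context_len_words:
        sentences.append(filler)

    # Hide the passkey at a random position in the haystack.
    pos = rng.randint(0, len(sentences))
    sentences.insert(pos, needle)

    prompt = " ".join(sentences) + " What is the pass key?"
    return prompt, passkey
```

Sweeping `context_len_words` and checking whether the model still answers correctly gives the "maximum context length for accurate retrieval" that the abstract reports scaling exponentially with state size.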