State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to \textit{why} -- on a mechanistic level -- certain architectures fail and others succeed. To address this, we conduct experiments on AR and find that only Transformers and Based SSM models fully succeed at AR, with Mamba and DeltaNet close behind, while the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction. By contrast, the SSMs seem to compute these associations only at the final state, using a single layer. We further investigate the mechanism underlying the success of Mamba, and find novel evidence that Mamba \textit{does} implement induction: not via the SSM, but via its short convolutions. Further experiments on a new hierarchical retrieval task, Associative Treecall (ATR), show that all architectures learn the same mechanism as they did for AR. Furthermore, we show that Mamba can learn Attention-like induction on ATR when its short convolutions are removed. These results reveal that architectures with similar accuracy may still differ substantively in mechanism, motivating the adoption of mechanistic evaluations.