Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one such benchmark: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token-mixing strategies. We first demonstrate that, unlike in standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect the results reported in previous works and suggests that further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaled in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how individual components affect the performance and optimization stability of Transformers and Mamba.
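To make the associative recall (AR) task concrete, the following is a minimal sketch (not from the paper) of how one synthetic AR example can be generated: a sequence of key-value pairs followed by a query key, where the model must output the value bound to that key. The function name, vocabulary split, and sequence layout are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_ar_example(num_pairs=4, key_vocab=range(10), val_vocab=range(10, 20), seed=0):
    """Build one synthetic associative-recall example (illustrative setup).

    The input is an interleaved sequence [k1, v1, k2, v2, ..., q] where q is
    one of the keys seen earlier; the target is the value paired with q.
    Key and value vocabularies are kept disjoint for clarity.
    """
    rng = random.Random(seed)
    keys = rng.sample(list(key_vocab), num_pairs)        # distinct keys
    vals = [rng.choice(list(val_vocab)) for _ in keys]   # values may repeat
    seq = [tok for pair in zip(keys, vals) for tok in pair]
    query = rng.choice(keys)
    target = vals[keys.index(query)]
    return seq + [query], target

tokens, target = make_ar_example()
```

Solving this requires the model to locate the earlier occurrence of the query key and copy the token that followed it, which is why AR is closely tied to the induction-head mechanism discussed above.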