Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one such benchmark: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token-mixing strategies. We first demonstrate that, unlike in standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect the results reported in previous works and suggests that further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaled in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how individual components affect the performance and optimization stability of Transformers and Mamba.
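To make the associative recall (AR) task concrete, the following is a minimal sketch (not from the paper) of how one synthetic AR example can be generated: a sequence of key-value pairs followed by a query key, where the model must output the value bound to that key. The function name, vocabulary split, and sequence layout are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_ar_example(num_pairs=4, key_vocab=range(10), val_vocab=range(10, 20), seed=0):
    """Build one synthetic associative-recall example (illustrative setup).

    The input is an interleaved sequence [k1, v1, k2, v2, ..., q] where q is
    one of the keys seen earlier; the target is the value paired with q.
    Key and value vocabularies are kept disjoint for clarity.
    """
    rng = random.Random(seed)
    keys = rng.sample(list(key_vocab), num_pairs)        # distinct keys
    vals = [rng.choice(list(val_vocab)) for _ in keys]   # values may repeat
    seq = [tok for pair in zip(keys, vals) for tok in pair]
    query = rng.choice(keys)
    target = vals[keys.index(query)]
    return seq + [query], target

tokens, target = make_ar_example()
```

Solving this requires the model to locate the earlier occurrence of the query key and copy the token that followed it, which is why AR is closely tied to the induction-head mechanism discussed above.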