Transformer-based models have demonstrated exceptional performance across diverse domains, becoming the state-of-the-art solution for sequential machine learning problems. Although we have a general understanding of the fundamental components of the transformer architecture, little is known about how they operate or what their expected dynamics are. Recently, there has been growing interest in the relationship between attention mechanisms and Hopfield networks, which promises to shed light on the statistical physics of transformer networks. To date, however, the dynamical regimes of transformer-like models have not been studied in depth. In this paper, we address this gap using methods developed for the study of asymmetric Hopfield networks in nonequilibrium regimes, namely path integral methods over generating functionals, yielding dynamics governed by concurrent mean-field variables. Assuming 1-bit tokens and weights, we derive analytical approximations for the behavior of large self-attention neural networks coupled to a softmax output, which become exact in the large-size limit. Our findings reveal nontrivial dynamical phenomena, including nonequilibrium phase transitions associated with chaotic bifurcations, even for very simple configurations with only a few encoded features and a very short context window. Finally, we discuss the potential of our analytical approach to improve our understanding of the inner workings of transformer models, potentially reducing computational training costs and enhancing model interpretability.
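As a rough illustration of the kind of system described above, the following is a minimal toy sketch, not the paper's actual derivation: it evolves N binary (1-bit) units under a softmax attention step over a few encoded feature patterns and tracks the mean-field overlaps that play the role of order parameters. The Glauber-style stochastic update, the parameter values, and all variable names are illustrative assumptions.

```python
import numpy as np

# Toy sketch (illustrative assumptions, not the paper's exact equations):
# N binary +/-1 units, P random feature patterns, and a softmax attention
# step over pattern overlaps, loosely mirroring the 1-bit self-attention
# setup coupled to a softmax output described in the abstract.

rng = np.random.default_rng(0)
N, P, beta, T = 1000, 3, 2.0, 50            # size, features, inverse temperature, steps

xi = rng.choice([-1, 1], size=(P, N))       # encoded feature patterns
s = np.sign(xi[0] + 0.3 * rng.standard_normal(N))  # noisy start near pattern 0

for t in range(T):
    m = xi @ s / N                          # mean-field overlaps with each pattern
    a = np.exp(beta * m)
    a /= a.sum()                            # softmax attention over features
    h = a @ xi                              # attention-weighted field on each unit
    p = 1.0 / (1.0 + np.exp(-2.0 * beta * h))       # Glauber flip probabilities
    s = np.where(rng.random(N) < p, 1, -1)  # stochastic 1-bit update
    print(t, np.round(m, 3))                # order parameters over time
```

In this toy setting, sweeping beta (the inverse temperature) moves the overlap dynamics between a disordered regime and regimes locked onto one feature, a crude analogue of the phase transitions the analysis characterizes exactly in the large-size limit.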