As artificial intelligence models have exploded in scale and capability, understanding their internal mechanisms remains a critical challenge. Inspired by the success of dynamical systems approaches in neuroscience, here we propose a novel framework for studying computations in deep learning systems. We focus on the residual stream (RS) in transformer models, conceptualizing it as a dynamical system evolving across layers. We find that activations of individual RS units exhibit strong continuity across layers, even though the RS is a non-privileged basis. Activations in the RS accelerate and grow denser over layers, while individual units trace unstable periodic orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers. These insights bridge dynamical systems theory and mechanistic interpretability, establishing a foundation for a "neuroscience of AI" that combines theoretical rigor with large-scale data analysis to advance our understanding of modern neural networks.
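As a concrete illustration of the framing above, the sketch below reads per-layer residual stream snapshots out of an off-the-shelf transformer and computes two quantities of the kind the abstract alludes to: per-unit continuity across consecutive layers, and the layer-to-layer step size of the full RS trajectory. This is a minimal sketch, not the paper's actual methodology: the model choice (`gpt2`), the example sentence, and the specific continuity and velocity metrics are our own illustrative assumptions, and we treat the hidden states returned at layer boundaries as an approximation to the residual stream.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice (an assumption, not the paper's setup);
# any decoder-only transformer exposing hidden states would work.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The residual stream can be viewed as a dynamical system."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of
# shape (batch, seq_len, d_model): the residual stream read off at
# every layer boundary, starting from the embeddings.
rs = torch.stack(out.hidden_states)  # (L+1, 1, T, D)
rs = rs.squeeze(1)                   # (L+1, T, D)

# Per-unit continuity: cosine similarity between a unit's activation
# pattern (over token positions) at consecutive layers. High values
# correspond to the cross-layer continuity of individual RS units
# reported in the abstract, despite the basis being non-privileged.
a, b = rs[:-1], rs[1:]               # consecutive layer pairs
continuity = torch.nn.functional.cosine_similarity(a, b, dim=1)  # (L, D)
print("mean per-unit cross-layer continuity:", continuity.mean().item())

# Layer-to-layer "velocity" of the trajectory: if this step size grows
# with depth, the RS is accelerating over layers in the sense above.
velocity = (b - a).norm(dim=-1).mean(dim=-1)  # (L,)
print("per-layer step size:", velocity.tolist())
```

Projecting the stacked `rs` tensor onto its leading principal components (e.g., via `torch.pca_lowrank`) would give one way to visualize the reduced-dimensional trajectory and its curvature that the abstract describes, though the paper's own dimensionality-reduction choices may differ.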