Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for such biases, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains, and it fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability measures a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers, and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently induces grokking and accelerates training by approximately $35\%$ and $75\%$, respectively. Our results establish a new connection between signal propagation in neural networks and interpretability, with noise stability emerging as a powerful tool for understanding and improving modern Transformers.
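To make the central quantity concrete, the following is a minimal NumPy sketch of Gaussian noise stability in the Boolean-function-analysis sense: for a $\rho$-correlated copy $y = \rho x + \sqrt{1-\rho^2}\,z$ of a Gaussian input $x$, we estimate the normalized output correlation $\mathbb{E}[f(x)f(y)]/\mathbb{E}[f(x)^2]$. The toy ReLU MLP, the helper names, and the normalization choice are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_mlp(x, W1, W2):
    # toy single-hidden-layer ReLU MLP (one of the layer types analyzed)
    return np.maximum(x @ W1, 0.0) @ W2

d, h = 16, 32
W1 = rng.normal(size=(d, h)) / np.sqrt(d)
W2 = rng.normal(size=(h, 1)) / np.sqrt(h)

def noise_stability(f, rho, n_samples=20000, dim=16):
    # y is a rho-correlated Gaussian copy of x: correlated noise hits
    # all input coordinates simultaneously, not a single token
    x = rng.normal(size=(n_samples, dim))
    z = rng.normal(size=(n_samples, dim))
    y = rho * x + np.sqrt(1.0 - rho**2) * z
    fx, fy = f(x), f(y)
    # normalized output correlation; close to 1 means the model is
    # stable under this level of input noise
    return float(np.mean(fx * fy) / np.mean(fx * fx))

f = lambda x: relu_mlp(x, W1, W2)
# stability increases monotonically as the correlation rho -> 1
s_low, s_high = noise_stability(f, 0.3), noise_stability(f, 0.95)
```

A regularizer in this spirit would penalize the gap between `f(x)` and `f(y)` during training, encouraging the kind of noise-stable (and hence simpler) functions the abstract describes.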