We identify stable regions in the residual stream of Transformers: within a region, the model's output is insensitive to small activation changes, but it becomes highly sensitive at region boundaries. These regions emerge during training and become more sharply defined as training progresses or model size increases, and they appear to be much larger than previously studied polytopes. Our analysis suggests that stable regions align with semantic distinctions: similar prompts cluster within the same region, and activations from the same region lead to similar next-token predictions. This work offers a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
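The kind of sensitivity probe described above can be illustrated with a toy sketch: perturb a residual-stream vector along a direction with increasing magnitude and watch when the greedy next-token prediction first changes. Everything here is an illustrative assumption, not the paper's actual setup — the random linear "unembedding" `W_U`, the dimensions, and the perturbation sweep are stand-ins for a real model's components.

```python
import numpy as np

# Hypothetical probe: a frozen linear "unembedding" W_U maps a residual-stream
# vector to next-token logits (stand-in for a real Transformer's readout).
rng = np.random.default_rng(0)
d_model, vocab = 64, 100
W_U = rng.normal(size=(d_model, vocab))

h = rng.normal(size=d_model)            # a residual-stream activation
direction = rng.normal(size=d_model)    # random perturbation direction
direction /= np.linalg.norm(direction)

base_token = np.argmax(h @ W_U)         # greedy prediction at the base point

# Sweep the perturbation magnitude; within a stable region the argmax
# prediction stays constant, and the first flip marks a region boundary.
alphas = np.linspace(0.0, 5.0, 51)
flips = [np.argmax((h + a * direction) @ W_U) != base_token for a in alphas]

boundary = next((a for a, f in zip(alphas, flips) if f), None)
print(boundary)
```

In a real model one would replace `W_U` with the composition of the remaining layers and the unembedding, and measure output divergence (e.g. KL) rather than only argmax flips, but the qualitative picture — a flat plateau followed by an abrupt change — is the phenomenon the abstract describes.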