We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes but exhibits high sensitivity at region boundaries. These regions emerge during training and become more sharply defined as training progresses or model size increases. They appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions: similar prompts cluster within regions, and activations from the same region lead to similar next-token predictions. This work offers a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
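The notion of stability described above can be illustrated with a minimal, self-contained sketch: call a residual-stream activation "stable" if small random perturbations never change the argmax next-token prediction. The toy unembedding matrix, dimensions, and helper names below are hypothetical illustrations, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 16, 50  # toy dimensions (hypothetical, for illustration only)
W_unembed = rng.normal(size=(d_model, vocab))  # toy unembedding matrix

def next_token(resid):
    """Map a residual-stream activation to its argmax next token."""
    return int(np.argmax(resid @ W_unembed))

def is_stable(resid, eps, trials=100):
    """Return True if random perturbations of norm eps never change
    the predicted next token across all trials."""
    base = next_token(resid)
    for _ in range(trials):
        delta = rng.normal(size=d_model)
        delta *= eps / np.linalg.norm(delta)  # rescale to norm eps
        if next_token(resid + delta) != base:
            return False
    return True

resid = rng.normal(size=d_model)
print(is_stable(resid, eps=1e-6))   # tiny perturbations: expected stable
print(is_stable(resid, eps=100.0))  # huge perturbations: expected unstable
```

Sweeping `eps` between these extremes gives a crude estimate of the distance from an activation to the nearest region boundary, the kind of sensitivity transition the abstract describes.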