Transformers have significantly advanced the field of natural language processing, but comprehending their internal mechanisms remains a challenge. In this paper, we introduce a novel geometric perspective that elucidates the inner mechanisms of transformer operations. Our primary contribution is illustrating how layer normalization confines the latent features to a hyper-sphere, subsequently enabling attention to mold the semantic representation of words on this surface. This geometric viewpoint seamlessly connects established properties such as iterative refinement and contextual embeddings. We validate our insights by probing a pre-trained 124M parameter GPT-2 model. Our findings reveal clear query-key attention patterns in early layers and build upon prior observations regarding the subject-specific nature of attention heads at deeper layers. Harnessing these geometric insights, we present an intuitive understanding of transformers, depicting them as processes that model the trajectory of word particles along the hyper-sphere.
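To make the hyper-sphere claim concrete, the following is a minimal sketch rather than the authors' code. It assumes layer normalization without its learned scale and bias and with the stabilizing epsilon set to zero. Under those assumptions every output vector has mean zero and norm exactly sqrt(d), so it lies on a sphere of radius sqrt(d) inside the hyperplane orthogonal to the all-ones vector. The function name `layer_norm`, the random inputs, and the batch of five vectors are illustrative choices; d = 768 matches the hidden size of the 124M parameter GPT-2 model.

```python
import numpy as np


def layer_norm(x, eps=0.0):
    """Layer normalization without learned scale and bias (an assumption).

    eps defaults to 0 so the geometry is exact; real implementations
    use a small positive eps, which makes the norm slightly below sqrt(d).
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance (ddof=0)
    return (x - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
d = 768  # hidden size of GPT-2 124M

# Five random vectors with very different scales (illustrative data only).
x = rng.normal(size=(5, d)) * rng.uniform(0.1, 10.0, size=(5, 1))

y = layer_norm(x)
norms = np.linalg.norm(y, axis=-1)  # distance of each output from the origin
sums = y.sum(axis=-1)               # zero when y is orthogonal to the ones vector

print("input norms: ", np.linalg.norm(x, axis=-1))
print("output norms:", norms)
print("sqrt(d):     ", np.sqrt(d))
```

Whatever the input scale, the output norms all equal sqrt(d). With the learned scale and bias included, the sphere would be stretched and shifted into an ellipsoid, which this sketch does not model.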