We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics - Precision, Recall, and F-score - to quantify the separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention-weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, and sink similarity correlates with Recall. We also find that in LLaMA-2-7B, heads specialize into three regimes - Retriever, Mixer, and Reset - with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head-level interpretability and informing geometry-aware sparsification and attention design in LLMs.
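To make the top-N selection lens concrete, the sketch below computes Precision, Recall, and F-score for a single head at one query position. It is a minimal illustration under assumed definitions: the "ground truth" set is the top-N tokens by attention weight, and the geometric classifier predicts membership by cosine similarity to the centroid of the selected value states. The decision rule, the threshold `tau`, and the function name are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def topn_geometry_metrics(attn, V, n=4, tau=0.0):
    """Sketch: geometric Precision/Recall/F for one head at one query position.

    attn : (T,) attention weights over T key tokens
    V    : (T, d) value states
    n    : top-N selection size
    tau  : cosine-similarity threshold for the geometric classifier
           (hypothetical; the paper's exact decision rule may differ)
    """
    # Ground truth: the top-N tokens by attention weight.
    selected = np.zeros(len(attn), dtype=bool)
    selected[np.argsort(attn)[-n:]] = True

    # Geometric prediction in value-state space: a token counts as "selected"
    # if its value state is similar to the centroid of the top-N value states.
    centroid = V[selected].mean(axis=0)
    cos = V @ centroid / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(centroid) + 1e-9
    )
    predicted = cos > tau

    # Standard Precision/Recall/F over the predicted vs. selected sets.
    tp = np.sum(predicted & selected)
    precision = tp / max(predicted.sum(), 1)
    recall = tp / max(selected.sum(), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f
```

Under this framing, high F at small N corresponds to the regime the theory identifies: the selected value states form a tight, well-separated cluster, so the geometric classifier recovers the attention-based selection almost exactly.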