Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. High attention scores are produced precisely when features in queries and keys align within these low-rank subspaces. We first study our method both analytically and empirically in a simplified setting. We then apply it to large language models, identifying human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
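The abstract describes the method only at a high level. As a rough illustration of what a contrastive covariance decomposition of a head's QK space could look like, here is a minimal numpy sketch; the function name `contrastive_qk_subspace`, the high-/low-attention pair construction, and the truncated SVD are illustrative assumptions, not the paper's stated algorithm.

```python
import numpy as np

def contrastive_qk_subspace(Q_pos, K_pos, Q_neg, K_neg, rank=4):
    """Hypothetical sketch of a contrastive covariance QK decomposition.

    Q_pos, K_pos: (n_pos, d_head) query/key vectors from high-attention pairs.
    Q_neg, K_neg: (n_neg, d_head) query/key vectors from low-attention pairs.
    Returns rank-`rank` paired query/key directions whose alignment is
    associated with high attention scores.
    """
    # Cross-covariance between queries and keys, estimated separately on
    # the contrastive (high-attention) and background (low-attention) pairs.
    C_pos = Q_pos.T @ K_pos / len(Q_pos)
    C_neg = Q_neg.T @ K_neg / len(Q_neg)
    # Subtracting the background covariance isolates QK structure that is
    # specific to the feature driving the high-attention pairs.
    C = C_pos - C_neg
    # A truncated SVD yields paired low-rank query/key subspaces; large
    # singular values mark directions where query-key alignment is strongest.
    U, S, Vt = np.linalg.svd(C)
    return U[:, :rank], S[:rank], Vt[:rank].T
```

Under this sketch, projecting a query vector `q` and key vector `k` onto the r-th pair of directions gives a per-component alignment score `S[r] * (q @ U[:, r]) * (k @ V[:, r])`, one hypothetical way to operationalize the abstract's final claim of attributing attention scores to identified features.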