Transformers are widely used to extract complex semantic meaning from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of the hidden states (or embeddings) of trained transformers into interpretable components. For any layer, the embedding vectors of input sequence samples are represented by a tensor $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$. Given the embedding vector $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions, respectively (each taken relative to the global mean), and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, we empirically find pervasive mathematical structure: (1) $(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral shape across layers; (2) $(\mathbf{ctx}_c)_c$ shows clear cluster structure that falls into context topics; and (3) $(\mathbf{pos}_t)_{t}$ and $(\mathbf{ctx}_c)_c$ are mutually incoherent, namely $\mathbf{pos}_t$ is almost orthogonal to $\mathbf{ctx}_c$, a property that is canonical in compressed sensing and dictionary learning. This decomposition offers structural insights about input formats in in-context learning (especially for induction heads) and in arithmetic tasks.
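The decomposition can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes the hidden states of one layer are already available as a NumPy array of shape $(C, T, d)$, and it assumes $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are computed as means over contexts and over positions with the global mean subtracted. The function names `decompose` and `max_abs_cosine` are hypothetical, and the random tensor is only a stand-in for real embeddings, so it will not reproduce the spiral, cluster, or incoherence structure reported for trained transformers.

```python
import numpy as np


def decompose(h):
    """Split h of shape (C, T, d) into mu, pos, ctx, resid.

    Assumption: pos and ctx are centered means (global mean removed),
    so that h[c, t] = mu + pos[t] + ctx[c] + resid[c, t] holds exactly.
    """
    mu = h.mean(axis=(0, 1))        # global mean vector, shape (d,)
    pos = h.mean(axis=0) - mu       # mean across contexts, shape (T, d)
    ctx = h.mean(axis=1) - mu       # mean across positions, shape (C, d)
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]  # shape (C, T, d)
    return mu, pos, ctx, resid


def max_abs_cosine(pos, ctx):
    """Largest |cosine similarity| between any pos_t and any ctx_c.

    Values near 0 indicate that the two families are nearly orthogonal.
    """
    p = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    c = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)
    return float(np.abs(p @ c.T).max())


# Placeholder data (assumption): random values instead of real hidden states.
rng = np.random.default_rng(0)
C, T, d = 8, 16, 32
h = rng.normal(size=(C, T, d))

mu, pos, ctx, resid = decompose(h)
coherence = max_abs_cosine(pos, ctx)
```

With real hidden states, `pos` can be inspected for low-dimensional structure (for example through its leading principal components), `ctx` for topic clusters, and `coherence` for how close the two families are to orthogonal.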