Transformers are widely used to extract semantic meaning from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of the hidden states (or embeddings) of trained transformers into interpretable components. For any layer, the embedding vectors of input sequence samples are represented by a tensor $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$. Given the embedding vector $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions, respectively, and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, we find empirically that three mathematical structures are pervasive: (1) $(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral shape across layers, (2) $(\mathbf{ctx}_c)_c$ shows a clear cluster structure that aligns with context topics, and (3) $(\mathbf{pos}_t)_{t}$ and $(\mathbf{ctx}_c)_c$ are nearly orthogonal to each other. We argue that smoothness is pervasive and beneficial to transformers trained on language, and that our decomposition leads to improved model interpretability.
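To make the decomposition concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes that $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the means across contexts and across positions after subtracting $\boldsymbol{\mu}$, which is the reading under which the four terms sum exactly to $\boldsymbol{h}_{c,t}$. The helper names `mean_vector`, `subtract`, and `decompose` are hypothetical, and the random tensor is only a stand-in for real hidden states.

```python
import random


def mean_vector(vectors):
    """Elementwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(u, v):
    """Elementwise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def decompose(h):
    """Decompose h[c][t] (a list of d floats) into mean effects.

    Returns (mu, pos, ctx, resid) such that, up to rounding,
    h[c][t] == mu + pos[t] + ctx[c] + resid[c][t].
    """
    C, T = len(h), len(h[0])
    # Global mean over all contexts and positions.
    mu = mean_vector([h[c][t] for c in range(C) for t in range(T)])
    # Positional part: mean over contexts at each position, centered by mu
    # (the centering is an assumption of this sketch).
    pos = [subtract(mean_vector([h[c][t] for c in range(C)]), mu) for t in range(T)]
    # Contextual part: mean over positions within each context, centered by mu.
    ctx = [subtract(mean_vector(h[c]), mu) for c in range(C)]
    # Residual: whatever the three mean effects do not explain.
    resid = [
        [
            [x - m - p - k for x, m, p, k in zip(h[c][t], mu, pos[t], ctx[c])]
            for t in range(T)
        ]
        for c in range(C)
    ]
    return mu, pos, ctx, resid


# Toy stand-in for one layer's hidden states: C=4 contexts, T=5 positions, d=3.
rng = random.Random(0)
h = [[[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)] for _ in range(4)]
mu, pos, ctx, resid = decompose(h)
```

Under this centering, the positional vectors sum to zero over $t$ and the contextual vectors sum to zero over $c$, so the reconstruction is exact by construction. The structural findings (low-dimensional spiral, topic clusters, near-orthogonality) are empirical properties of trained models and would not be expected from the random toy input used here.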