The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, which limits our understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find the atoms of LLMs, we leverage atom identifiability under TSAEs and show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, which statistically satisfy the criteria of ideal atoms. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code is available at https://github.com/ChenhuiHu/towards_atoms.
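To make the faithfulness criterion concrete, the following minimal sketch runs a toy threshold-activated autoencoder and scores its reconstruction with $R^2$. It rests on assumptions not taken from the paper: the activation is assumed to keep pre-activations above a fixed threshold and zero the rest, $R^2$ is computed as the ordinary Euclidean coefficient of determination rather than under the AIP, and every name (`threshold_activation`, `tsae_forward`, `r_squared`, `theta`) is hypothetical. The stability measure $q^*$ is not reproduced here.

```python
import numpy as np


def threshold_activation(z, theta):
    """Assumed activation: keep entries strictly above theta, zero the rest."""
    return np.where(z > theta, z, 0.0)


def tsae_forward(x, W_enc, b_enc, W_dec, theta):
    """Encode, apply the threshold, and decode (hypothetical toy TSAE)."""
    z = x @ W_enc + b_enc              # pre-activations, shape (n, n_atoms)
    a = threshold_activation(z, theta)  # sparse codes
    x_hat = a @ W_dec                  # reconstruction, shape (n, d)
    return a, x_hat


def r_squared(x, x_hat):
    """Euclidean coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((x - x_hat) ** 2)
    ss_tot = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy data: sparse non-negative codes over an orthonormal dictionary Q.
rng = np.random.default_rng(0)
n, d = 200, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # rows of Q act as toy "atoms"
mask = rng.random((n, d)) < 0.25               # roughly 25% of codes active
a_true = mask * rng.uniform(1.0, 2.0, size=(n, d))
x = a_true @ Q

# With W_enc = Q.T the pre-activations equal the true codes, so a threshold
# of 0.5 separates active codes (>= 1) from inactive ones (about 0).
a, x_hat = tsae_forward(x, Q.T, np.zeros(d), Q, theta=0.5)
r2 = r_squared(x, x_hat)
print(f"R^2 = {r2:.6f}")  # close to 1 in this idealized setting
```

In this idealized setting the dictionary is orthonormal and known, so the reconstruction is essentially exact. Real LLM activations would require a learned dictionary, and the paper's AIP-based definition of faithfulness may differ from the Euclidean form used here.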