Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent relationships between concepts. In this work, we study three models of increasing scale (Pythia-70M, GPT-2, and Llama 3.1 8B), focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAEs) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. We find a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Relation difficulty is consistent across models, with antonymy easiest and synonymy hardest. Probe-level causality is capacity-dependent: on Llama 3.1 8B, SAE-guided patching reliably shifts these signals, whereas on the smaller models the shifts are weak or unstable. Together, these findings clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.
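As an illustration of the layer-wise linear probing mentioned above, the following is a minimal sketch rather than the pipeline used in this work. It assumes cached hidden states are available as one matrix per layer; here they are replaced by synthetic Gaussian activations, and the names `train_linear_probe`, `synthetic_layer_activations`, and `layerwise_probe` are hypothetical. The per-layer signal strengths are chosen by hand so that the middle layer carries the strongest signal, purely to show how a layer profile would be read off; they are not results.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose thresholded logit matches the label."""
    pred = ((X @ w + b) > 0).astype(int)
    return float((pred == y).mean())


def synthetic_layer_activations(rng, n, d, signal):
    """Hypothetical stand-in for cached hidden states of one layer.

    Labels mark whether a word pair holds the relation (1) or not (0).
    The relation is written along a single direction with strength `signal`.
    """
    y = rng.integers(0, 2, size=n)
    direction = np.zeros(d)
    direction[0] = 1.0
    X = rng.normal(size=(n, d)) + signal * np.outer(2 * y - 1, direction)
    return X, y


def layerwise_probe(signals, n=400, d=16, seed=0):
    """Train one probe per layer and return held-out accuracy for each."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for signal in signals:
        X, y = synthetic_layer_activations(rng, n, d, signal)
        split = n // 2  # first half for training, second half held out
        w, b = train_linear_probe(X[:split], y[:split])
        accuracies.append(probe_accuracy(X[split:], y[split:], w, b))
    return accuracies


# Toy profile: no signal in the first layer, strong in the middle, weak in the last.
accs = layerwise_probe([0.0, 2.0, 0.5])
print(accs)
```

With real models, the synthetic generator would be replaced by activations extracted at each layer and component (for example residual stream, MLP output, attention output), and the same probe would be trained once per relation.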