Recent advances in language model interpretability have identified circuits, critical subnetworks that replicate model behaviors, yet how knowledge is structured within these subnetworks remains opaque. To understand the knowledge encoded in circuits, we conduct systematic knowledge editing experiments on circuits of the GPT-2 language model. Our analysis reveals distinctive patterns in how circuits respond to editing attempts, the extent to which knowledge is distributed across network components, and the architectural composition of knowledge-bearing circuits. These findings illuminate the complex relationship between model circuits and knowledge representation, deepening our understanding of how information is organized within language models. They also offer novel insights into the ``meanings'' of circuits and suggest directions for further interpretability and safety research on language models.