How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.