Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Although this is unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited from a model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being trained with truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features, and that these agents form a (un)truthful persona. By training on this data, LMs can infer and represent the persona in their activation space. This representation allows the model to separate truth from falsehoods and controls the truthfulness of its generation. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that structures in the pretraining data are crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
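To make observation (1) concrete, the following is a minimal sketch of what a linear probe on activations could look like. It is not the paper's experimental setup: the activations here are synthetic Gaussian vectors with a planted "truthful" direction, standing in for hidden states that would be read from a model before it generates an answer. All names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, shift=2.0, seed=0):
    """Create stand-in 'activations' (an assumption, not real model states).

    Each vector is Gaussian noise shifted along one hidden direction,
    with the sign of the shift set by a binary truthful/untruthful label.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    signs = 2 * labels - 1  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, dim)) + shift * np.outer(signs, direction)
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = probs - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    preds = (x @ w + b > 0).astype(int)
    return float((preds == y).mean())


acts, labels = make_synthetic_activations()
split = 300  # first 300 examples train the probe, the rest evaluate it
w, b = train_linear_probe(acts[:split], labels[:split])
acc = probe_accuracy(acts[split:], labels[split:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

In an actual experiment, `acts` would be replaced by hidden states extracted at the question tokens and `labels` by whether the subsequently generated answer was judged truthful. A held-out accuracy well above chance would then indicate that truthfulness is linearly decodable before generation.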