Large language models (LLMs) are trained on vast amounts of internet text, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model the different agents that produce the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and that share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics for which the training text was generated only by "Science", because the two share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
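To make the synthetic arithmetic setting more concrete, here is a minimal toy sketch of a corpus produced by several agents, some of which share a truthful generative process. It is an illustration under stated assumptions, not the environment used in this work: the agent names, the `op` symbol, the choice of addition as the true meaning of the operator, and the offset-based false answers are all hypothetical.

```python
import random


def true_op(x, y):
    # Assumption: the "true" meaning of the toy operator is plain addition.
    return x + y


def make_agent(truthful, offset=0):
    """Return a function that states this agent's answer for `x op y`.

    Truthful agents all share the same generative process (true_op).
    Untruthful agents each shift the answer by their own nonzero offset
    (a hypothetical stand-in for a false belief).
    """
    def agent(x, y):
        return true_op(x, y) if truthful else true_op(x, y) + offset
    return agent


def build_corpus(agents, n_per_agent, seed=0):
    """Generate (agent_name, statement) pairs from every agent."""
    rng = random.Random(seed)
    corpus = []
    for name, agent in agents.items():
        for _ in range(n_per_agent):
            x, y = rng.randint(0, 9), rng.randint(0, 9)
            corpus.append((name, f"{x} op {y} = {agent(x, y)}"))
    return corpus


def is_true(statement):
    """Check a statement against the true meaning of the operator."""
    lhs, rhs = statement.split(" = ")
    x, _, y = lhs.split()
    return int(rhs) == true_op(int(x), int(y))


# Hypothetical agents: two share the truthful process, two do not.
agents = {
    "agent_a": make_agent(truthful=True),
    "agent_b": make_agent(truthful=True),
    "agent_c": make_agent(truthful=False, offset=1),
    "agent_d": make_agent(truthful=False, offset=3),
}

corpus = build_corpus(agents, n_per_agent=5)

# Fraction of true statements per agent.
truth_rate = {
    name: sum(is_true(s) for a, s in corpus if a == name) / 5
    for name in agents
}
print(truth_rate)
```

In this sketch the truthful agents are indistinguishable in how they generate statements, which is the kind of shared structure the persona hypothesis relies on. The untruthful agents each deviate in their own way, so they have no common process to cluster around.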