Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different communicative agents, we present the persona hypothesis: LLMs can cluster agents into personas using common features of their generations. For instance, a truthful persona is a group of agents that are likely to produce truthful text and that share similar features, such as formal writing styles and scientific references. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent ``Wikipedia'' will behave truthfully on topics that were generated only by ``Science'', because both belong to the truthful persona. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
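The synthetic arithmetic setting mentioned in the abstract can be pictured with a small toy corpus. The sketch below is only an illustration under stated assumptions, not the authors' construction: the agent names (`agent_T1`, `agent_T2`, `agent_F1`), the off-by-one corruption, and all function names are hypothetical. It shows two agents that share a truthful generative process and one agent that follows a different, untruthful process.

```python
import random


def true_add(a, b):
    """Truthful generative process: ordinary addition."""
    return a + b


def corrupted_add(a, b):
    """Hypothetical untruthful process: every answer is off by one."""
    return a + b + 1


# Hypothetical agents, each mapped to the process that generates its text.
# agent_T1 and agent_T2 share the truthful process; agent_F1 does not.
AGENTS = {
    "agent_T1": true_add,
    "agent_T2": true_add,
    "agent_F1": corrupted_add,
}


def make_statement(agent, a, b):
    """Produce one labeled arithmetic statement attributed to an agent."""
    result = AGENTS[agent](a, b)
    return {
        "agent": agent,
        "a": a,
        "b": b,
        "result": result,
        "is_true": result == a + b,
    }


def build_corpus(n_per_agent=100, seed=0):
    """Generate n_per_agent statements for every agent."""
    rng = random.Random(seed)
    corpus = []
    for agent in AGENTS:
        for _ in range(n_per_agent):
            a, b = rng.randint(0, 9), rng.randint(0, 9)
            corpus.append(make_statement(agent, a, b))
    return corpus


def truthfulness_by_agent(corpus):
    """Fraction of true statements per agent."""
    totals, trues = {}, {}
    for s in corpus:
        totals[s["agent"]] = totals.get(s["agent"], 0) + 1
        trues[s["agent"]] = trues.get(s["agent"], 0) + int(s["is_true"])
    return {agent: trues[agent] / totals[agent] for agent in totals}
```

Calling `truthfulness_by_agent(build_corpus())` gives a per-agent truthfulness rate. In this toy corpus the two truthful agents are indistinguishable by their process, which is the kind of shared structure the abstract suggests a model could use to form a truthful persona.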