Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different communicative agents, we present the persona hypothesis: LLMs can cluster agents into personas using common features of their generations. For instance, a truthful persona is a group of agents that are likely to produce truthful text and that share similar features, such as formal writing styles and scientific references. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that appear in the training data only in text generated by "Science", because both belong to the truthful persona. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.