Personas as a Way to Model Truthfulness in Language Models

Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model the different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and that share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
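
The synthetic arithmetic setting mentioned above can be pictured with a small toy corpus. The sketch below is an illustration only, not the authors' actual data pipeline: the agent names, the use of addition, and the per-agent error offsets are all assumptions introduced here. It shows the one property the abstract relies on, namely that truthful agents share a single generative process (real addition), while each untruthful agent follows its own systematic error.

```python
import random


def truthful_answer(x, y):
    """Shared generative process: every truthful agent uses real addition."""
    return x + y


def make_untruthful_answer(offset):
    """Build a hypothetical untruthful agent with its own systematic error."""
    def answer(x, y):
        return x + y + offset
    return answer


def build_agents():
    # Agent names and offsets are illustrative assumptions, not from the paper.
    return {
        "truthful_A": truthful_answer,
        "truthful_B": truthful_answer,
        "untruthful_C": make_untruthful_answer(1),
        "untruthful_D": make_untruthful_answer(-3),
    }


def generate_corpus(agents, n_per_agent=5, seed=0):
    """Emit one record per statement, tagged with its agent and truth value."""
    rng = random.Random(seed)
    corpus = []
    for name, answer in agents.items():
        for _ in range(n_per_agent):
            x, y = rng.randint(0, 9), rng.randint(0, 9)
            z = answer(x, y)
            corpus.append({
                "agent": name,
                "text": f"{x} + {y} = {z}",
                "is_true": z == x + y,
            })
    return corpus


corpus = generate_corpus(build_agents())
for record in corpus[:3]:
    print(record["agent"], "|", record["text"], "|", record["is_true"])
```

In a corpus like this, a model that captures the shared process behind `truthful_A` and `truthful_B` could, in principle, predict that one of them answers correctly on operands only the other has covered, which is the kind of cross-agent generalization the abstract describes.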

