Personas as a Way to Model Truthfulness in Language Models

Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different communicative agents, we present the persona hypothesis: LLMs can cluster agents into personas using common features of their generations. For instance, a truthful persona is a group of agents that are likely to produce truthful text and that share similar features, such as formal writing styles and scientific references. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they both belong to the truthful persona. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
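To make the synthetic arithmetic setting more concrete, here is a minimal toy sketch in Python. It is not the authors' actual data-generation procedure: the agent names, the `make_agent` and `generate_corpus` helpers, and the choice of a fixed per-agent offset as the source of error are all assumptions made purely for illustration. It only shows the idea that truthful agents share one generative process (correct addition), while each untruthful agent follows its own idiosyncratic one.

```python
import random


def true_add(x, y):
    """The shared truthful generative process: ordinary addition."""
    return x + y


def make_agent(truthful, rng):
    """Return a hypothetical agent: a function (x, y) -> claimed sum."""
    if truthful:
        # Every truthful agent uses the same process.
        return true_add
    # Assumption for illustration: each untruthful agent is wrong in its
    # own fixed way, here a constant non-zero offset.
    offset = rng.randint(1, 9)
    return lambda x, y: x + y + offset


def generate_corpus(n_truthful, n_untruthful, n_statements, seed=0):
    """Build a list of labeled arithmetic statements, one agent at a time."""
    rng = random.Random(seed)
    agents = {}
    for i in range(n_truthful):
        agents[f"truthful_{i}"] = make_agent(True, rng)
    for i in range(n_untruthful):
        agents[f"untruthful_{i}"] = make_agent(False, rng)

    corpus = []
    for name, agent in agents.items():
        for _ in range(n_statements):
            x, y = rng.randint(0, 99), rng.randint(0, 99)
            claimed = agent(x, y)
            corpus.append(
                {
                    "agent": name,
                    "text": f"{x} + {y} = {claimed}",
                    "is_true": claimed == x + y,
                }
            )
    return corpus


if __name__ == "__main__":
    for row in generate_corpus(2, 2, 2):
        print(row["agent"], "|", row["text"], "|", row["is_true"])
```

In this toy corpus, a statement's truth value is fully determined by which group its agent belongs to, which is the kind of shared structure the abstract argues a model needs in order to form a truthful persona.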

