Personas as a Way to Model Truthfulness in Language Models

Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different communicative agents, we present the persona hypothesis: LLMs can cluster agents into personas using common features of their generations. For instance, a truthful persona is a group of agents that are likely to produce truthful text and that share similar features like formal writing styles and scientific references. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they both belong to the truthful persona. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
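To make the synthetic arithmetic setting more concrete, the following is a minimal toy sketch of a corpus in which some agents share a truthful generative process and others do not. It is an illustration under stated assumptions, not the authors' actual setup: the agent names, the restriction to addition, and the way untruthful answers are corrupted are all hypothetical.

```python
import random

# Hypothetical agents. True means the agent follows the shared truthful
# generative process; False means it produces corrupted answers.
AGENTS = {"agent_a": True, "agent_b": True, "agent_c": False}


def true_answer(x, y):
    """The ground-truth operation (plain addition, chosen for illustration)."""
    return x + y


def agent_statement(agent, x, y, rng):
    """Return (text, is_true) for one arithmetic statement by `agent`."""
    correct = true_answer(x, y)
    if AGENTS[agent]:
        z = correct
    else:
        # Assumption: untruthful agents are off by a small nonzero offset.
        z = correct + rng.choice([-2, -1, 1, 2])
    return f"{agent}: {x} + {y} = {z}", z == correct


def make_corpus(n, seed=0):
    """Sample n labeled statements from randomly chosen agents."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n):
        agent = rng.choice(sorted(AGENTS))
        x, y = rng.randint(0, 9), rng.randint(0, 9)
        corpus.append(agent_statement(agent, x, y, rng))
    return corpus


if __name__ == "__main__":
    for text, is_true in make_corpus(5):
        print(text, "| true" if is_true else "| false")
```

In a corpus like this, the truthful agents are tied together only by the fact that their statements agree with the same underlying operation, which is the kind of shared structure the persona hypothesis suggests a model could exploit.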

