Discovering Latent Knowledge in Language Models Without Supervision

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
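
The abstract describes the method only at a high level, so the following is a minimal sketch of how a direction in activation space could be scored for negation consistency without labels. Everything in it is an assumption made for illustration rather than the paper's implementation: the function name `consistency_loss`, the sigmoid probe, the extra confidence term (added here to rule out the trivial answer of 0.5 everywhere), and the synthetic activations that stand in for real model activations. In practice an optimizer would search for the direction that minimizes this loss; the sketch only compares two fixed candidates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def probe_probs(theta, bias, acts_pos, acts_neg):
    """Probability the probe assigns to 'true' for each half of a contrast pair."""
    return sigmoid(acts_pos @ theta + bias), sigmoid(acts_neg @ theta + bias)

def consistency_loss(theta, bias, acts_pos, acts_neg):
    """Label-free score for a candidate direction (hypothetical helper).

    acts_pos[i] / acts_neg[i] are activations for question i with the
    answer "yes" / "no" appended.
    """
    p_pos, p_neg = probe_probs(theta, bias, acts_pos, acts_neg)
    # A statement and its negation should get opposite truth values.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Assumed extra term: penalize the degenerate case p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# Synthetic stand-in for model activations (assumption): one hidden
# "truth" direction `u` plus small isotropic noise.
rng = np.random.default_rng(0)
n, d = 200, 16
u = rng.normal(size=d)
u /= np.linalg.norm(u)
y = rng.integers(0, 2, size=n)           # ground truth, used only for evaluation
sign = (2 * y - 1)[:, None]
acts_pos = 0.1 * rng.normal(size=(n, d)) + sign * u
acts_neg = 0.1 * rng.normal(size=(n, d)) - sign * u

# Candidate 1: aligned with the hidden truth direction.
theta_truth = 4.0 * u
# Candidate 2: a random direction orthogonal to it.
v = rng.normal(size=d)
v -= (v @ u) * u
v /= np.linalg.norm(v)
theta_random = 4.0 * v

loss_truth = consistency_loss(theta_truth, 0.0, acts_pos, acts_neg)
loss_random = consistency_loss(theta_random, 0.0, acts_pos, acts_neg)

# Answer each question by averaging the two views of the pair.
p_pos, p_neg = probe_probs(theta_truth, 0.0, acts_pos, acts_neg)
pred = (0.5 * (p_pos + (1.0 - p_neg)) > 0.5).astype(int)
acc = float(np.mean(pred == y))
# The loss is symmetric under swapping "true" and "false", so the sign of
# the direction is not identified; report the better of the two readings.
acc = max(acc, 1.0 - acc)

print(f"loss (truth direction):  {loss_truth:.4f}")
print(f"loss (random direction): {loss_random:.4f}")
print(f"accuracy up to sign:     {acc:.3f}")
```

On this toy data the direction aligned with the planted truth signal should score a much lower loss than the orthogonal one, which is the property an unsupervised search would exploit. Real activations would also need preprocessing, for example normalizing each half of the contrast pair separately, which the sketch omits.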

