Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
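To make the consistency idea concrete, the following is a minimal sketch of one way such a direction could be found. It is not the authors' implementation, and the abstract does not specify the objective. The sketch assumes a linear probe with a sigmoid output, a loss that penalizes a statement and its negation for receiving probabilities that do not sum to one, and a second term that discourages the degenerate answer of 0.5 for both. Synthetic vectors stand in for model activations, and all function names (`fit_probe`, `make_synthetic_pairs`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def center(rows):
    # Subtract the per-dimension mean so a constant "yes"/"no" phrasing
    # offset cannot be exploited by the probe (an assumption of this sketch).
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [[r[j] - mean[j] for j in range(dim)] for r in rows]


def probe(w, b, x):
    # Probability that the statement with activation x is true.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def loss_and_grad(w, b, pos, neg):
    n, dim = len(pos), len(w)
    loss, gw, gb = 0.0, [0.0] * dim, 0.0
    for xp, xn in zip(pos, neg):
        p, q = probe(w, b, xp), probe(w, b, xn)
        # Consistency term: p(statement) should equal 1 - p(negation).
        # Confidence term: rules out the trivial solution p = q = 0.5.
        loss += (p + q - 1.0) ** 2 + min(p, q) ** 2
        dp = 2.0 * (p + q - 1.0) + (2.0 * p if p < q else 0.0)
        dq = 2.0 * (p + q - 1.0) + (2.0 * q if q <= p else 0.0)
        sp, sq = dp * p * (1.0 - p), dq * q * (1.0 - q)  # chain rule through sigmoid
        for j in range(dim):
            gw[j] += sp * xp[j] + sq * xn[j]
        gb += sp + sq
    return loss / n, [g / n for g in gw], gb / n


def fit_probe(pos, neg, steps=300, lr=0.5, seed=0):
    # Plain full-batch gradient descent; no labels are used anywhere.
    rng = random.Random(seed)
    pos, neg = center(pos), center(neg)
    w = [rng.gauss(0.0, 0.01) for _ in pos[0]]
    b, history = 0.0, []
    for _ in range(steps):
        loss, gw, gb = loss_and_grad(w, b, pos, neg)
        history.append(loss)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b, history


def predict(w, b, pos, neg):
    # Average the two views of each question: p(statement) and 1 - p(negation).
    pos, neg = center(pos), center(neg)
    return [0.5 * (probe(w, b, xp) + 1.0 - probe(w, b, xn)) > 0.5
            for xp, xn in zip(pos, neg)]


def make_synthetic_pairs(n=100, dim=8, noise=0.3, seed=1):
    # Toy stand-in for activations: dimension 0 carries the hidden truth value,
    # dimension 1 carries a constant offset for the "yes" vs "no" phrasing.
    rng = random.Random(seed)
    pos, neg, labels = [], [], []
    for _ in range(n):
        y = rng.random() < 0.5
        s = 1.0 if y else -1.0
        xp = [rng.gauss(0.0, noise) for _ in range(dim)]
        xn = [rng.gauss(0.0, noise) for _ in range(dim)]
        xp[0] += s
        xn[0] -= s
        xp[1] += 2.0
        xn[1] -= 2.0
        pos.append(xp)
        neg.append(xn)
        labels.append(y)
    return pos, neg, labels


if __name__ == "__main__":
    pos, neg, labels = make_synthetic_pairs()
    w, b, history = fit_probe(pos, neg)
    preds = predict(w, b, pos, neg)
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    # The objective cannot tell "true" from "false", so the direction is
    # recovered only up to sign; report the better of the two orientations.
    print("final loss:", history[-1], "accuracy up to sign:", max(acc, 1.0 - acc))
```

Because no labels enter the objective, a probe fit this way identifies the truth direction only up to a sign flip. The sketch resolves that ambiguity by reporting the better of the two orientations, which is a convenience of the toy setup rather than something the abstract describes.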