Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future empirical research on ELK methods.
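To make the probing setup concrete, the following is a minimal sketch (not the paper's actual code or datasets) of the general technique named above: fitting a logistic-regression probe on contrast-pair activation differences and scoring it with AUROC. All data here is synthetic; the simulated "truth direction" and activation shapes are illustrative assumptions.

```python
# Hedged sketch: logistic regression on contrast pairs, scored by AUROC.
# Synthetic stand-in for real LM activations; w_true is a hypothetical
# "truth direction" injected into the data so the probe has signal to find.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 64, 500                       # hidden size, number of statements
w_true = rng.normal(size=d)          # assumed latent "truth direction"

labels = rng.integers(0, 2, size=n)  # 1 = statement is actually true
# A contrast pair holds activations for the "... is true" vs "... is false"
# completions of the same statement; here we simulate them so that their
# difference correlates with the ground-truth label.
acts_pos = rng.normal(size=(n, d)) + 0.5 * np.outer(labels, w_true)
acts_neg = rng.normal(size=(n, d)) - 0.5 * np.outer(labels, w_true)
features = acts_pos - acts_neg       # contrast-pair difference features

# Train the probe on the first 400 statements, evaluate on the rest.
probe = LogisticRegression(max_iter=1000).fit(features[:400], labels[:400])
scores = probe.predict_proba(features[400:])[:, 1]
print(f"held-out AUROC: {roc_auc_score(labels[400:], scores):.2f}")
```

In the actual experiments the features would come from a chosen middle layer of the quirky LM, and the train/test split would separate easy from hard questions to measure easy-to-hard generalization.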