Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future research empirically investigating ELK methods.
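To make the "gap in AUROC recovered" figure concrete, the following is a minimal sketch of how such a quantity could be computed. It rests on assumptions not stated in the abstract: the floor is taken to be the AUROC of the LM's own output in untruthful ("Bob") contexts, the ceiling is the AUROC in truthful contexts, and the recovered fraction is the probe's position between the two. The function names `auroc` and `gap_recovered` and all scores are hypothetical toy values, not results from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gap_recovered(probe_auroc, floor_auroc, ceiling_auroc):
    """Fraction of the floor-to-ceiling AUROC gap closed by the probe (assumed definition)."""
    gap = ceiling_auroc - floor_auroc
    if gap == 0:
        raise ValueError("floor and ceiling AUROC coincide")
    return (probe_auroc - floor_auroc) / gap


# Toy data (hypothetical): 1 = the statement is true, 0 = it is false.
labels = [1, 1, 1, 0, 0, 0]
untruthful_lm_scores = [0.2, 0.6, 0.3, 0.7, 0.4, 0.8]  # LM output in "Bob" contexts
probe_scores = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1]          # probe on activations, "Bob" contexts
truthful_lm_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # LM output in truthful contexts

floor = auroc(untruthful_lm_scores, labels)
probe = auroc(probe_scores, labels)
ceiling = auroc(truthful_lm_scores, labels)

print(f"{gap_recovered(probe, floor, ceiling):.3f}")  # prints 0.875
```

With these toy scores the probe closes seven eighths of the gap. A value of 1 would mean the probe matches the truthful-context performance, and a value of 0 would mean it does no better than the untruthful output.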