Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, even when the network's overt output is false or misleading. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models that are LoRA fine-tuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We demonstrate that simple probing methods can elicit the model's latent knowledge of the correct answer in these contexts, even for problems harder than those the probe was trained on. This is enabled by context-independent knowledge representations located in middle-layer activations. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 94% AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models and facilitate future empirical research on ELK methods.
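The probing claim in the abstract can be illustrated with a toy sketch. It is not the authors' code or data: the activations are synthetic, and the assumption that one coordinate encodes the truth independently of the persona is built in by hand to mimic the "context-independent representation" the abstract describes. All names (`fake_activation`, `train_probe`, `DIM`) are hypothetical, and the "quirky" output is simplified to the negated true label.

```python
import math
import random

random.seed(0)

DIM = 8  # hypothetical activation width, far smaller than a real model's


def fake_activation(truth, bob):
    """Synthetic stand-in for a middle-layer activation vector.

    Assumption: coordinate 0 encodes the true answer regardless of persona,
    and coordinate 1 shifts when "Bob" is in the prompt.
    """
    vec = [random.gauss(0.0, 0.3) for _ in range(DIM)]
    vec[0] += 1.0 if truth else -1.0
    vec[1] += 1.0 if bob else 0.0
    return vec


def make_split(n, bob):
    """Sample n examples with random true labels for one persona."""
    truths = [random.random() < 0.5 for _ in range(n)]
    acts = [fake_activation(t, bob) for t in truths]
    return acts, truths


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(acts, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain SGD."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            z = max(-30.0, min(30.0, logit(w, b, x)))  # clip for numerical safety
            grad = 1.0 / (1.0 + math.exp(-z)) - (1.0 if y else 0.0)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def agreement(w, b, acts, targets):
    """Fraction of examples where the probe's prediction equals the target."""
    hits = sum((logit(w, b, x) > 0.0) == t for x, t in zip(acts, targets))
    return hits / len(acts)


# Train only on contexts without "Bob", where output and truth coincide.
train_acts, train_truths = make_split(200, bob=False)
w, b = train_probe(train_acts, train_truths)

# Evaluate on "Bob" contexts, where the toy model's output is the wrong answer.
bob_acts, bob_truths = make_split(200, bob=True)
quirky_outputs = [not t for t in bob_truths]

truth_acc = agreement(w, b, bob_acts, bob_truths)
quirky_agreement = agreement(w, b, bob_acts, quirky_outputs)
print(f"probe vs. truth: {truth_acc:.2f}, probe vs. quirky output: {quirky_agreement:.2f}")
```

In this toy setting the probe transfers because the truth coordinate is untouched by the persona shift. Whether real activations behave this way is an empirical question of the kind the datasets are meant to test.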