Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial: some authors have pointed out that these probes fail to generalize in basic ways, and have raised other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: (1) visualizations of LLM true/false statement representations, which reveal clear linear structure; (2) transfer experiments in which probes trained on one dataset generalize to different datasets; and (3) causal evidence obtained by surgically intervening in an LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
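As a rough illustration of the probing and transfer setup described above, the following sketch builds a mass-mean-style probe on synthetic data. It assumes, as the name suggests, that the probe direction is the difference between the mean activation of true statements and the mean activation of false statements; the abstract itself does not spell this out. The activations are random vectors standing in for LLM hidden states, and every name here (`make_synthetic`, `mass_mean_direction`, `probe_predict`, the dimension, the offsets) is hypothetical and chosen only for illustration.

```python
import numpy as np


def make_synthetic(rng, n, dim, truth_dir, offset):
    """Synthetic stand-in for LLM activations (an assumption, not real data).

    True statements are shifted along +truth_dir, false ones along -truth_dir,
    on top of Gaussian noise and a dataset-specific offset.
    """
    labels = rng.integers(0, 2, size=n).astype(bool)
    signs = np.where(labels, 1.0, -1.0)
    acts = rng.normal(size=(n, dim)) + offset + 3.0 * signs[:, None] * truth_dir
    return acts, labels


def mass_mean_direction(acts, labels):
    """Difference between the class means: mean(true) - mean(false)."""
    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    return mu_true - mu_false, (mu_true + mu_false) / 2.0


def probe_predict(acts, direction, midpoint):
    """Label a statement true if it lies on the 'true' side of the midpoint."""
    return (acts - midpoint) @ direction > 0


rng = np.random.default_rng(0)
dim = 16
truth_dir = np.zeros(dim)
truth_dir[0] = 1.0  # hypothetical "truth direction" shared by both datasets

# Dataset A is used to fit the probe. Dataset B has a nuisance offset along
# another axis and is used only to test transfer.
offset_b = np.zeros(dim)
offset_b[1] = 2.0
acts_a, labels_a = make_synthetic(rng, 500, dim, truth_dir, np.zeros(dim))
acts_b, labels_b = make_synthetic(rng, 500, dim, truth_dir, offset_b)

direction, midpoint = mass_mean_direction(acts_a, labels_a)
cos = direction @ truth_dir / np.linalg.norm(direction)
transfer_acc = np.mean(probe_predict(acts_b, direction, midpoint) == labels_b)

print(f"cosine with planted direction: {cos:.3f}")
print(f"transfer accuracy on dataset B: {transfer_acc:.3f}")
```

Because the synthetic data plants a single linear direction by construction, the high transfer accuracy here only demonstrates the mechanics of the probe; it says nothing about whether real LLM activations have this structure, which is the question the paper's experiments address.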