Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as getting at a model's "knowledge" or "beliefs". We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probes' predictions can be described as conditional on the preceding (related) sentences. Specifically, we quantify the responsiveness of the probes to the presence of supporting and contradicting sentences, and of their negations, and score the probes on their consistency. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these belief directions influences the position of the hypothesis along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the types of errors depend on the layer, the model (and model type), and the kind of data. Finally, our results suggest that belief directions are among the causal mediators in the inference process that incorporates in-context information.
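To make the notions of a "belief direction" and of "moving a representation along it" concrete, the following is a minimal sketch on synthetic data. It is not the procedure used in the experiments: the difference-of-means probe, the helper names (`fit_direction`, `probe_score`, `shift_along`), the dimensionality, and the shift size `alpha` are all illustrative assumptions, and random vectors stand in for LLM hidden states.

```python
import numpy as np


def fit_direction(acts, labels):
    """Unit-norm difference-of-means direction between true and false examples.

    Assumption: a difference-of-means probe stands in for whichever probing
    method is actually used; it is chosen here only for simplicity.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)


def probe_score(acts, direction):
    """Position of each representation along the probe direction."""
    return np.asarray(acts, dtype=float) @ direction


def shift_along(acts, direction, alpha):
    """Move each representation by alpha units along the (unit) direction."""
    return np.asarray(acts, dtype=float) + alpha * direction


# Synthetic stand-in for hidden states: true and false statements are
# separated along one hidden axis (hypothetical data, not model activations).
rng = np.random.default_rng(0)
dim = 16
true_axis = np.zeros(dim)
true_axis[0] = 1.0
labels = rng.integers(0, 2, size=200).astype(bool)
acts = rng.normal(size=(200, dim)) + np.where(labels[:, None], 2.0, -2.0) * true_axis

direction = fit_direction(acts, labels)
scores = probe_score(acts, direction)

# Classify by thresholding halfway between the two class means.
threshold = 0.5 * (scores[labels].mean() + scores[~labels].mean())
accuracy = ((scores > threshold) == labels).mean()

# Shifting along a unit direction changes the probe score by exactly alpha.
shifted_scores = probe_score(shift_along(acts, direction, 3.0), direction)
```

The sketch covers only the geometric step. In a causal intervention of the kind described above, the shift would be applied to the premise's representation inside the model, and the effect would be read off the hypothesis's representation after the forward pass continues.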