We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as Constraint Satisfaction Problems and use this framework to investigate how the model interacts internally with factual constraints. Specifically, we find a strong positive relation between the model's attention to constraint tokens and the factual accuracy of its responses. Using our curated suite of 11 datasets with over 40,000 prompts, we study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method that probes self-attention patterns to predict constraint satisfaction and factual errors, and that allows errors to be identified early. The approach and findings demonstrate how a mechanistic understanding of factuality in LLMs can enhance reliability.
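To make the idea of probing self-attention concrete, the following is a minimal sketch and not the SAT Probe implementation itself. It assumes, purely for illustration, that each feature is the attention mass a head places on the constraint tokens, and that a plain logistic-regression probe is trained on those features. The function names, the toy attention maps, and the boost given to "correct" examples are all hypothetical; the synthetic data builds in the positive relation described above by construction, so the resulting accuracy says nothing about real models.

```python
import math
import random


def constraint_attention_features(attn, constraint_positions):
    """Flatten attention into one feature per (layer, head).

    attn[layer][head] is a list of attention weights over key positions
    (assumed here to come from a single query token). Each feature is the
    total weight placed on the constraint token positions.
    """
    return [sum(head[p] for p in constraint_positions)
            for layer in attn for head in layer]


def synthetic_attention(rng, correct, constraint_positions,
                        n_layers=2, n_heads=2, seq_len=6):
    """Hypothetical toy data: 'correct' examples attend more to constraints."""
    attn = []
    for _ in range(n_layers):
        layer = []
        for _ in range(n_heads):
            raw = [rng.random() for _ in range(seq_len)]
            if correct:
                for p in constraint_positions:
                    raw[p] += 2.0  # illustrative boost, not a measured effect
            total = sum(raw)
            layer.append([r / total for r in raw])  # normalise to sum to 1
        attn.append(layer)
    return attn


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability that the constraint is satisfied, under the linear probe."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic_probe(X, y, lr=1.0, epochs=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = predict_proba(w, b, x) - t
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def run_demo(seed=0, n_train=200, n_test=100):
    """Train on synthetic examples and return held-out accuracy."""
    rng = random.Random(seed)
    positions = (1, 2)  # hypothetical constraint token positions

    def make(n):
        X, y = [], []
        for i in range(n):
            label = i % 2  # 1 = factually correct, 0 = factual error
            attn = synthetic_attention(rng, bool(label), positions)
            X.append(constraint_attention_features(attn, positions))
            y.append(label)
        return X, y

    X_train, y_train = make(n_train)
    X_test, y_test = make(n_test)
    w, b = train_logistic_probe(X_train, y_train)
    hits = sum((predict_proba(w, b, x) >= 0.5) == bool(t)
               for x, t in zip(X_test, y_test))
    return hits / n_test


if __name__ == "__main__":
    print(f"held-out accuracy on synthetic data: {run_demo():.2f}")
```

With a real model, the toy generator would be replaced by attention maps extracted from the network (for example, Hugging Face Transformers models can return them when called with `output_attentions=True`), and the constraint positions would come from locating the constraint tokens in each prompt. Restricting the features to a subset of layers is one way such a probe could be evaluated before generation completes, though the layer choice here is left open.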