Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.