Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
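As a hedged illustration of the token-level probing idea described above, the sketch below trains a simple linear probe on a hidden state taken at a specific token of the model's answer to predict whether that answer is correct. The model name, the probed layer, and the choice of the last answer token as the "exact answer" position are assumptions for illustration only, not the exact recipe used in this work.

```python
# Minimal sketch of a token-level truthfulness probe.
# Assumptions (not from the paper): model choice, probed layer, and using the
# last answer token as a stand-in for the "exact answer" token.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any causal LM works
LAYER = 16                                         # assumed middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_state_at(prompt: str, answer: str, layer: int = LAYER) -> np.ndarray:
    """Return the hidden state at the last token of `answer` appended to `prompt`."""
    text = prompt + " " + answer
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape (batch, seq, dim)
    return out.hidden_states[layer][0, -1].float().numpy()

# (question, model answer, is_correct) triples would come from an annotated QA set;
# the two examples here are placeholders.
examples = [
    ("What is the capital of France?", "Paris", 1),
    ("What is the capital of Australia?", "Sydney", 0),
]

X = np.stack([hidden_state_at(q, a) for q, a, _ in examples])
y = np.array([label for _, _, label in examples])

probe = LogisticRegression(max_iter=1000).fit(X, y)  # linear error detector
print("train accuracy:", probe.score(X, y))
```

In practice such a probe would be trained on many annotated answers from one dataset and evaluated on held-out data; as the abstract notes, probes of this kind tend not to transfer across datasets, so the split used for evaluation matters.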