Large language models (LLMs) frequently hallucinate and produce factual errors, yet our understanding of why they make these errors remains limited. In this study, we delve into the underlying mechanisms of LLM hallucinations from the perspective of inner representations, and discover a salient pattern associated with hallucinations: correct generations tend to have sharper context activations in the hidden states of the in-context tokens, compared to the incorrect ones. Leveraging this insight, we propose an entropy-based metric to quantify the ``sharpness'' among the in-context hidden states and incorporate it into the decoding process to formulate a constrained decoding approach. Experiments on various knowledge-seeking and hallucination benchmarks demonstrate our approach's consistent effectiveness, for example, achieving up to an 8.6 point improvement on TruthfulQA. We believe this study can improve our understanding of hallucinations and serve as a practical solution for hallucination mitigation.
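The entropy-based "sharpness" metric described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each in-context token has already been assigned a relevance score derived from its hidden state, and the function and variable names here are hypothetical. Low entropy over the resulting distribution corresponds to a "sharper" context activation, which the abstract associates with correct generations.

```python
import numpy as np

def context_entropy(scores: np.ndarray) -> float:
    """Entropy of a softmax distribution over in-context token scores.

    scores: hypothetical per-token relevance values derived from the
    hidden states of the in-context tokens. Lower entropy = sharper.
    """
    # Numerically stable softmax over the in-context tokens.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    # Shannon entropy; the small constant guards against log(0).
    return float(-(p * np.log(p + 1e-12)).sum())

# A peaked (sharp) score profile vs. a flat one.
sharp = context_entropy(np.array([5.0, 0.1, 0.2, 0.1]))
flat = context_entropy(np.array([1.0, 1.0, 1.0, 1.0]))
assert sharp < flat  # sharper activations yield lower entropy
```

In a constrained-decoding setup of the kind the abstract describes, such an entropy value could be folded into the token scores at each decoding step, penalizing continuations whose in-context activations are diffuse; the exact combination rule is a design choice of the method itself.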