Large language models (LLMs) frequently hallucinate and produce factual errors, yet our understanding of why they make these errors remains limited. In this study, we examine the underlying mechanisms of LLM hallucinations from the perspective of inner representations and identify a salient pattern associated with hallucinations: correct generations tend to have sharper context activations in the hidden states of the in-context tokens than incorrect ones. Leveraging this insight, we propose an entropy-based metric to quantify the ``sharpness'' among the in-context hidden states and incorporate it into the decoding process to formulate a constrained decoding approach. Experiments on various knowledge-seeking and hallucination benchmarks demonstrate the consistent effectiveness of our approach, for example achieving up to an 8.6-point improvement on TruthfulQA. We believe this study improves our understanding of hallucinations and can serve as a practical solution for hallucination mitigation.
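To make the idea concrete, below is a minimal sketch of one way such an entropy-based sharpness score could be computed and folded into a decoding step. It assumes the score for a candidate token is the entropy of the (normalized) activations that candidate receives from the in-context hidden states via the unembedding matrix, with lower entropy meaning sharper; the function names, the `alpha` weight, and the `top_k` restriction are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def sharpness_entropy(hidden_states, unembedding, candidate_ids):
    """Entropy of each candidate token's activation over the in-context hidden states.

    hidden_states : [T, d]  hidden states of the T in-context tokens (one layer)
    unembedding   : [V, d]  output/unembedding matrix of the language model
    candidate_ids : [C]     candidate next-token ids to score

    Returns a [C] tensor; lower entropy = "sharper" (activation concentrated
    on a few context tokens), the pattern associated above with correct generations.
    """
    # Project every in-context hidden state into the vocabulary space and keep
    # the probability each context position assigns to each candidate token.
    logits = hidden_states @ unembedding.T                  # [T, V]
    probs = logits.softmax(dim=-1)[:, candidate_ids]        # [T, C]

    # Normalize over context positions so each candidate gets a distribution
    # over the T in-context tokens, then take that distribution's entropy.
    over_context = probs / probs.sum(dim=0, keepdim=True)   # [T, C]
    entropy = -(over_context * (over_context + 1e-12).log()).sum(dim=0)  # [C]
    return entropy


def constrained_decode_step(lm_logits, hidden_states, unembedding,
                            alpha=0.5, top_k=10):
    """One greedy decoding step that down-weights high-entropy (less sharp) candidates."""
    # Restrict to the top-k plausible tokens, then penalize each by its entropy.
    top_logits, top_ids = lm_logits.topk(top_k)
    entropy = sharpness_entropy(hidden_states, unembedding, top_ids)
    adjusted = top_logits.log_softmax(dim=-1) - alpha * entropy
    return top_ids[adjusted.argmax()]
```

In practice, `hidden_states` would be taken from an intermediate layer of the model (e.g., via `output_hidden_states=True` in a Hugging Face forward pass), and which layer and `alpha` to use are hyperparameters of the approach rather than fixed choices implied by this sketch.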