Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucination, which undermines their reliability and utility in downstream tasks. We hypothesize that the factuality of a generated text span is correlated with its representational instability across the model's internal layers. Based on this hypothesis, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We introduce two metrics that quantify this instability in the middle layers and use them to penalize outputs that exhibit high internal confusion, thereby steering the model toward more internally consistent and factually grounded outputs. We further propose a self-information-gated variant, CoCoA-SIG, which dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question answering, summarization, and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
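To make the decoding idea concrete, the following is a minimal, hypothetical sketch of a CoCoA-style penalty. It assumes access to per-layer next-token logits (e.g., via a logit-lens projection of middle-layer hidden states); it measures cross-layer disagreement as a Jensen-Shannon-style divergence, and the CoCoA-SIG gate scales the penalty by each candidate token's self-information. All function names, the specific confusion metric, and the parameters `alpha` and `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_confusion(middle_layer_logits):
    """Instability proxy (hypothetical): mean KL divergence of each
    middle layer's next-token distribution from the layers' average
    distribution, i.e. a Jensen-Shannon-style disagreement score."""
    probs = softmax(np.stack(middle_layer_logits))   # (num_layers, vocab)
    mean_p = probs.mean(axis=0)
    kl = (probs * (np.log(probs + 1e-12)
                   - np.log(mean_p + 1e-12))).sum(axis=-1)
    return kl.mean()

def cocoa_scores(final_logits, middle_layer_logits,
                 alpha=1.0, sig_gate=True, tau=3.0):
    """Penalize candidate tokens when middle layers disagree.
    With sig_gate=True (the CoCoA-SIG variant), the penalty is scaled
    per token by a sigmoid gate on self-information -log p(token), so
    only high-surprise tokens are strongly penalized.
    alpha and tau are illustrative hyperparameters."""
    p_final = softmax(final_logits)
    confusion = layer_confusion(middle_layer_logits)   # scalar >= 0
    surprise = -np.log(p_final + 1e-12)                # per-token self-information
    gate = 1.0 / (1.0 + np.exp(-(surprise - tau))) if sig_gate else 1.0
    return final_logits - alpha * confusion * gate
```

When the middle layers agree, the confusion term vanishes and decoding reduces to ordinary greedy/sampling scores; when they disagree, high-surprise candidates are down-weighted, which is the intended steering effect.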