While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even when the correct answer is encoded in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream. Our results offer a geometric, interpretable account of the knowledge-prediction gap in LLMs, and KAPPA effectively reduces the gap across diverse MCQ benchmarks and models while generalizing to free-form settings.
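As a rough illustration of the kind of subspace alignment the abstract describes, the sketch below projects a residual-stream vector onto a hypothetical knowledge subspace and rewrites its prediction-subspace component from those coordinates. Everything here is assumed for illustration: the rank `k`, the randomly initialized bases (which in practice would be estimated from hidden states, e.g. via linear probes), and the coordinate-transfer rule itself. It is a minimal sketch of the general idea, not the paper's actual KAPPA algorithm.

```python
# Illustrative sketch: aligning a "prediction" subspace with a "knowledge"
# subspace in a residual-stream vector. All names, the rank k, and the
# alignment rule are hypothetical; the paper's KAPPA procedure may differ.
import torch

torch.manual_seed(0)
d_model, k = 64, 4  # hidden size and subspace rank (assumed values)

# Stand-ins for subspace bases. In practice these would be estimated from
# hidden states (e.g. probe weights for knowledge, answer-token readout
# directions for prediction); here they are random orthonormal frames.
knowledge_basis = torch.linalg.qr(torch.randn(d_model, k)).Q   # (d, k)
prediction_basis = torch.linalg.qr(torch.randn(d_model, k)).Q  # (d, k)

def project(h, basis):
    """Orthogonal projection of h onto span(basis)."""
    return (h @ basis) @ basis.T

def kappa_like_intervention(h, alpha=1.0):
    """Rewrite the prediction-subspace component of a residual-stream
    vector using the vector's coordinates in the knowledge subspace.

    This coordinate-transfer rule is one simple instantiation; a learned
    map between the two subspaces would be a natural alternative.
    """
    h_know = h @ knowledge_basis            # knowledge coordinates, (k,)
    h_pred = project(h, prediction_basis)   # current prediction component
    # Remove the old prediction-subspace component and blend in the
    # knowledge coordinates re-expressed in the prediction basis.
    return h - alpha * h_pred + alpha * (h_know @ prediction_basis.T)

# Toy usage on a single residual-stream vector.
h = torch.randn(d_model)
h_new = kappa_like_intervention(h)
# With alpha=1, the new prediction-subspace component carries exactly the
# knowledge coordinates:
print(torch.allclose(project(h_new, prediction_basis),
                     (h @ knowledge_basis) @ prediction_basis.T))
```

An intervention of this shape is cheap at inference time: it adds one rank-`k` projection per hooked layer and leaves all model weights untouched, which matches the abstract's description of KAPPA as a lightweight inference-time method.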