One of the major factors contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. In this paper, we focus on entity type ambiguity and analyze current state-of-the-art LLMs for their proficiency and consistency in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 entities. Our experiments reveal that LLMs perform poorly with ambiguous prompts, achieving only 80% accuracy. Our results further demonstrate systematic discrepancies in LLM behavior and their failure to consistently apply information: the models can exhibit knowledge without being able to utilize it, show significant biases toward preferred readings, and display self-inconsistencies. Our study highlights the importance of handling entity ambiguity in the future for more trustworthy LLMs.
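As a rough illustration of the "knowing vs. applying" distinction described above, the sketch below probes each reading of an ambiguous entity twice: once with an explicitly disambiguated prompt (a knowledge check) and once with an ambiguous prompt (an application check). All names here, including `query_llm`, the prompt templates, and the example entity, are hypothetical; the abstract does not specify the paper's exact protocol, so this is only a minimal sketch of the idea.

```python
# Minimal sketch of a "knowing vs. applying" evaluation under entity type
# ambiguity. Every name (query_llm, prompt templates, the example entity)
# is a hypothetical stand-in, not the paper's actual protocol.

from dataclasses import dataclass


@dataclass
class AmbiguousEntity:
    name: str              # surface form shared by multiple entity types
    types: dict[str, str]  # type label -> a fact that identifies that reading


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError


def evaluate(entity: AmbiguousEntity) -> dict[str, dict[str, bool]]:
    """For each reading, check (1) whether the model knows the fact when the
    entity type is made explicit and (2) whether it applies the same fact
    when the prompt leaves the type ambiguous."""
    results = {}
    for type_label, fact in entity.types.items():
        # Knowledge check: the prompt disambiguates the entity type,
        # so success only requires possessing the fact.
        knows = fact.lower() in query_llm(
            f"Considering {entity.name} as a {type_label}, "
            f"what is it best known for?"
        ).lower()
        # Application check: the prompt is ambiguous, so success also
        # requires resolving the ambiguity toward this reading.
        applies = fact.lower() in query_llm(
            f"What is {entity.name} best known for?"
        ).lower()
        results[type_label] = {"knows": knows, "applies": applies}
    return results


# Hypothetical usage: "Paris" as a city vs. a mythological figure.
paris = AmbiguousEntity(
    name="Paris",
    types={
        "city": "capital of france",
        "greek mythological figure": "prince of troy",
    },
)
```

Comparing the two flags per reading separates the failure modes the abstract reports: `knows=True, applies=False` indicates knowledge that is not utilized, while a model that always answers with the same reading under ambiguous prompts exhibits a bias toward a preferred reading.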