One of the major factors behind the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing knowledge from applying it, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle to choose the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior: while the models may possess the relevant knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. These findings highlight the need to address entity ambiguity for more trustworthy LLMs.