Large language models (LLMs) have achieved remarkable success in natural language processing tasks, yet their internal knowledge structures remain poorly understood. This study examines these structures through the lens of historical Olympic medal tallies, evaluating LLMs on two tasks: (1) retrieving the medal count for a given team and (2) identifying the ranking of each team. While state-of-the-art LLMs excel at recalling medal counts, they struggle to provide rankings, highlighting a key difference between their knowledge organization and human reasoning. These findings shed light on the limitations of LLMs' internal knowledge integration and suggest directions for improvement. To facilitate further research, we release our code, dataset, and model outputs.
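To make the two evaluation tasks concrete, the sketch below shows one way such queries could be posed to an LLM. This is a minimal illustration, not the paper's released evaluation code; the function names and prompt wordings are assumptions for clarity.

```python
# Illustrative sketch of the two task formats described in the abstract.
# The prompt templates below are hypothetical examples, not the paper's
# actual evaluation prompts.

def medal_count_prompt(team: str, year: int) -> str:
    """Task 1: retrieve a specific team's medal count."""
    return (f"How many gold, silver, and bronze medals did {team} "
            f"win at the {year} Summer Olympics?")

def ranking_prompt(team: str, year: int) -> str:
    """Task 2: identify a team's ranking in the medal tally."""
    return (f"What was {team}'s rank in the medal table of the "
            f"{year} Summer Olympics?")

if __name__ == "__main__":
    # Task 1 probes direct fact recall; Task 2 requires the model to
    # relate that fact to all other teams' tallies.
    print(medal_count_prompt("the United States", 2016))
    print(ranking_prompt("the United States", 2016))
```

The contrast between the two prompts mirrors the paper's finding: the first can be answered from an isolated fact, while the second requires integrating knowledge across many teams, which is where the evaluated models reportedly struggle.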