We systematically investigate a widely asked question: do LLMs really understand what they say? This question connects to the more familiar term "stochastic parrot." To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the use of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task yet can describe and recognize the same concepts well in natural language; (3) our task challenges LLMs due to its intrinsic difficulty rather than the unfamiliar grid format, as in-context learning and fine-tuning on same-formatted data add little to their performance.