Large Language Model Displays Emergent Ability to Interpret Novel Literary Metaphors

Recent advances in the performance of large language models (LLMs) have sparked debate over whether, given sufficient training, high-level human abilities emerge in such generic forms of artificial intelligence (AI). Despite the exceptional performance of LLMs on a wide range of tasks involving natural language processing and reasoning, there has been sharp disagreement as to whether their competence extends to more creative human capacities. A core example is the ability to interpret novel metaphors. Because LLMs are trained on enormous, uncurated text corpora, a serious obstacle to designing tests is the need to find novel yet high-quality metaphors that are unlikely to have been included in the training data. Here we assessed the ability of GPT4, a state-of-the-art large language model, to provide natural-language interpretations of novel literary metaphors drawn from Serbian poetry and translated into English. Despite exhibiting no signs of having been exposed to these metaphors previously, the AI system consistently produced detailed and incisive interpretations. Human judges, blind to the fact that an AI model was involved, rated metaphor interpretations generated by GPT4 as superior to those provided by a group of college students. In interpreting reversed metaphors, GPT4, like humans, exhibited signs of sensitivity to the Gricean cooperative principle. In addition, for several novel English poems, GPT4 produced interpretations that were rated as excellent or good by a human literary critic. These results indicate that LLMs such as GPT4 have acquired an emergent ability to interpret complex metaphors, including those embedded in novel poems.