Large language models (LLMs) show amazing proficiency and fluency in the use of language. Does this mean that they have also acquired insightful linguistic knowledge about the language, to the extent that they can serve as an "expert linguistic annotator"? In this paper, we examine the successes and limitations of the GPT-3, ChatGPT, and GPT-4 models in the analysis of sentence meaning structure, focusing on the Abstract Meaning Representation (AMR; Banarescu et al. 2013) parsing formalism, which provides rich graphical representations of sentence meaning structure while abstracting away from surface forms. We compare models' analysis of this semantic structure across two settings: 1) direct production of AMR parses based on zero- and few-shot prompts, and 2) indirect partial reconstruction of AMR via metalinguistic natural language queries (e.g., "Identify the primary event of this sentence, and the predicate corresponding to that event."). Across these settings, we find that models can reliably reproduce the basic format of AMR, and can often capture core event, argument, and modifier structure. However, model outputs are prone to frequent and major errors, and holistic analysis of parse acceptability shows that even with few-shot demonstrations, models have virtually 0% success in producing fully accurate parses. Eliciting natural language responses produces similar patterns of errors. Overall, our findings indicate that these models out-of-the-box can capture aspects of semantic structure, but there remain key limitations in their ability to support fully accurate semantic analyses or parses.
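To make the first setting concrete, the following is a minimal sketch of how a zero- or few-shot AMR prompt might be assembled and how a returned parse could be checked for basic well-formedness. It is an illustration under stated assumptions, not the prompts or evaluation code used in the paper: the instruction wording and the helper names `build_prompt`, `is_balanced`, and `root_concept` are hypothetical, and the single demonstration is the standard introductory AMR example for "The boy wants to go."

```python
import re

# Demonstration pair: the standard introductory AMR example.
# Which demonstrations the paper actually used is not stated in the abstract.
DEMOS = [
    ("The boy wants to go.",
     "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))"),
]


def build_prompt(sentence, demos=()):
    """Assemble a zero-shot (no demos) or few-shot prompt as plain text.

    The instruction wording is an assumption made for this sketch.
    """
    parts = ["Provide the AMR parse of the sentence."]
    for text, amr in demos:
        parts.append(f"Sentence: {text}\nAMR: {amr}")
    parts.append(f"Sentence: {sentence}\nAMR:")
    return "\n\n".join(parts)


def is_balanced(amr):
    """Check that parentheses are balanced (ignores quoted string literals)."""
    depth = 0
    for ch in amr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def root_concept(amr):
    """Return the concept of the top node, e.g. 'want-01', or None."""
    match = re.match(r"\s*\(\s*(\S+)\s*/\s*([^\s()]+)", amr)
    return match.group(2) if match else None
```

A check like `is_balanced` only tests the "basic format" that the abstract says models reproduce reliably; it says nothing about whether the parse is semantically correct. `root_concept` corresponds loosely to the metalinguistic query about the primary event and its predicate.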