"You Are An Expert Linguistic Annotator": Limits of LLMs as Analyzers of Abstract Meaning Representation

Large language models (LLMs) show remarkable proficiency and fluency in the use of language. Does this mean that they have also acquired insightful linguistic knowledge about the language, to the extent that they can serve as an "expert linguistic annotator"? In this paper, we examine the successes and limitations of the GPT-3, ChatGPT, and GPT-4 models in the analysis of sentence meaning structure, focusing on the Abstract Meaning Representation (AMR; Banarescu et al. 2013) formalism, which provides rich graph-based representations of sentence meaning structure while abstracting away from surface forms. We compare models' analysis of this semantic structure across two settings: 1) direct production of AMR parses based on zero- and few-shot prompts, and 2) indirect partial reconstruction of AMR via metalinguistic natural language queries (e.g., "Identify the primary event of this sentence, and the predicate corresponding to that event."). Across these settings, we find that models can reliably reproduce the basic format of AMR, and can often capture core event, argument, and modifier structure. However, model outputs are prone to frequent and major errors, and holistic analysis of parse acceptability shows that even with few-shot demonstrations, models have virtually 0% success in producing fully accurate parses. Eliciting natural language responses produces similar patterns of errors. Overall, our findings indicate that these models out-of-the-box can capture aspects of semantic structure, but there remain key limitations in their ability to support fully accurate semantic analyses or parses.
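To make the two settings concrete, the following is a minimal Python sketch, not the prompts or evaluation code used in the paper. The prompt layout, the function names (`build_direct_prompt`, `build_metalinguistic_prompt`, `has_basic_amr_format`, `root_concept`), and the use of the title phrase as an instruction line are all assumptions made for illustration. The example AMR is the standard textbook parse of "The boy wants to go." in PENMAN notation.

```python
import re

# Standard illustrative AMR (PENMAN notation) for "The boy wants to go."
EXAMPLE_AMR = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))"

# Hypothetical instruction line; the wording is borrowed from the paper's title.
INSTRUCTION = "You are an expert linguistic annotator."

def build_direct_prompt(sentence, demos=()):
    """Setting 1: request an AMR parse directly.

    demos is a sequence of (sentence, amr) pairs; an empty sequence
    gives a zero-shot prompt, a non-empty one a few-shot prompt.
    """
    parts = [INSTRUCTION]
    for demo_sentence, demo_amr in demos:
        parts.append(f"Sentence: {demo_sentence}\nAMR: {demo_amr}")
    parts.append(f"Sentence: {sentence}\nAMR:")
    return "\n\n".join(parts)

def build_metalinguistic_prompt(sentence):
    """Setting 2: ask a natural language question about meaning structure."""
    query = ("Identify the primary event of this sentence, "
             "and the predicate corresponding to that event.")
    return f"{INSTRUCTION}\n\nSentence: {sentence}\n{query}"

def has_basic_amr_format(text):
    """Crude surface check: balanced parentheses and a '(var / concept' root.

    This tests format only; it says nothing about semantic accuracy.
    """
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    has_root = re.match(r"\(\s*\w+\s*/\s*\S+", text.strip()) is not None
    return depth == 0 and has_root

def root_concept(text):
    """Return the concept at the root of the graph, or None if absent."""
    match = re.match(r"\(\s*\w+\s*/\s*([^\s()]+)", text.strip())
    return match.group(1) if match else None
```

A check like `has_basic_amr_format` only verifies surface format. An output can pass it and still contain major errors, which is why format and full parse accuracy are assessed separately.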
