Large Language Models (LLMs) have been reported to perform strongly on natural language processing tasks. However, performance metrics such as accuracy do not measure how robustly a model represents complex linguistic structure. In this paper, focusing on the ability of language models to represent syntax, we propose a framework to assess the consistency and robustness of linguistic representations. To this end, we introduce robustness measures for neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks, such as syntax reconstruction and root identification, used to extract meaningful information about a single facet of a language model. Empirically, we evaluate four LLMs across six different corpora on the proposed measures, analysing their performance and robustness with respect to syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models and logs available to the community as a contribution to the debate about the capabilities of LLMs.