Large Language Models (LLMs) are reported to perform strongly on natural language processing tasks. However, performance metrics such as accuracy do not measure how robustly a model represents complex linguistic structure. In this paper, focusing on the ability of language models to represent syntax, we propose a framework to assess the consistency and robustness of linguistic representations. To this end, we introduce robustness measures for neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks, such as syntax reconstruction and root identification, that extract meaningful information about a single facet of a language model. Empirically, we evaluate four LLMs across six corpora on the proposed measures, analysing their performance and robustness under syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet both are equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models and logs available to the community as a contribution to the debate about the capabilities of LLMs.
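To make the idea of measuring robustness under syntax-preserving perturbations concrete, the following is a minimal illustrative sketch. The abstract does not define the proposed measures, so the functions `accuracy`, `consistency`, and `robust_accuracy` and the toy predictions are hypothetical and are not the paper's definitions. The sketch assumes a root-identification probe that outputs one root index per sentence, and a perturbation that leaves the gold root unchanged.

```python
from typing import Sequence


def accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of sentences whose predicted root index equals the reference."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def consistency(pred_orig: Sequence[int], pred_pert: Sequence[int]) -> float:
    """Fraction of sentences whose prediction is unchanged by the perturbation.

    Hypothetical measure: it ignores whether the prediction is correct.
    """
    return accuracy(pred_pert, pred_orig)


def robust_accuracy(
    pred_orig: Sequence[int], pred_pert: Sequence[int], gold: Sequence[int]
) -> float:
    """Fraction of sentences predicted correctly both before and after perturbation."""
    if not (len(pred_orig) == len(pred_pert) == len(gold)) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(o == g and p == g for o, p, g in zip(pred_orig, pred_pert, gold))
    return hits / len(gold)


# Toy data (invented for illustration): gold root index per sentence, and the
# probe's predictions on the original and on the perturbed sentences.
gold = [1, 0, 2, 3]
pred_orig = [1, 0, 2, 1]
pred_pert = [1, 2, 2, 1]

acc_orig = accuracy(pred_orig, gold)
acc_pert = accuracy(pred_pert, gold)
cons = consistency(pred_orig, pred_pert)
rob = robust_accuracy(pred_orig, pred_pert, gold)
print(acc_orig, acc_pert, cons, rob)
```

In this toy setting, a gap between accuracy on the original sentences and accuracy that survives perturbation is the kind of brittleness the abstract refers to; the measures actually proposed in the paper may differ.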