Large Language Models (LLMs) have been reported to perform strongly on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of a model in terms of its ability to robustly represent complex linguistic structure. In this work, we propose a framework and a measure of robustness to assess the consistency of linguistic representations under syntax-preserving perturbations. We leverage recent advances in extracting linguistic constructs from LLMs to test the robustness of such structures. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving manipulations. Because emergent syntactic representations in neural networks are brittle, our work draws attention to the risk of comparing such structures to those that are the object of a long-standing debate in linguistics.