We present an approach for assessing how multilingual large language models (LLMs) learn syntax across multiple syntactic formalisms. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and evaluate them on 13 diverse Universal Dependencies (UD) treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) the occurrence of a language in the pre-training data is more important than the amount of task data when recovering syntax from the word vectors.
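To make the idea of casting parsing as sequence labeling concrete, the following is a minimal sketch of one simple dependency encoding, in which each word receives a single label made of the relative offset to its head and its relation. This is an illustrative assumption, not necessarily one of the encodings studied here; the names `encode` and `decode` and the example sentence are hypothetical.

```python
from typing import List, Tuple

# One label per word: (relative offset to the head, dependency relation).
Label = Tuple[int, str]


def encode(heads: List[int], rels: List[str]) -> List[Label]:
    """Turn a dependency tree into a sequence of per-word labels.

    heads[i] is the 1-based position of the head of word i+1 (0 = root).
    """
    return [
        (head - position, rel)
        for position, (head, rel) in enumerate(zip(heads, rels), start=1)
    ]


def decode(labels: List[Label]) -> Tuple[List[int], List[str]]:
    """Recover head positions and relations from the label sequence."""
    heads = [
        position + offset
        for position, (offset, _) in enumerate(labels, start=1)
    ]
    rels = [rel for _, rel in labels]
    return heads, rels


# "She reads books": "reads" is the root; the other two words attach to it.
heads = [2, 0, 2]
rels = ["nsubj", "root", "obj"]
labels = encode(heads, rels)
```

Under an encoding of this kind, a tagger that predicts one label per word from its word vector implicitly predicts a full tree, so recovering syntax reduces to measuring how well such labels can be predicted.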