Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding of the extent to which these models internalize syntactic knowledge, and various datasets have therefore recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, these sentences are manually extracted from linguistics textbooks, handbooks, and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrate that several models can surpass human performance on the in-domain data, while no model exceeds human performance on the out-of-domain data. Error analyses by linguistic phenomenon further reveal that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.