We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by the perplexity scores of autoregressive language models (ALMs) and masked language models (MLMs). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that performance declines for both MLMs and ALMs relative to the original test data. However, most of the performance is retained, suggesting that the probe indeed learns syntax independently of semantics.
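As a rough illustration of the perplexity comparison described above, the following minimal sketch computes perplexity from per-token log-probabilities using the standard definition (the exponential of the negative mean log-probability). The function name `perplexity` and the numeric values are hypothetical and chosen only for illustration; they are not results from this work, and in practice the log-probabilities would come from an ALM (left-to-right conditionals) or an MLM (each token masked in turn, giving a pseudo-perplexity).

```python
import math


def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token log-probabilities (illustrative values only,
# not measurements from the paper). The nonce sentence differs from the
# original in one content word, which the model finds far less likely.
original = [math.log(0.5), math.log(0.25), math.log(0.5)]
nonce = [math.log(0.5), math.log(0.01), math.log(0.5)]

# A ratio above 1 means the nonce sentence is harder for the model to predict.
ratio = perplexity(nonce) / perplexity(original)
print(f"original: {perplexity(original):.2f}")
print(f"nonce:    {perplexity(nonce):.2f}")
print(f"ratio:    {ratio:.2f}")
```

Comparing such ratios across model families is one simple way to express how strongly each family's scores react to nonce substitutions.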