We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by the perplexity scores of autoregressive language models (ALMs) and masked language models (MLMs). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that performance declines for both MLMs and ALMs relative to the original test data. However, most of the performance is retained, suggesting that the probe indeed learns syntax independently of semantics.
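As a rough illustration of what a nonce treebank involves, the following Python sketch swaps content words for other words that share the same part-of-speech tag and morphological features, while leaving the dependency annotations untouched. It is not the SPUD implementation: SPUD additionally respects syntactic argument structure and applies language-specific rules, neither of which is modeled here. The simplified token layout and the names `CONTENT_POS`, `build_pool`, and `make_nonce` are assumptions made for this example only.

```python
import random
from collections import defaultdict

# Assumed token layout (a simplified CoNLL-U row):
# (form, upos, feats, head, deprel)
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumption: only these are replaced


def build_pool(treebank):
    """Group content-word forms by (UPOS, FEATS) so swaps preserve morphology."""
    pool = defaultdict(set)
    for sentence in treebank:
        for form, upos, feats, _head, _deprel in sentence:
            if upos in CONTENT_POS:
                pool[(upos, feats)].add(form)
    # Sort for reproducible sampling with a seeded random generator.
    return {key: sorted(forms) for key, forms in pool.items()}


def make_nonce(sentence, pool, rng):
    """Replace each content word with a different word of the same tag and features."""
    nonce = []
    for form, upos, feats, head, deprel in sentence:
        candidates = [w for w in pool.get((upos, feats), []) if w != form]
        if upos in CONTENT_POS and candidates:
            form = rng.choice(candidates)
        # Tag, features, head, and relation are copied over unchanged.
        nonce.append((form, upos, feats, head, deprel))
    return nonce


# Toy two-sentence "treebank" (hypothetical data, not taken from UD).
s1 = [
    ("The", "DET", "Definite=Def", 2, "det"),
    ("dog", "NOUN", "Number=Sing", 3, "nsubj"),
    ("chased", "VERB", "Tense=Past", 0, "root"),
    ("a", "DET", "Definite=Ind", 5, "det"),
    ("cat", "NOUN", "Number=Sing", 3, "obj"),
]
s2 = [
    ("A", "DET", "Definite=Ind", 2, "det"),
    ("student", "NOUN", "Number=Sing", 3, "nsubj"),
    ("read", "VERB", "Tense=Past", 0, "root"),
    ("the", "DET", "Definite=Def", 5, "det"),
    ("book", "NOUN", "Number=Sing", 3, "obj"),
]

pool = build_pool([s1, s2])
nonce_s1 = make_nonce(s1, pool, random.Random(0))
print(" ".join(token[0] for token in nonce_s1))
```

Because only the word forms change, the nonce sentence can be scored or probed against the same gold dependency tree as the original, which is the property both use cases above rely on.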