Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and carefully curating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.
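The minimal-pair evaluation the abstract describes reduces to a simple comparison: a model is credited on a pair when it assigns higher probability to the grammatical sentence than to its ungrammatical counterpart. A minimal sketch of this scoring scheme is below; the function name and the toy per-token log-probabilities are illustrative assumptions, not taken from RuBLiMP, and in practice the scores would come from an actual language model.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the model prefers the grammatical sentence.

    pairs: list of (good_logprobs, bad_logprobs), where each element is a
    list of per-token log-probabilities the model assigned to that sentence.
    The sentence-level score is the sum of token log-probabilities.
    """
    correct = sum(1 for good, bad in pairs if sum(good) > sum(bad))
    return correct / len(pairs)


# Toy example with made-up token log-probabilities (assumption: any
# autoregressive or masked LM can supply such per-token scores).
pairs = [
    ([-1.2, -0.8, -2.1], [-1.5, -3.0, -2.4]),  # grammatical scored higher
    ([-2.0, -2.5], [-1.0, -1.1]),              # ungrammatical scored higher
]
print(minimal_pair_accuracy(pairs))  # → 0.5
```

Benchmark-level results like those reported for the 25 evaluated models are then per-phenomenon averages of this accuracy.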