In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate LLMs' linguistic knowledge of prominent grammatical phenomena in Quebec French. QFrBLiMP comprises 1,761 minimal pairs annotated with 20 linguistic phenomena (LPs). These minimal pairs were created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by 12 native speakers of Quebec French, who select the sentence of the two that they consider grammatical; these annotations allow us to compare the competence of LLMs with that of humans. We evaluate several LLMs on QFrBLiMP and MultiBLiMP-Fr by measuring, for each category, how often a model assigns a higher probability to the grammatical sentence of a minimal pair. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges: all benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation. Finally, our statistical analysis comparing QFrBLiMP and MultiBLiMP-Fr reveals a significant performance degradation on Quebec French for most models; the most capable models, however, remain within the statistical significance interval, demonstrating cross-dialectal robustness.
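The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code: given each minimal pair's log-probabilities under a model (obtained elsewhere, e.g. by summing token log-probabilities), it computes the per-category rate at which the grammatical sentence receives the higher score. The function name and the example data are hypothetical.

```python
from collections import defaultdict

def minimal_pair_accuracy(scored_pairs):
    """Per-category accuracy: the fraction of minimal pairs in which
    the model assigns a higher log-probability to the grammatical
    sentence than to its ungrammatical counterpart."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for logp_grammatical, logp_ungrammatical, category in scored_pairs:
        total[category] += 1
        if logp_grammatical > logp_ungrammatical:
            correct[category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical scores for three minimal pairs from two categories.
pairs = [
    (-12.3, -15.8, "agreement"),  # model prefers the grammatical sentence
    (-20.1, -18.4, "semantics"),  # model prefers the ungrammatical one
    (-9.7, -11.2, "agreement"),
]
print(minimal_pair_accuracy(pairs))  # {'agreement': 1.0, 'semantics': 0.0}
```

Chance performance under this forced-choice scheme is 50%, so per-category accuracies are read against that baseline.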