A challenge in developing NLP systems for the world's languages is understanding how they generalize to typological differences relevant to real-world applications. To this end, we propose M2C, a morphologically-aware framework for behavioral testing of NLP models. We use M2C to generate tests that probe models' behavior in light of specific linguistic features in 12 typologically diverse languages. We evaluate state-of-the-art language models on the generated tests. While models excel at most tests in English, we highlight generalization failures on specific typological characteristics such as temporal expressions in Swahili and compounding possessives in Finnish. Our findings motivate the development of models that address these blind spots.