Large language models (LLMs) have made significant progress across a range of natural language generation and understanding tasks. However, their linguistic generalization capabilities remain in question, raising doubts about whether these models learn language the way humans do. Humans exhibit compositional generalization and linguistic creativity in language use, yet the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization, particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While the models identify individual morphological combinations better than chance, their performance lacks systematicity, leaving a significant accuracy gap relative to humans.