We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge: a multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, whereas a monolingual model with morpheme-aware segmentation performs well. For all models, performance improves on more synthetic datasets.