The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes affect number agreement in Spanish plurals. We find that morphologically aligned tokenization performs similarly to other tokenization schemes, even when it is artificially induced for words that would not be tokenized that way during training. We then present exploratory analyses showing that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular from plural nouns. Our results suggest that morphologically aligned tokenization is a viable approach and that existing models already generalize some morphological patterns to new items. However, they also indicate that morphological tokenization is not strictly required for performance.