Language modeling is a fundamental task in natural language processing, and it has been thoroughly explored across various architectures and hyperparameters. However, few studies focus on how sub-word segmentation affects the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE against models trained with two unsupervised algorithms for morphological segmentation, Morfessor and StateMorph. We train the models for several languages, including ones with very rich morphology, and compare their performance across segmentation algorithms, vocabulary sizes, and model sizes. The results show that training with morphological segmentation allows the LMs to (1) achieve lower perplexity, (2) converge more efficiently in terms of training time, and (3) achieve equivalent or better evaluation scores on downstream tasks. Lastly, we show that (4) smaller LMs using morphological segmentation can perform comparably to larger models trained with BPE, in terms of both perplexity (1) and scores on downstream tasks (3). Points (2) and (4) bear on the sustainability of LMs, since they reduce model cost in size and computation time: (2) reduces cost only in the training phase, whereas (4) also reduces it in the inference phase.