Tokenisation is a core part of language models (LMs). It splits a character sequence into subwords, which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may make LM training less sample-efficient: by removing character-level information, it can make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near-duplicate subwords on LM training efficiency. First, we design an experiment that gives an upper bound on how much we should expect a model to improve if it could generalise perfectly across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfect equivalence classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in this fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
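The fully duplicated setting described above can be sketched in a few lines: every subword gets an interchangeable twin, and each token in the training stream is randomly mapped to one of the two copies. The function names and the 50/50 sampling rate below are illustrative assumptions, not details taken from the paper.

```python
import random

def duplicate_vocabulary(vocab):
    # Each subword s gets a twin copy; the pair forms a perfect
    # equivalence class, since both ids denote the same surface string.
    return vocab + [s + "#dup" for s in vocab]

def randomise_tokens(token_ids, vocab_size, rng):
    # Replace each token id with its duplicate (id + vocab_size)
    # with probability 0.5, so both copies appear equally often and the
    # model must learn to treat them as equivalent.
    return [t + vocab_size if rng.random() < 0.5 else t
            for t in token_ids]

rng = random.Random(0)
vocab = ["now", "Now", "the"]          # toy vocabulary
full_vocab = duplicate_vocabulary(vocab)
ids = [0, 2, 1, 2]                     # a toy tokenised sequence
noisy_ids = randomise_tokens(ids, len(vocab), rng)
```

Each id in `noisy_ids` is congruent to the original id modulo the base vocabulary size, so the original sequence is always recoverable, mirroring the idea that duplication removes no information while still splitting each subword's training signal across two indices.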