Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less sample-efficient LM training: because it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near-duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if it could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfect equivalence classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in this fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we find that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
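The fully duplicated setting can be pictured as follows. This is a minimal sketch under one assumption about the construction (each subword id t is split into two duplicate ids, 2t and 2t + 1, with one copy chosen uniformly at random per occurrence); the function name `duplicate_encoding` is illustrative, not from the paper.

```python
import random

def duplicate_encoding(token_ids, rng=None):
    """Re-encode a tokenised sequence in a fully duplicated vocabulary.

    Assumed construction: every subword id t gets two perfectly
    equivalent duplicate ids, 2t and 2t + 1, and each occurrence is
    mapped to one of them uniformly at random. The vocabulary size
    doubles, but no information is added or removed.
    """
    rng = rng or random.Random(0)
    return [2 * t + rng.randint(0, 1) for t in token_ids]
```

Because the two copies of a subword are distributionally identical, any performance gap between this setting and the original one measures how much the LM fails to generalise across the duplicates.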