Tokenisation is a core part of language models (LMs). It splits a character sequence into subwords, each of which is assigned an arbitrary index before being fed to the LM. Although this process is typically lossless, it may make LM training less sample-efficient: because it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as "now" and "Now". We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if it could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
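To make the two experimental manipulations concrete, the following is a minimal sketch and not the authors' implementation. It assumes that duplication maps each token id `x` to either `x` or `x + vocab_size`, chosen at random, and that near duplicates are identified by case-folding alone. The function names, the `p_duplicate` parameter, the offset scheme, and the case-folding criterion are all hypothetical choices made for illustration.

```python
import random


def duplicate_tokens(token_ids, vocab_size, p_duplicate=0.5, seed=0):
    """Replace each token id x with its duplicate x + vocab_size with
    probability p_duplicate (assumed scheme; doubles the vocabulary)."""
    rng = random.Random(seed)
    return [
        x + vocab_size if rng.random() < p_duplicate else x
        for x in token_ids
    ]


def merge_near_duplicates(vocab):
    """Map every subword id to the lowest id among subwords that share
    the same case-folded form (assumed near-duplicate criterion)."""
    canonical = {}  # case-folded subword -> representative id
    mapping = {}    # original id -> representative id
    for subword, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        key = subword.lower()
        canonical.setdefault(key, idx)
        mapping[idx] = canonical[key]
    return mapping


if __name__ == "__main__":
    vocab = {"now": 0, "Now": 1, "the": 2}

    # Fully duplicated setting: ids 0..2 gain perfect copies 3..5.
    duplicated = duplicate_tokens([0, 1, 2, 0], vocab_size=len(vocab))
    # Undoing the duplication recovers the original sequence exactly.
    print([x % len(vocab) for x in duplicated])  # [0, 1, 2, 0]

    # Merging: "Now" collapses onto "now"; "the" is left unchanged.
    print(merge_near_duplicates(vocab))  # {0: 0, 1: 0, 2: 2}
```

Under these assumptions, the duplicated setting is lossless by construction, since taking each id modulo `vocab_size` restores the original sequence. Merging is lossy: once "Now" maps to the id of "now", the distinction cannot be recovered from the token ids alone.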