Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous tokens, so tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as the set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, e.g., in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. This framework allows language models with different tokenizations to cooperate with each other efficiently by reducing them to their maximal common vocabulary. Specifically, we empirically demonstrate its applicability to the ensemble of models with different tokenizations.
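To illustrate the kind of cooperation the abstract refers to, the sketch below shows next-token ensembling once two models operate over a shared vocabulary. This is a minimal toy illustration, not the paper's reduction procedure: the vocabulary, distributions, and weights are hypothetical placeholders, and we assume both models have already been reduced to the common vocabulary.

```python
# Toy sketch: ensembling next-token distributions over a shared (reduced)
# vocabulary. All values here are illustrative placeholders.

common_vocab = ["the", "a", "cat", "dog"]

# Hypothetical next-token distributions from two models, assumed to have
# already been reduced to the common vocabulary (each sums to 1).
p_model_a = {"the": 0.5, "a": 0.2, "cat": 0.2, "dog": 0.1}
p_model_b = {"the": 0.3, "a": 0.1, "cat": 0.5, "dog": 0.1}

def ensemble(dists, weights):
    """Weighted average of next-token distributions over a common vocabulary."""
    combined = {tok: sum(w * d[tok] for d, w in zip(dists, weights))
                for tok in common_vocab}
    total = sum(combined.values())  # renormalize for numerical safety
    return {tok: p / total for tok, p in combined.items()}

p_ens = ensemble([p_model_a, p_model_b], [0.5, 0.5])
next_token = max(p_ens, key=p_ens.get)
```

Because both distributions are defined over the same token set, the ensemble is a simple convex combination; without a common vocabulary, the two models' next-token distributions would not even share a support.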