Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
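The core interface of the proposed approach can be illustrated with a minimal sketch: a function that takes the vocabulary of a new tokenizer and predicts an embedding for each token, which can then replace the frozen LM's embedding matrix. Everything below is a hypothetical, untrained stand-in (byte-level pooling plus a single linear projection; the dimensions and names are illustrative, not the paper's architecture), meant only to show the input/output contract of such a hypernetwork.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): byte-level input, model dim 16.
D = 16
byte_emb = rng.normal(size=(256, D)) * 0.02  # one vector per possible byte value
W = rng.normal(size=(D, D)) * 0.02           # toy linear layer standing in for the hypernetwork

def predict_embedding(token: str) -> np.ndarray:
    """Map one vocabulary item of a *new* tokenizer to an embedding
    for the frozen LM: pool the token's byte embeddings, then project."""
    byte_ids = list(token.encode("utf-8"))
    pooled = byte_emb[byte_ids].mean(axis=0)
    return pooled @ W

def transfer(vocab: list[str]) -> np.ndarray:
    """Build a full embedding matrix for a new tokenizer's vocabulary."""
    return np.stack([predict_embedding(t) for t in vocab])

# Zero-shot transfer to an arbitrary (here, tiny) vocabulary:
new_vocab = ["Hello", "▁world", "def", "只是"]
E = transfer(new_vocab)
print(E.shape)  # (4, 16)
```

In the actual method this predictor is trained so that the frozen LM, equipped with the predicted embeddings, performs well under many sampled tokenizers; the sketch only fixes the shape of the problem: vocabulary in, embedding matrix out.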