Relative to English, low-resource languages suffer from substantial tokenization premiums in modern LMs, meaning that it generally requires several times as many tokens to encode a sentence in a low-resource language as to encode the analogous sentence in English. This tokenization premium results in increased API and energy costs and reduced effective context windows for these languages. In this paper we analyze the tokenizers of ten popular LMs to better understand their designs and per-language tokenization premiums. We also propose a mechanism to reduce tokenization premiums in pre-trained models by post-hoc additions to the token vocabulary that coalesce multi-token characters into single tokens. We apply this methodology to 12 low-resource languages, demonstrating that the original and compressed inputs often yield similar last hidden states when run through the Llama 3.2 1B model.
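As a minimal illustration of the two ideas above (not the paper's actual implementation), the sketch below measures a tokenization premium using one-token-per-UTF-8-byte as a crude proxy for byte-fallback tokenization, then shows how coalescing each multi-byte character into a single vocabulary entry shrinks the token count. The example strings, the Ethiopic character set, and the helper names are all hypothetical.

```python
# Hypothetical sketch: measure a tokenization premium, then reduce it by
# coalescing multi-token characters into single new vocabulary entries.

def premium(tokens_lang, tokens_en):
    """Tokens needed in the target language relative to English."""
    return len(tokens_lang) / len(tokens_en)

def byte_tokenize(text):
    """Crude proxy for byte-fallback tokenization: one token per UTF-8 byte."""
    return [bytes([b]) for b in text.encode("utf-8")]

def coalesce(tokens, charset):
    """Merge the byte tokens of each character in `charset` into one token."""
    text = b"".join(tokens).decode("utf-8")
    out = []
    for ch in text:
        if ch in charset:
            out.append(ch.encode("utf-8"))   # one token per character
        else:
            out.extend(byte_tokenize(ch))    # leave other bytes untouched
    return out

en = "Hello, how are you?"
am = "ሰላም፣ እንዴት ነህ?"  # Amharic: each Ethiopic character costs 3 UTF-8 bytes

before = premium(byte_tokenize(am), byte_tokenize(en))
ethiopic = {ch for ch in am if ord(ch) >= 0x1200}  # Ethiopic block starts at U+1200
after = premium(coalesce(byte_tokenize(am), ethiopic), byte_tokenize(en))
# before > 1: the Amharic sentence pays a per-byte premium over English.
# after < before: coalesced characters now cost one token each.
```

Under this byte-level proxy the premium falls because every 3-byte Ethiopic character collapses from three tokens to one; the paper's mechanism applies the same idea to a pre-trained model's real vocabulary.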