Tokenization is the first step in training any Large Language Model (LLM): the text is split into a sequence of tokens drawn from the model's fixed vocabulary. Tokenization in LLMs differs from traditional tokenization in NLP, where text is split into a sequence of "natural" words. In LLMs, a natural word may be broken into multiple tokens because of the model's limited vocabulary size (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text under a specific LLM, indicating how "bad" the tokenization is. We establish the statistical significance of our hypothesis on multiple NLP tasks across a set of different LLMs.
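To make the idea concrete, here is a minimal sketch of one plausible penalty function: the fraction of natural words that a model's tokenizer breaks into more than one token. This definition, the `word_split_penalty` name, and the use of a Hugging Face tokenizer are illustrative assumptions, not the paper's exact formulation.

```python
import re
from transformers import AutoTokenizer

def word_split_penalty(text: str, tokenizer) -> float:
    """Illustrative penalty (an assumption, not the paper's definition):
    the fraction of natural words split into more than one token."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    # Tokenizing words in isolation may differ slightly from in-context
    # tokenization (e.g., leading-space handling in SentencePiece models).
    split = sum(1 for w in words if len(tokenizer.tokenize(w)) > 1)
    return split / len(words)

if __name__ == "__main__":
    # Model name is illustrative; any Hugging Face tokenizer works here.
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    # "martial" -> "mart" + "ial" under Mistral's tokenizer, so this
    # sentence incurs a nonzero penalty.
    print(word_split_penalty("The martial artist bowed.", tok))
```

Under this definition, a penalty of 0 means every natural word maps to a single token, and higher values indicate progressively worse tokenizations for that model.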