Recent advances in large language models have made complex language tasks feasible not only in English but also in other languages. However, the tokenizers of most language models, such as Llama, are trained on English-centric corpora and tend to fragment non-English text excessively. The problem is especially pronounced in languages that do not use the Roman alphabet, where text is often split at the character or even the Unicode level, which increases the number of decoding steps and slows text generation. To address this, our study introduces a novel framework designed to accelerate text generation in these languages. The framework predicts larger linguistic units than those of conventional multilingual tokenizers and is tailored to the target language, thereby reducing the number of decoding steps required. Our empirical results demonstrate that the proposed framework increases generation speed by a factor of 1.9 compared to standard decoding while maintaining the performance of a pre-trained multilingual model on monolingual tasks.
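To make the fragmentation argument concrete, the following minimal sketch compares the number of decoding steps under two illustrative assumptions: a worst-case tokenizer that emits one token per UTF-8 byte, and a hypothetical scheme that emits one whitespace-delimited word per step. Neither is the tokenizer of any particular model nor the framework proposed here; the function names and the Korean sample phrase are chosen purely for illustration.

```python
def decoding_steps_byte_fallback(text: str) -> int:
    """Steps needed if every UTF-8 byte is its own token (assumed worst case)."""
    return len(text.encode("utf-8"))


def decoding_steps_word_units(text: str) -> int:
    """Steps needed if each whitespace-delimited word is one unit (hypothetical)."""
    return len(text.split())


def bytes_per_character(text: str) -> float:
    """Average UTF-8 bytes per non-space character."""
    chars = text.replace(" ", "")
    return len(chars.encode("utf-8")) / len(chars)


english = "language model"
korean = "언어 모델"  # "language model" in Korean, used only as sample text

for label, sample in [("English", english), ("Korean", korean)]:
    print(
        label,
        "bytes/char:", bytes_per_character(sample),
        "byte-level steps:", decoding_steps_byte_fallback(sample),
        "word-level steps:", decoding_steps_word_units(sample),
    )
```

Each Hangul syllable occupies three UTF-8 bytes, so under byte-level fallback the Korean phrase needs about three steps per character, whereas the English phrase needs one. Real subword tokenizers merge many of these bytes, so the byte-level count should be read as an upper bound on fragmentation rather than a measurement.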