Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach trains a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, a generic token set can be inefficient when the model is applied to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus under a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines with the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to vocabulary fine-tuning, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.
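The frequency-based replacement idea can be sketched as follows. This is an illustrative toy, not the paper's actual algorithm: it scores base tokens by their usage under greedy longest-match tokenization of the adaptation corpus, then swaps the least-used tokens for frequent character n-grams from that corpus, keeping the vocabulary size fixed. The function name and all details are hypothetical.

```python
from collections import Counter

def adapt_vocab(base_vocab, adapt_corpus, n_swap):
    """Toy sketch of frequency-based vocabulary adaptation:
    drop the n_swap base tokens least used on the adaptation
    corpus and add the n_swap most frequent new substrings."""
    # Count how often each base token fires under greedy
    # longest-match tokenization of the adaptation corpus.
    usage = Counter()
    for word in adapt_corpus.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                # Fall back to a single character if no longer match exists.
                if word[i:j] in base_vocab or j == i + 1:
                    usage[word[i:j]] += 1
                    i = j
                    break
    # Candidate replacements: frequent substrings (length 2-6)
    # of the adaptation corpus that are not yet in the vocabulary.
    cands = Counter()
    for word in adapt_corpus.split():
        for n in range(2, 7):
            for i in range(len(word) - n + 1):
                piece = word[i:i + n]
                if piece not in base_vocab:
                    cands[piece] += 1
    # Keep the most-used base tokens, then fill the freed slots
    # with the highest-frequency candidates (size stays constant).
    keep = sorted(base_vocab, key=lambda t: usage[t], reverse=True)
    new_vocab = set(keep[:len(base_vocab) - n_swap])
    new_vocab.update(p for p, _ in cands.most_common(n_swap))
    return new_vocab
```

In practice the paper's method would also have to re-derive merge rules so the adapted vocabulary remains a valid BPE tokenizer; this sketch only illustrates the selection step.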