Modern large language models use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English language models, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
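The core of the approach is a learned translation between the target and source token vocabularies. The abstract does not spell out the construction, but a standard ingredient behind such translations is the Sinkhorn normalization, which turns an arbitrary score matrix into an (approximately) doubly stochastic soft alignment. Below is a minimal sketch of plain (non-sparse) Sinkhorn iteration; the matrix shape, iteration count, and use of random logits are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Project a score matrix onto (approximately) doubly stochastic
    matrices by alternating row and column normalization in log space."""
    log_p = logits.copy()
    for _ in range(n_iters):
        # Row normalize: each target token's translation weights sum to 1.
        log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        # Column normalize: each source token receives unit total mass.
        log_p -= np.logaddexp.reduce(log_p, axis=0, keepdims=True)
    return np.exp(log_p)

# Toy example: soft alignment between 4 target and 4 source tokens.
rng = np.random.default_rng(0)
P = sinkhorn(rng.normal(size=(4, 4)))
```

The resulting matrix `P` can act as a soft translation: a distribution over target tokens multiplied by `P` yields a distribution over source tokens, which the pre-trained next-source-token predictor can consume. The "sparse" variant in S2T2 would additionally encourage most entries of `P` to be zero, which this sketch omits.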