The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty of sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach adapts a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language with a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
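The core embedding-initialization idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual implementation: the target tokens, source tokens, and alignment weights below are hypothetical placeholders standing in for a mapping derived from a real translation resource.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                             # toy embedding dimension
src_vocab = {"cat": 0, "dog": 1, "house": 2}      # source-language tokens
src_emb = rng.normal(size=(len(src_vocab), d))    # pretrained source embeddings

# Translation-derived mapping: target token -> [(source token, weight), ...]
# In practice these weights would come from a parallel lexicon or word
# alignments; here they are invented for illustration.
tgt_mapping = {
    "tok_a": [("cat", 0.7), ("dog", 0.3)],
    "tok_b": [("house", 1.0)],
}

def init_target_embedding(pairs):
    """Weighted average of the mapped source embeddings (weights normalized)."""
    tokens, weights = zip(*pairs)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    vecs = np.stack([src_emb[src_vocab[t]] for t in tokens])
    return w @ vecs

tgt_emb = np.stack([init_target_embedding(p) for p in tgt_mapping.values()])
print(tgt_emb.shape)  # one initialized vector per target token: (2, 8)
```

A target token mapped to a single source token simply inherits that token's embedding, while a token with several plausible translations starts from a blend of them, giving the adapted model a semantically grounded starting point before any target-language training.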