Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers, vocabulary, and pre-training data, resulting in higher usage costs for non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, the majority of previous work has focused on high-resource settings, assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion for LLMs in low-resource settings (i.e., low-resource languages and limited compute) has yet to be explored. In this paper, we investigate sample-efficient adaptation strategies from different angles, including target vocabulary size, initialization methods, and the amount of target data available for adaptation. Extensive experiments across typologically diverse languages, tasks, and models show that simpler heuristic-based embedding initialization is more efficient and robust to changes in target vocabulary size and adaptation data in low-resource settings, outperforming popular random initialization and a more sophisticated state-of-the-art approach that relies on external data and models.
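To make the idea of heuristic-based embedding initialization concrete, below is a minimal sketch of one common such heuristic: initializing each new target-language token's embedding as the mean of the embeddings of its subword pieces under the original tokenizer. This is an illustrative assumption, not necessarily the exact heuristic studied in the paper; the function names, the toy vocabulary, and the fallback to small-variance random initialization are all hypothetical.

```python
import numpy as np

def mean_subword_init(new_tokens, old_tokenize, old_embeddings, old_vocab, seed=0):
    """Initialize embeddings for new target-language tokens as the mean of the
    embeddings of their subword pieces under the original tokenizer.

    new_tokens:     list of new token strings to add to the vocabulary
    old_tokenize:   callable mapping a string to its subword pieces (strings)
    old_embeddings: (V, d) array, the original embedding matrix
    old_vocab:      dict mapping subword piece -> row index in old_embeddings
    Returns the expanded (V + len(new_tokens), d) embedding matrix.
    """
    rng = np.random.default_rng(seed)
    dim = old_embeddings.shape[1]
    new_rows = []
    for tok in new_tokens:
        pieces = old_tokenize(tok)  # decompose under the original tokenizer
        ids = [old_vocab[p] for p in pieces if p in old_vocab]
        if ids:
            # heuristic: average the embeddings of the constituent subwords
            new_rows.append(old_embeddings[ids].mean(axis=0))
        else:
            # fallback: small-variance random init when no piece is known
            new_rows.append(rng.normal(0.0, 0.02, size=dim))
    return np.vstack([old_embeddings, np.asarray(new_rows)])
```

A new token such as a whole word that the original tokenizer would split into several pieces thus starts from a semantically plausible point in embedding space rather than from noise, which is one intuition for why such simple heuristics can outperform random initialization when little adaptation data is available.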