Open-source large language models (OsLLMs) propel the democratization of natural-language research by giving researchers the flexibility to augment or update model parameters to improve performance. Nevertheless, like proprietary LLMs, OsLLMs perform worse on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus, and we show that our technique is effective with very little CPT data. In search of further improvement, we design a new algorithm to select tokens to add to the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extents of resource availability. For evaluation, we use IndicGenBench, a generation-task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary sizes and offer insights across language families.