Open-source large language models (OsLLMs) propel the democratization of natural-language research by giving researchers the flexibility to augment or update model parameters to improve performance. Nevertheless, like proprietary LLMs, OsLLMs perform worse on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus, and we show that our technique is effective with very little CPT data. In search of further improvement, we design a new algorithm to select tokens to add to the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extents of resource availability. For evaluation, we use IndicGenBench, a generation-task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary sizes and offer insights across language families.