Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which abundant training data exists. Recently, continual pre-training (CPT) has emerged as a means of adapting these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect on a very small corpus and benchmark them on the COLE suite. With under 1% of model parameters updated, our experiments show improvements on the minority dialect benchmarks with minimal regression on the prestige-language benchmarks. Analysis of the results shows that the gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by enabling cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.
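To make the adaptation recipe concrete, the following is a minimal sketch of LoRA-based continual pre-training with a causal language modeling objective, using the HuggingFace `transformers`, `peft`, and `datasets` libraries. The base model name, corpus path, and hyperparameters are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch of LoRA-based continual pre-training (causal LM objective).
# Model name, corpus path, and hyperparameters are hypothetical placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-3.1-8B"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Low-rank adapters on the attention projections; the trainable share stays well under 1%.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Small dialect corpus: one plain-text document per line (hypothetical path).
corpus = load_dataset("text", data_files={"train": "quebec_french_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qc-french-lora",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True,
                           logging_steps=50),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                           # continual pre-training on the dialect corpus
model.save_pretrained("qc-french-lora")   # saves adapter weights only
```

Because only the adapter weights are trained and saved, the same base model can serve both the prestige language and the dialect: the adapter is loaded when Québec French is needed and dropped otherwise, which is one way a setup like this keeps regression on the original benchmarks small.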