Despite advances in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often perform unevenly across languages. This study explores an alternative: adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance on downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.