Continual learning has emerged as an important research direction because retraining large language models (LLMs) from scratch whenever new data becomes available is infeasible. Of particular interest is the domain-adaptive pre-training (DAPT) paradigm, which continues training a pre-trained language model to adapt it to a domain it was not originally trained on. In this work, we evaluate the feasibility of DAPT in a low-resource setting, namely the Nepali language. We use synthetic data to continue training Llama 3 8B in a 4-bit QLoRA setting to adapt it to Nepali. We evaluate the adapted model on performance, forgetting, and knowledge acquisition. We compare the base and final models on their Nepali generation abilities and their performance on popular benchmarks, and conduct case studies to probe their linguistic knowledge of Nepali. We observe some unsurprising forgetting in the final model, but also, surprisingly, find that increasing the number of shots during evaluation yields larger percentage gains for the final model (up to 19.29%) than for the base model (4.98%), suggesting latent retention. We also examine layer-head self-attention heatmaps to establish the dependency-resolution abilities of the final model in Nepali.