Employing pre-trained Large Language Models (LLMs) has become the de facto standard in Natural Language Processing (NLP) despite their extensive data requirements. Motivated by the recent surge in research on training LLMs with limited data, particularly in low-resource domains and languages, this paper surveys transfer learning approaches for optimizing model performance in downstream tasks where data is scarce. We first address initial and continued pre-training strategies that better leverage prior knowledge in unseen domains and languages. We then examine how to maximize the utility of limited data during fine-tuning and few-shot learning. The final section takes a task-specific perspective, reviewing models and methods suited to different levels of data scarcity. Our goal is to provide practitioners with practical guidelines for overcoming the challenges posed by constrained data, while also highlighting promising directions for future research.