Large Language Models (LLMs) have driven remarkable advances in natural language processing, but their growing size raises significant challenges in computational cost. Small Language Models (SLMs), by contrast, are efficient, yet they often struggle with limited capacity and limited training data, especially in specialized domains. In this paper, we introduce a method for improving SLMs in the medical domain through LLM-based generative data augmentation, with the goal of building more efficient and capable models tailored to specialized applications. In experiments on the PubMedQA dataset, we show that LLMs can effectively refine and diversify existing question-answer pairs, and that this refinement improves the performance of a significantly smaller model after fine-tuning. Notably, our best SLM, with under 1.6 billion parameters, outperforms few-shot GPT-4 on PubMedQA. Our code and generated data are publicly available to facilitate further exploration.