Large Language Models (LLMs) have driven remarkable advances in natural language processing, but their growing size makes them computationally expensive. Small Language Models (SLMs) are efficient, yet they often struggle with limited capacity and scarce training data, especially in specialized domains. In this paper, we introduce a novel method for improving SLMs in the medical domain using LLM-based generative data augmentation. Our goal is to develop more efficient and capable models tailored to specialized applications. Through experiments on the PubMedQA dataset, we demonstrate that LLMs are effective at refining and diversifying existing question-answer pairs. This refinement leads to improved performance in a significantly smaller model after fine-tuning. Notably, our best SLM, with under 1.6 billion parameters, outperforms few-shot GPT-4 on the PubMedQA dataset. Our code and generated data are publicly available to facilitate further exploration.