Large Language Models are traditionally finetuned on large instruction datasets. However, recent studies suggest that small, high-quality datasets can suffice for general-purpose instruction following. This lack of consensus surrounding finetuning best practices is in part due to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small number of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks and open-ended, model-based evaluation. We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes, ranging from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.