The success of ChatGPT has recently attracted numerous efforts to replicate it, with instruction tuning being a key factor in achieving remarkable results. Instruction tuning not only significantly enhances a model's performance and generalization but also makes its generated output more consistent with human language patterns. However, current research rarely studies the impact of different amounts of instruction data on model performance, especially in real-world use cases. In this paper, we explore how instruction-tuned large language models perform across different scales of instruction data. For the experiments, we construct an evaluation dataset covering 12 major online use cases. With Bloomz-7B1-mt as the base model, the results show that (1) merely increasing the amount of instruction data leads to continuous improvement on tasks such as open-ended generation, whereas (2) on tasks such as math and code, the performance curve remains quite flat as the data size increases. We further analyze the possible causes of these phenomena and propose potential future research directions, such as effectively selecting high-quality training data, scaling base models, and developing training methods specialized for hard tasks. We will release our training and evaluation datasets, as well as model checkpoints.