Instruction tuning is a burgeoning method for eliciting the general intelligence of Large Language Models (LLMs). However, the creation of instruction data remains largely heuristic, leading to significant variation in the quantity and quality of existing datasets. While some research advocates expanding the number of instructions, other work suggests that a small set of well-chosen examples is sufficient. To better understand data construction guidelines, we provide a granular analysis of how data volume, parameter size, and data construction methods influence the development of each underlying ability of LLMs, such as creative writing, code generation, and logical reasoning. We present a meticulously curated dataset with over 40k instances spanning ten abilities and examine instruction-tuned models with 7B to 33B parameters. Our study yields three primary findings: (i) although overall model performance is tied to data and parameter scale, individual abilities differ in their sensitivity to these factors; (ii) human-curated data is markedly more efficient than synthetic data from GPT-4 and continues to improve model performance as its volume grows, a trend that synthetic data fails to match; (iii) instruction data confers strong cross-ability generalization, as evidenced by out-of-domain evaluations. Furthermore, we demonstrate how these findings can guide more efficient data construction, leading to practical performance improvements on two public benchmarks.