For Large Language Models (LLMs), the balance between the quality and the quantity of instruction data has become a focal point. We introduce a self-guided methodology that lets an LLM autonomously identify and select "cherry samples" from large open-source datasets, reducing the manual curation and potential cost of instruction tuning. Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its own generation capability. Applying IFD pinpoints cherry samples and markedly improves training efficiency. Empirical validation on well-known datasets such as Alpaca and WizardLM supports our findings: with only 10% of the conventional data input, our strategy achieves improved results. Combining self-guided cherry-picking with the IFD metric advances LLM optimization, promising gains that are both efficient and resource-conscious.
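The abstract does not spell out how IFD is computed or how samples are ranked, so the following is only a minimal sketch under stated assumptions. It assumes IFD is the ratio of the model's average loss on a response when conditioned on the instruction to its average loss on the same response with no instruction, and that samples with a score of 1 or more are discarded before keeping the highest-scoring fraction. The function names (`mean_nll`, `ifd_score`, `select_cherry`), the `max_score` cutoff, and the per-token log-probability inputs are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one response token")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(cond_logprobs: Sequence[float],
              uncond_logprobs: Sequence[float]) -> float:
    """Assumed IFD: loss of the response given the instruction, divided by
    the loss of the same response without any instruction."""
    direct = mean_nll(uncond_logprobs)
    if direct == 0.0:
        raise ValueError("unconditioned loss is zero; ratio is undefined")
    return mean_nll(cond_logprobs) / direct


def select_cherry(scores: Sequence[float],
                  fraction: float = 0.10,
                  max_score: float = 1.0) -> List[int]:
    """Return indices of the highest-scoring samples.

    Assumption: scores at or above max_score mean the instruction did not
    help the model, so those samples are dropped before ranking.
    """
    k = max(1, int(len(scores) * fraction))
    eligible = [i for i, s in enumerate(scores) if s < max_score]
    eligible.sort(key=lambda i: scores[i], reverse=True)
    return eligible[:k]
```

In practice the per-token log-probabilities would come from two forward passes of the model being tuned, one with the instruction prepended and one with the response alone. With `fraction=0.10` the selection matches the 10% data budget mentioned in the abstract.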