In instruction tuning of Large Language Models (LLMs), the balance between the quality and quantity of instruction data has become a focal point. We introduce a self-guided methodology that lets an LLM autonomously identify and select "cherry samples" from large open-source datasets, reducing the manual curation and potential cost of instruction tuning. Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between the responses a model is expected to produce and what it can generate on its own. Selecting cherry samples with IFD markedly improves training efficiency. Empirical validation on the widely used Alpaca and WizardLM datasets supports these findings: with only 10% of the data conventionally used, our strategy achieves improved results. Combining self-guided cherry-picking with the IFD metric offers a more efficient and resource-conscious way to optimize LLMs.
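The abstract names the IFD metric without defining it. The following is a minimal sketch, assuming IFD is the ratio of the model's average per-token loss on a response when the instruction is given to its average loss on the same response with no instruction, and assuming the per-token log-probabilities have already been obtained from the model. The function names `mean_nll`, `ifd_score`, and `select_cherry` are hypothetical, and the 10% default only mirrors the fraction mentioned above.

```python
from typing import Dict, List, Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(cond_logprobs: Sequence[float],
              uncond_logprobs: Sequence[float]) -> float:
    """Assumed IFD: loss on the response given the instruction, divided by
    loss on the response alone. Higher values mean the instruction helps
    the model less, i.e. the sample is harder to follow."""
    direct = mean_nll(uncond_logprobs)
    if direct == 0:
        raise ValueError("unconditioned loss is zero; ratio is undefined")
    return mean_nll(cond_logprobs) / direct


def select_cherry(scores: Dict[str, float], fraction: float = 0.10) -> List[str]:
    """Return the ids of roughly the top `fraction` of samples by score,
    highest first (always at least one sample)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda sample_id: scores[sample_id], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy log-probabilities, invented purely for illustration.
    toy = {
        "easy": ifd_score([-0.2, -0.2], [-1.0, -1.0]),
        "hard": ifd_score([-0.9, -0.9], [-1.0, -1.0]),
    }
    print(select_cherry(toy, fraction=0.5))
```

In practice the two log-probability sequences would come from two forward passes of the same model over the response tokens, one with the instruction prepended and one without; that step is omitted here because it depends on the model and tokenizer in use.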