In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a dataset of over a million Chinese instruction data points. Unlike traditional methods, Kun uses a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. The method's core contributions are an algorithmic advancement that enhances data retention and clarity, and a data generation approach that substantially reduces reliance on costly and time-consuming manual annotation. This methodology offers a scalable and efficient way to improve the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun
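To make the three stages named in the abstract concrete (instruction back-translation, self-curation, and answer polishment), the following is a minimal sketch of how such a pipeline could be wired together. It is not the implementation in the COIG-Kun repository: the function names, the `InstructionPair` type, the `threshold` parameter, the order of the stages, and the toy stand-ins for the model calls are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class InstructionPair:
    """One curated example (hypothetical container, not from the paper)."""
    instruction: str
    output: str
    score: float


def build_instruction_dataset(
    texts: Iterable[str],
    generate_instruction: Callable[[str], str],
    score_pair: Callable[[str, str], float],
    polish_answer: Callable[[str, str], str],
    threshold: float = 0.5,
) -> List[InstructionPair]:
    """Assumed pipeline: back-translate, self-curate, then polish."""
    kept: List[InstructionPair] = []
    for text in texts:
        # Back-translation: treat the unlabelled text as an answer and ask
        # a model for an instruction it could plausibly respond to.
        instruction = generate_instruction(text)
        # Self-curation: score the candidate pair and drop weak ones.
        score = score_pair(instruction, text)
        if score < threshold:
            continue
        # Answer polishment: rewrite the raw text into a cleaner response.
        polished = polish_answer(instruction, text)
        kept.append(InstructionPair(instruction, polished, score))
    return kept


# Toy stand-ins for the model calls (assumptions, for demonstration only).
def toy_generate(text: str) -> str:
    # Use the first sentence as the topic of the generated instruction.
    return "Explain the following topic: " + text.split(".")[0].strip()


def toy_score(instruction: str, text: str) -> float:
    # Pretend longer source texts make better answers; capped at 1.0.
    return min(len(text) / 100.0, 1.0)


def toy_polish(instruction: str, text: str) -> str:
    # Collapse runs of whitespace as a placeholder for real rewriting.
    return " ".join(text.split())


if __name__ == "__main__":
    corpus = [
        "Too short.",
        "Instruction tuning adapts a pretrained model to follow requests.   "
        "It usually relies on curated instruction-output pairs.",
    ]
    for pair in build_instruction_dataset(corpus, toy_generate, toy_score, toy_polish):
        print(pair)
```

In practice each of the three callables would wrap an LLM call; passing them in as parameters keeps the filtering logic testable without a model.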