Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation

In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a dataset of over a million Chinese instructional data points. Unlike traditional methods, it uses a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. The method's core contributions are an algorithmic advancement that enhances data retention and clarity, and a data generation approach that substantially reduces reliance on costly and time-consuming manual annotation. This methodology offers a scalable and efficient way to improve the instruction-following capabilities of LLMs, with implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun.
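
As a rough illustration of the pipeline the abstract describes (instruction back-translation, self-curation, and answer polishment), the following minimal sketch shows how the three stages might fit together. It is not the authors' implementation: the function `build_pairs`, the `Pair` type, the three callables, the `threshold` value, and the order in which curation and polishing are applied are all assumptions made for illustration. In practice each callable would wrap a language model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pair:
    instruction: str
    output: str


def build_pairs(
    documents: List[str],
    back_translate: Callable[[str], str],  # hypothetical: unlabelled text -> instruction
    score: Callable[[Pair], float],        # hypothetical: quality score for a pair
    polish: Callable[[Pair], str],         # hypothetical: rewrites the answer text
    threshold: float = 0.5,                # assumed cutoff, not taken from the paper
) -> List[Pair]:
    """Turn unlabelled documents into curated instruction-output pairs."""
    kept: List[Pair] = []
    for text in documents:
        # Back-translation: treat the raw text as the answer and
        # generate an instruction that it could plausibly respond to.
        candidate = Pair(instruction=back_translate(text), output=text)
        # Self-curation: drop candidates scored below the cutoff.
        if score(candidate) < threshold:
            continue
        # Answer polishment: rewrite the answer so it fits the instruction.
        kept.append(Pair(candidate.instruction, polish(candidate)))
    return kept


if __name__ == "__main__":
    # Toy stand-ins for the model calls, used only to exercise the flow.
    docs = ["Water boils at 100 degrees Celsius at sea level.", "lorem"]
    pairs = build_pairs(
        docs,
        back_translate=lambda t: "Explain the following fact: " + t[:20],
        score=lambda p: 1.0 if len(p.output.split()) >= 5 else 0.0,
        polish=lambda p: p.output.strip(),
    )
    for p in pairs:
        print(p.instruction, "->", p.output)
```

Passing the three stages in as callables keeps the control flow separate from any particular model, so the same loop can be exercised with toy functions as above or with real model-backed functions.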

