This paper introduces a multi-stage manual annotation pipeline calibrated by the scaling law, offering a method for acquiring high-quality Supervised Fine-Tuning (SFT) data in resource-constrained environments: limited GPU availability, restricted GPT access, and tight funding. We preprocessed 58k authentic chat messages and manually annotated 2.3k questions. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters; the best version improved the F1 score by 29.07 points. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) SFT training data in alpaca format, together with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data based on scaling-law principles. The scripts, raw data in alpaca format, and experiment tracking are open-sourced on GitHub (https://github.com/InternLM/HuixiangDou/tree/main/web/tools), HuggingFace (https://huggingface.co/tpoisonooo), and WandB (https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo). Use of the data has been authorized by the users involved; the SFT data and its license come from the ncnn contributors group.
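Since the released training data is in the alpaca format, the sketch below shows what one such SFT record looks like: a JSON object with `instruction`, `input`, and `output` fields. The example content is hypothetical, illustrating the message-classification task described above rather than an actual record from the dataset.

```python
import json

# Hypothetical example of a single alpaca-format SFT record;
# the field names (instruction / input / output) are the standard
# alpaca schema, but the content below is illustrative only.
record = {
    "instruction": "Determine whether the following chat message is a question.",
    "input": "how do I build ncnn on Windows?",
    "output": "yes",
}

# An alpaca dataset file is a JSON list of such records.
print(json.dumps([record], ensure_ascii=False, indent=2))
```

Records of this shape can be consumed directly by common SFT tooling that accepts the alpaca schema.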