This paper introduces a multi-stage manual annotation pipeline calibrated by the scaling law, offering a method for acquiring high-quality Supervised Fine-Tuning (SFT) data in resource-constrained environments: limited GPU availability, restricted GPT access, and tight funding. We preprocessed 58k authentic chat messages and manually annotated 2.3k questions. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters; the best version improved the F1 score by 29.07 points. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) SFT training data in alpaca format, together with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data based on scaling-law principles. The scripts, raw data in alpaca format, and experiment tracking are open-sourced on GitHub (https://github.com/InternLM/HuixiangDou/tree/main/web/tools), HuggingFace (https://huggingface.co/tpoisonooo), and WandB (https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo). Use of the data has been authorized by the users involved; the SFT data and its license come from the ncnn contributors group.
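Since the released training data is in the alpaca format, the sketch below shows what one such SFT record looks like: a JSON object with `instruction`, `input`, and `output` fields. The example content is hypothetical, illustrating the message-classification task described above rather than an actual record from the dataset.

```python
import json

# Hypothetical example of a single alpaca-format SFT record;
# the field names (instruction / input / output) are the standard
# alpaca schema, but the content below is illustrative only.
record = {
    "instruction": "Determine whether the following chat message is a question.",
    "input": "how do I build ncnn on Windows?",
    "output": "yes",
}

# An alpaca dataset file is a JSON list of such records.
print(json.dumps([record], ensure_ascii=False, indent=2))
```

Records of this shape can be consumed directly by common SFT tooling that accepts the alpaca schema.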