Foundation language models acquire instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, yet their definitions remain vague and lack quantitative analysis. In this work, we propose InsTag, an open-set fine-grained tagger that tags samples in SFT datasets according to their semantics and intentions, and we define instruction diversity and complexity in terms of these tags. We obtain 6.6K tags that describe a comprehensive range of user queries. We then analyze popular open-source SFT datasets and find that model ability grows with more diverse and complex data. Based on this observation, we propose an InsTag-based data selector that picks 6K diverse and complex samples from open-source datasets, and we fine-tune models on the selected data. The resulting models, TagLM, outperform open-source models trained on considerably larger SFT datasets when evaluated on MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag at https://github.com/OFA-Sys/InsTag.
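The abstract does not spell out how diversity and complexity are computed from tags, so the following is only a minimal sketch under stated assumptions: diversity is taken to be the number of distinct tags a dataset covers, complexity the mean number of tags per query, and the selector a greedy pass that visits queries from most to fewest tags and keeps those that contribute an unseen tag. The function names (`diversity`, `complexity`, `select`) and the toy tags are hypothetical and are not taken from the InsTag repository.

```python
from typing import Dict, List, Set


def diversity(tagged: Dict[str, Set[str]]) -> int:
    """Assumed metric: number of distinct tags covered by the dataset."""
    covered: Set[str] = set()
    for tags in tagged.values():
        covered |= tags
    return len(covered)


def complexity(tagged: Dict[str, Set[str]]) -> float:
    """Assumed metric: mean number of tags per query."""
    if not tagged:
        return 0.0
    return sum(len(tags) for tags in tagged.values()) / len(tagged)


def select(tagged: Dict[str, Set[str]], budget: int) -> List[str]:
    """Assumed selector: greedy, complexity-first, coverage-driven.

    Queries are visited from most to fewest tags (ties broken by id for
    determinism). A query is kept only if it adds at least one tag that
    the already-selected queries do not cover.
    """
    order = sorted(tagged, key=lambda q: (-len(tagged[q]), q))
    chosen: List[str] = []
    covered: Set[str] = set()
    for query_id in order:
        if len(chosen) >= budget:
            break
        if tagged[query_id] - covered:  # contributes a new tag
            chosen.append(query_id)
            covered |= tagged[query_id]
    return chosen


# Toy data: query id -> tags a tagger might assign (illustrative only).
toy = {
    "q1": {"math", "code"},
    "q2": {"math"},
    "q3": {"writing", "translation", "code"},
    "q4": {"poetry"},
}
selected = select(toy, budget=3)
```

On the toy data, `q2` is skipped because its only tag is already covered by `q1`, which illustrates how such a selector could favor both tag coverage and multi-tag queries within a fixed sample budget.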