Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
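The three-step pipeline described above (task-informed keyword generation, instruction creation across Bloom's Taxonomy levels, and self-consistency validation) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `llm` stub, the function names, and the 60% agreement threshold are all assumptions.

```python
import random
from collections import Counter

# Bloom's Taxonomy cognitive levels used to diversify instructions.
BLOOM_LEVELS = ["remember", "understand", "apply",
                "analyze", "evaluate", "create"]

def llm(prompt: str) -> str:
    """Stand-in for an LLM call (assumption); deterministic here."""
    return f"response to: {prompt[:40]}"

def generate_keywords(domain: str, n: int = 5) -> list[str]:
    # Step 1: task-informed keywords for domain coverage (stubbed).
    return [f"{domain}_kw{i}" for i in range(n)]

def build_instructions(domain: str) -> list[str]:
    # Step 2: pair each keyword with a Bloom's Taxonomy level.
    instructions = []
    for kw in generate_keywords(domain):
        level = random.choice(BLOOM_LEVELS)
        instructions.append(f"[{level}] Write a {domain} task about '{kw}'.")
    return instructions

def self_consistency_filter(instruction: str, k: int = 5,
                            threshold: float = 0.6) -> bool:
    # Step 3: keep an instruction only if repeated LLM answers agree
    # on a majority response (hypothetical threshold).
    answers = [llm(instruction) for _ in range(k)]
    _, count = Counter(answers).most_common(1)[0]
    return count / k >= threshold

dataset = [ins for ins in build_instructions("finance")
           if self_consistency_filter(ins)]
```

In practice the `llm` stub would be replaced by real model calls, and the self-consistency check would compare final answers (e.g., extracted numeric results) rather than raw response strings.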