Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data, and many recent works explore synthetic data generation using LLMs. However, these works primarily rely on prompt engineering with standard supervised instruction-finetuned models, which has a fundamental limitation: such models are optimized for general question answering and problem solving rather than for data generation. We propose a paradigm shift, \textbf{NOMAD}, that investigates how to train models specifically for data generation, and we demonstrate that this task differs significantly from classical LM training. We identify two key factors: no-prompt-masked training and proper training-set size selection. NOMAD shows substantial improvements over baselines, achieving gains of more than 4\% on TriviaQA and more than 2\% on GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of ``relevance'' and ``novelty''.
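For concreteness, the difference between standard prompt-masked SFT and no-prompt-masked training can be sketched as follows. This is a minimal illustration, not the paper's implementation; it assumes a Hugging Face-style causal LM where the label value \texttt{-100} is ignored by the cross-entropy loss, and the function name is ours.

\begin{verbatim}
import torch

def build_labels(input_ids: torch.Tensor,
                 prompt_len: int,
                 mask_prompt: bool) -> torch.Tensor:
    # Labels start as a copy of the inputs (next-token prediction).
    labels = input_ids.clone()
    if mask_prompt:
        # Standard SFT: prompt tokens are excluded from the loss
        # (-100 is the ignore index for HF cross-entropy).
        labels[:, :prompt_len] = -100
    # No-prompt-masked training (sketch of the NOMAD factor): prompt
    # tokens stay in the loss, so the model also learns to model the
    # instruction distribution itself -- useful for data generation.
    return labels
\end{verbatim}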