A common and effective way to improve a language model is to finetune a ``student'' model's parameters on generations from a more proficient ``teacher'' model. These generations, termed ``synthetic data'', are typically produced in full before student finetuning begins, though some work generates new synthetic samples as training progresses. This paper studies and advocates for the latter setting, in which data are generated in an iterative, closed-loop fashion guided by the current state of the student model. For a fixed budget of generated samples, or a fixed budget of compute spent querying the teacher, we show that this curation of finetuning data yields better student performance than static generation. Further, although several LLM-specific methods have been proposed for this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be the most performant. We validate these claims on four mathematical and logical reasoning datasets using four different small language models.
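The closed-loop regime described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: `student_uncertainty`, `teacher_generate`, and `finetune` are hypothetical stubs standing in for a real student scoring function (e.g. predictive entropy, a classic active-learning criterion), a real teacher LM query, and a gradient-based finetuning step, respectively.

```python
import random

random.seed(0)

# Hypothetical stub: in practice, an uncertainty score from the current
# student model (e.g. predictive entropy on `prompt`).
def student_uncertainty(prompt):
    return random.random()

# Hypothetical stub: in practice, a (more expensive) teacher LM query.
def teacher_generate(prompt):
    return f"solution for {prompt}"

# Hypothetical stub: in practice, a finetuning step on the new samples;
# here we just accumulate them to track what would be trained on.
def finetune(train_data, new_samples):
    train_data.extend(new_samples)

def closed_loop_curation(prompt_pool, budget, batch_size):
    """Each round: rank the remaining pool by the *current* student's
    uncertainty, query the teacher only for the top prompts, and
    finetune on the result -- never exceeding the teacher-query budget."""
    train_data = []
    spent = 0
    while spent < budget and prompt_pool:
        k = min(batch_size, budget - spent)
        # Active-learning selection guided by the student's current state.
        ranked = sorted(prompt_pool, key=student_uncertainty, reverse=True)
        batch = ranked[:k]
        samples = [(p, teacher_generate(p)) for p in batch]
        finetune(train_data, samples)  # student state changes here
        prompt_pool = [p for p in prompt_pool if p not in batch]
        spent += k
    return train_data

data = closed_loop_curation([f"q{i}" for i in range(20)], budget=8, batch_size=4)
print(len(data))  # 8 teacher-labeled samples, within the query budget
```

Static generation corresponds to a single round with `batch_size = budget`; the iterative variant re-ranks the pool between rounds, which is where the benefit described above arises.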