Instruction tuning is widely used to ensure that large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities depend heavily on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable way to provide LLMs with diverse, high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators because they ignore the compatibility between teachers and the base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.