Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs rely heavily on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable way to provide LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators, since they ignore the compatibility between teachers and the base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.