We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 model and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model. The distilled model outperforms both of its teachers as well as a similar model trained without distillation. This suggests that, when the teacher model is trained on a sufficiently small dataset, distillation can not only retain the teacher's full performance but exceed it, and can lead to significantly better performance than direct training.
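
To make the idea of distilling an ensemble into a single student concrete, the following is a minimal sketch of one common distillation objective for a single token position: a weighted sum of the usual cross-entropy on the true label and a temperature-scaled KL divergence to each teacher, averaged over teachers. The abstract does not specify the loss, so the function names, the `temperature` and `alpha` parameters, and their default values are all assumptions made for illustration, not the submission's actual settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list,
                               target_index, temperature=2.0, alpha=0.5):
    """Hypothetical ensemble distillation loss for one token position.

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    soft-label term. Both defaults are illustrative assumptions.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[target_index])

    # Soft-label term: KL(teacher || student) at the given temperature,
    # averaged over all teachers in the ensemble.
    student_soft = softmax(student_logits, temperature)
    soft_loss = 0.0
    for teacher_logits in teacher_logits_list:
        teacher_soft = softmax(teacher_logits, temperature)
        soft_loss += sum(
            p * (math.log(p) - math.log(q))
            for p, q in zip(teacher_soft, student_soft)
            if p > 0.0
        )
    soft_loss /= len(teacher_logits_list)

    # The temperature**2 factor keeps the soft-term gradients on a scale
    # comparable to the hard term, a common convention in distillation.
    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss
```

In this sketch the soft term vanishes when the student reproduces every teacher's distribution exactly, and setting `alpha=1.0` recovers ordinary training without distillation, which corresponds to the non-distilled baseline the abstract compares against.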