This study presents our submission to the Strict-Small Track of the 2nd BabyLM Challenge. We use a teacher-student distillation setup with the BabyLLaMa model (Timiryasov and Tastet, 2023) as a backbone. To make the student's learning process more focused, we replace the objective function with a reverse Kullback-Leibler divergence, known to cause mode-seeking (rather than mode-averaging) behaviour in computational learners. We further experiment with a single teacher (instead of an ensemble of two teachers) and implement additional optimization strategies to improve the distillation process. Our experiments show that, under reverse KL divergence, a single-teacher model often matches or outperforms multiple-teacher models across most tasks. Incorporating advanced optimization techniques further improves performance, demonstrating the effectiveness and robustness of our approach. These findings support our idea that "choosy babies need one coach".
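The mode-seeking behaviour of the reverse KL objective can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the student and teacher expose per-token logits as arrays, and the function names (`softmax`, `reverse_kl`) and the temperature parameter are illustrative choices. Reverse KL, KL(student ‖ teacher), penalizes the student for placing probability mass where the teacher assigns little, which pushes the student toward a single teacher mode rather than averaging across modes:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits, teacher_logits, temperature=1.0, eps=1e-12):
    """Reverse KL divergence KL(p_student || p_teacher), averaged over positions.

    The expectation is taken under the *student* distribution, so the loss
    is large wherever the student assigns mass that the teacher does not:
    this is the mode-seeking behaviour exploited in the paper.
    """
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    per_position = np.sum(p_s * (np.log(p_s + eps) - np.log(p_t + eps)), axis=-1)
    return float(per_position.mean())

# Toy example over a 3-word vocabulary at a single position.
student = np.array([[2.0, 0.5, -1.0]])
teacher = np.array([[2.0, 0.5, -1.0]])
print(reverse_kl(student, teacher))   # ~0.0: distributions agree

teacher_shifted = np.array([[0.0, 2.0, -1.0]])
print(reverse_kl(student, teacher_shifted))  # > 0: student mass sits off the teacher's mode
```

Contrast with forward KL, which takes the expectation under the teacher and therefore averages over all teacher modes; swapping the argument order in `reverse_kl` recovers that behaviour.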