We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabulary curriculum, we analyze methods for constraining the vocabulary in the early stages of training to simulate cognitively more plausible learning curves. In the data curriculum experiments, we vary the order of the training instances based on i) infant-inspired expectations and ii) the learning behavior of the model. In the objective curriculum, we explore different ways of combining the conventional masked language modeling task with a more coarse-grained word class prediction task to reinforce linguistic generalization capabilities. Across a range of linguistic benchmarks, our curricula do not yield consistent improvements over our own non-curriculum learning baseline; however, we do find marginal gains on select tasks. Our analysis identifies the specific combinations of tasks and settings that benefit from our proposed curricula. Moreover, we find that careful selection of model architecture and training hyper-parameters yields substantial improvements over the default baselines provided by the BabyLM Challenge.