Pre-trained Large Language Models (LLMs) have shown success on a diverse set of language inference and understanding tasks. The pre-training stage of LLMs relies on a large corpus of raw text. The BabyLM shared task compares LLM pre-training to human language acquisition: the number of tokens a child has seen by age 13 is orders of magnitude smaller than the number seen by LLMs. In this work, we pre-train LLMs on roughly the same number of tokens as seen by children and evaluate their ability to learn contextual word representations. We provide a strong set of baselines covering different architectures, an evaluation of how performance changes across epochs, and pre-training metrics for the strict-small and strict tracks of the task. We also loosely replicate the RoBERTa baseline provided by the task organizers to assess how robust training is to hyperparameter selection and how replicable the baseline is. This report also provides the details of our submissions to the strict and strict-small tracks.