This work explores the degree to which grammar acquisition is driven by language `simplicity' and by the source modality (speech vs. text) of the training data. Using BabyBERTa as a probe, we find that grammar acquisition is largely driven by exposure to speech data, and in particular by exposure to two of the BabyLM training corpora: AO-Childes and Open Subtitles. We arrive at this finding by examining various ways of presenting input data to our model. First, we assess the impact of various sequence-level, complexity-based curricula. We then examine the impact of learning over `blocks', i.e., spans of text that are balanced for the number of tokens drawn from each source corpus (rather than the number of lines). Finally, we explore curricula that vary the degree to which the model is exposed to different corpora. In all cases, we find that over-exposure to AO-Childes and Open Subtitles significantly drives performance. We verify these findings with a comparable control dataset in which exposure to these corpora, and to speech more generally, is limited by design. Our findings indicate that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data. We hope this encourages future research into the use of more developmentally plausible linguistic data (which tends to be scarcer) to augment general-purpose pre-training regimes.
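The distinction between the proportion of tokens and the proportion of training steps can be illustrated with a small sketch. It is not the setup used in this work: the two corpora, their sizes, and the helper `exposure_shares` are hypothetical, and the sketch assumes that every training step consumes a fixed number of lines, so that a corpus's share of steps equals its share of lines.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    name: str
    n_lines: int          # number of lines (training sequences)
    tokens_per_line: int  # average line length in tokens (hypothetical)


def exposure_shares(corpora):
    """Return {name: (token_share, step_share)} for each corpus.

    Assumption: each step draws a fixed number of lines, so the share
    of steps a corpus receives equals its share of lines.
    """
    total_tokens = sum(c.n_lines * c.tokens_per_line for c in corpora)
    total_lines = sum(c.n_lines for c in corpora)
    return {
        c.name: (
            c.n_lines * c.tokens_per_line / total_tokens,
            c.n_lines / total_lines,
        )
        for c in corpora
    }


# Hypothetical corpora with equal token counts but different line lengths.
corpora = [
    Corpus("speech_like", n_lines=1000, tokens_per_line=5),
    Corpus("text_like", n_lines=250, tokens_per_line=20),
]

for name, (token_share, step_share) in exposure_shares(corpora).items():
    print(f"{name}: token share {token_share:.0%}, step share {step_share:.0%}")
```

Under these made-up numbers both corpora hold half of the tokens, yet the short-line corpus receives 80% of the steps. Balancing by tokens and balancing by lines can therefore yield quite different exposure.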