This work explores the degree to which grammar acquisition is driven by language `simplicity' and by the source modality (speech vs. text) of the data. Using BabyBERTa as a probe, we find that grammar acquisition is largely driven by exposure to speech data, and in particular by exposure to two of the BabyLM training corpora: AO-Childes and Open Subtitles. We arrive at this finding by examining various ways of presenting input data to our model. First, we assess the impact of various sequence-level, complexity-based curricula. We then examine the impact of learning over `blocks' -- spans of text that are balanced for the number of tokens (rather than the number of lines) drawn from each of the source corpora. Finally, we explore curricula that vary the degree to which the model is exposed to different corpora. In all cases, we find that over-exposure to AO-Childes and Open Subtitles significantly drives performance. We verify these findings with a comparable control dataset in which exposure to these corpora, and to speech more generally, is limited by design. Our findings indicate that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data. We hope this encourages future research into the use of more developmentally plausible linguistic data (which tends to be scarcer) to augment general-purpose pre-training regimes.