Language models (LMs) have demonstrated remarkable proficiency in generating linguistically coherent text, sparking discussions about their relevance to understanding human language learnability. However, a significant gap exists between the training data for these models and the linguistic input a child receives: LMs are typically trained on data that is orders of magnitude larger than, and fundamentally different from, child-directed speech (Warstadt and Bowman, 2022; Warstadt et al., 2023; Frank, 2023a). To address this discrepancy, our research focuses on training LMs on subsets of a single child's linguistic input. Previously, Wang, Vong, Kim, and Lake (2023) found that LMs trained in this setting can form syntactic and semantic word clusters and develop sensitivity to certain linguistic phenomena, but they considered only LSTMs and simpler neural networks trained on a single single-child dataset. Here, to examine the robustness of learnability from single-child input, we systematically train six different model architectures on five datasets (three single-child and two baselines). We find that the models trained on single-child datasets yield consistent results that match previous work, underscoring the robustness of forming meaningful syntactic and semantic representations from a subset of a child's linguistic input.