While current large language models have achieved remarkable success, their data efficiency remains a challenge to overcome. Recently, it has been suggested that child-directed speech (CDS) can improve the training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors, such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.