While current large language models have achieved remarkable success, their data efficiency remains a challenge to overcome. Recently, it has been suggested that child-directed speech (CDS) can improve the training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors, such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.