We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. This shared task is intended for participants with an interest in small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, we provide a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track restricts only the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome). We will release a shared evaluation pipeline that scores models on a variety of benchmarks and tasks, including targeted syntactic evaluations and natural language understanding.