Following last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The main changes for this year's competition are as follows. First, we replace the loose track with a paper track, which allows, for example, non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we relax the rules around pretraining data and now allow participants to construct their own datasets, provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for language model training. The purpose of this CfP is to provide the rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline for this year's competition, and answer frequently asked questions from last year's challenge.