The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (<100M). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to far fewer than 100M words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) on the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but they also highlight the continued importance of, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.