The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (<100M). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to far fewer than 100M words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) on the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but they also highlight the continued importance of, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.