Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, few existing resources provide large amounts of standardized, easily accessible IGT data, limiting the applicability of such data to linguistic research and making it difficult to use in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of gloss labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming state-of-the-art (SOTA) models by up to 6.6\%. Our pretrained model and dataset are available on Hugging Face.
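As a point of reference for the IGT format, a canonical entry aligns a morphologically segmented source line with morpheme-level gloss labels and a free translation. The sketch below reuses a standard German example from the Leipzig Glossing Rules; it is illustrative only and is not drawn from our corpus:

\begin{verbatim}
unser-n      Väter-n
our-DAT.PL   father-DAT.PL
'to our fathers'
\end{verbatim}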