Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, few existing resources provide large amounts of standardized, easily accessible IGT data, limiting the applicability of such data to linguistic research and making it difficult to use in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. We will make our pretrained model and dataset available through Hugging Face, as well as provide access through a web interface for use in language documentation efforts.