Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, few existing resources provide large amounts of standardized, easily accessible IGT data, limiting the applicability of such data to linguistic research and making it difficult to use in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. We will make our pretrained model and dataset available through Hugging Face, as well as provide access through a web interface for use in language documentation efforts.