Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, few existing resources provide large amounts of standardized, easily accessible IGT data, limiting the applicability of such data to linguistic research and making it difficult to use in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of gloss labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming state-of-the-art (SOTA) models by up to 6.6\%. Our pretrained model and dataset are available on Hugging Face.
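As a point of reference for the IGT format, a canonical entry aligns a morphologically segmented source line with morpheme-level gloss labels and a free translation. The sketch below reuses a standard German example from the Leipzig Glossing Rules; it is illustrative only and is not drawn from our corpus:

\begin{verbatim}
unser-n      Väter-n
our-DAT.PL   father-DAT.PL
'to our fathers'
\end{verbatim}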