The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

Datasets available for crosslinguistic investigations have tended to offer either large amounts of data for a small group of languages or small amounts of data for a large number of languages. As a result, claims based on these datasets are limited in what they can reveal about universal properties of the human language faculty. While this has begun to change through projects that develop tagged corpora for a large number of languages, such efforts are still constrained by limited resources. The current paper reports on a large tagged parallel dataset developed to partially address this issue. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates, dwarfing previously available resources. The accuracy of particular tags in this dataset is shown to correlate well with both existing state-of-the-art (SOTA) taggers for high-resource languages (spaCy, Trankit) and hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of intransitive word order in three typological databases (WALS, Grambank, AUTOTYP), such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic intransitive word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step toward enabling corpus-based crosslinguistic investigations, and it is made available for research and collaboration via GitHub.
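To make the classification step concrete, the following is a minimal sketch of a single-feature Gaussian Naive Bayes classifier, written in plain Python rather than with any particular library. It is not the authors' implementation. The N1 ratio is treated here as an opaque per-language number, since its definition is not restated in this abstract. The training values, the "SV"/"VS" label names, the direction of the association between ratio and word order, and the function names `fit_gaussian_nb` and `predict` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(values, labels, var_floor=1e-9):
    """Fit a one-feature Gaussian Naive Bayes model.

    Returns a dict mapping each label to (log_prior, mean, variance).
    """
    grouped = defaultdict(list)
    for x, y in zip(values, labels):
        grouped[y].append(x)

    total = len(values)
    model = {}
    for label, xs in grouped.items():
        mean = sum(xs) / len(xs)
        # Maximum-likelihood variance, with a small floor to avoid division by zero.
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + var_floor
        model[label] = (math.log(len(xs) / total), mean, var)
    return model


def predict(model, x):
    """Return the label with the highest log posterior for feature value x."""
    def log_posterior(params):
        log_prior, mean, var = params
        log_likelihood = -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return log_prior + log_likelihood

    return max(model, key=lambda label: log_posterior(model[label]))


# Synthetic toy values, NOT taken from the taggedPBC or any typological database.
# Assumption for illustration only: higher ratios pattern with "SV" order.
train_ratios = [0.4, 0.5, 0.6, 1.8, 2.0, 2.2]
train_orders = ["VS", "VS", "VS", "SV", "SV", "SV"]

model = fit_gaussian_nb(train_ratios, train_orders)
print(predict(model, 0.45))
print(predict(model, 2.1))
```

In practice, the labels would come from the word-order codings in the typological databases, and the fitted model would then be applied to languages that those databases do not cover.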
