The URIEL+ linguistic knowledge base supports multilingual research by encoding languages as geographic, genetic, and typological vectors. However, data sparsity remains prevalent: feature types are missing, language entries are incomplete, and genealogical coverage is limited. This sparsity limits the usefulness of URIEL+ for cross-lingual transfer, particularly in support of low-resource languages. To address it, this paper extends URIEL+ with three contributions: script vectors that represent writing system properties for 7,488 languages, an integration of Glottolog that adds 18,710 languages, and an expansion of lineage imputation to 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. On cross-lingual transfer benchmarks oriented around low-resource languages, results occasionally diverge from those of URIEL+, with gains of up to 6% in certain setups. These advances make URIEL+ more complete and inclusive for multilingual research.
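To make the idea of propagating features across genealogies concrete, the following is a minimal sketch and not the URIEL+ implementation. It assumes a simple rule: a language's missing feature is filled with the value held by its nearest ancestor in the family tree. The function name `impute_from_lineage`, the tree, the language names, and the feature names are all hypothetical, and the actual imputation rule and propagation direction used in the paper may differ.

```python
from typing import Dict, List, Optional

def impute_from_lineage(
    parent: Dict[str, Optional[str]],
    features: Dict[str, Dict[str, Optional[int]]],
    feature_names: List[str],
) -> Dict[str, Dict[str, Optional[int]]]:
    """Fill each missing feature with the value of the nearest ancestor that has one.

    Assumed rule for illustration only: walk up the genealogy until a
    non-missing value is found; leave the feature as None if no ancestor has it.
    """
    imputed: Dict[str, Dict[str, Optional[int]]] = {}
    for lang in parent:
        row: Dict[str, Optional[int]] = {}
        for name in feature_names:
            node: Optional[str] = lang
            value: Optional[int] = None
            seen = set()  # guards against accidental cycles in the parent map
            while node is not None and node not in seen:
                seen.add(node)
                value = features.get(node, {}).get(name)
                if value is not None:
                    break
                node = parent.get(node)
            row[name] = value
        imputed[lang] = row
    return imputed

# Hypothetical toy genealogy: family_x -> branch_a -> {lang_1, lang_2}
parent = {
    "family_x": None,
    "branch_a": "family_x",
    "lang_1": "branch_a",
    "lang_2": "branch_a",
}
# Hypothetical sparse features (None or absent means missing)
features = {
    "family_x": {"uses_latin_script": 1, "sov_order": None},
    "branch_a": {"sov_order": 1},
    "lang_1": {"uses_latin_script": 0},
    "lang_2": {},
}
result = impute_from_lineage(parent, features, ["uses_latin_script", "sov_order"])
```

In this toy example, `lang_2` has no observed features and inherits `sov_order` from `branch_a` and `uses_latin_script` from `family_x`, while `lang_1` keeps its own observed script value. Features with no observed value anywhere in the lineage stay missing.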