Typological information has the potential to benefit the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information, such as linguistic grammars. Some of these inconsistencies stem from coding errors or linguistic variation, but many of the disagreements are due to the discrete categorical nature of these databases. We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP, covering the past and present. We then turn to the future of such work, arguing that a continuous view of typological features is clearly beneficial, echoing recommendations from linguistics. We propose that such a view of typology has significant potential, including for language modeling in low-resource scenarios.