Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that reduced-size representations of language typology can yield more informative distance metrics and improve performance in multilingual NLP applications.
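As a rough illustration of what combining imputation, feature selection, and a distance metric over a sparse typological matrix can look like, the following is a minimal sketch and not the method proposed here. Everything in it is an assumption made for illustration: the toy feature matrix, the use of column-mean imputation, variance as the selection criterion, cosine distance, the order of the steps, and all function names (`impute_column_means`, `select_by_variance`, `cosine_distance`, `reduced_distance`). It does not use the URIEL+ API.

```python
import math

# Toy matrix: rows are languages, columns are typological features.
# None marks a missing value. The data is invented for illustration only.
TOY_FEATURES = [
    [1, 0, None, 1],
    [1, 1, 0, 1],
    [1, None, 1, 1],
]


def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column.

    A column with no observed values is filled with 0.0 (an arbitrary choice).
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if r[j] is None else float(r[j]) for j in range(n_cols)]
        for r in rows
    ]


def select_by_variance(rows, k):
    """Return the indices of the k highest-variance columns, in ascending order."""
    n = len(rows)
    n_cols = len(rows[0])
    variances = []
    for j in range(n_cols):
        col = [r[j] for r in rows]
        mu = sum(col) / n
        variances.append(sum((x - mu) ** 2 for x in col) / n)
    top = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)[:k]
    return sorted(top)


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def reduced_distance(rows, k, i, j):
    """Impute, keep k features, then measure the distance between rows i and j."""
    filled = impute_column_means(rows)
    keep = select_by_variance(filled, k)
    u = [filled[i][c] for c in keep]
    v = [filled[j][c] for c in keep]
    return cosine_distance(u, v)
```

In this toy matrix the first and last columns are constant across languages, so a variance criterion discards them and the distance is computed over the two informative columns only. A real pipeline would likely use a more careful imputation strategy and a selection criterion tied to the evaluation tasks.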