Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic, and typological distances for cross-lingual transfer, but they suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address both gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent-variable model for typology. We then unify these signals into a robust, task-agnostic composite distance. Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains on most tasks.
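To make the idea of hyperbolic embeddings for genealogy more concrete, the following is a minimal sketch of a geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not state which hyperbolic model or training procedure is used, so the choice of the Poincaré ball, the function name `poincare_distance`, and the toy coordinates are all assumptions made for illustration only.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: the Poincare ball is used here only as an illustrative model
    of hyperbolic space; the source does not specify the model it adopts.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical 2-D embeddings: a family root near the origin and two
# languages from different branches placed near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_a = poincare_distance(root, leaf_a)
d_a_b = poincare_distance(leaf_a, leaf_b)
```

In this toy configuration the distance between the two leaves is nearly the sum of their distances to the root, which is the tree-like behavior that makes hyperbolic space a natural fit for genealogical hierarchies.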