In multilingual translation research, understanding and exploiting language families is important. However, clustering languages solely by their ancestral families can yield suboptimal results, because the datasets used to train the model vary. To address this, we introduce a method that uses the Fisher information matrix (FIM) to cluster languages according to the characteristics of the multilingual translation model itself. We hypothesize that language pairs with similar effects on model parameters share a considerable degree of linguistic congruence and should therefore be grouped together. This idea leads us to define pseudo language families. We discuss in depth how these pseudo language families are constructed and applied. Empirical evaluations show that, when adapting a multilingual translation model to unfamiliar language pairs, pseudo language families improve performance over conventional language families. The method may also extend to other scenarios that require measuring language similarity. The source code and associated scripts are available at https://github.com/ecoli-hit/PseudoFamily.
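The grouping idea in the abstract can be illustrated with a small sketch. The abstract does not specify how the FIM is estimated or how similarity between language pairs is scored, so everything below is an assumption made for illustration: a diagonal empirical Fisher estimate (mean of squared per-example gradients), a top-k parameter mask per language pair, Jaccard overlap as the similarity score, and a fixed threshold for membership. The function names, the language-pair labels, and the synthetic gradients are hypothetical and are not taken from the linked repository.

```python
import numpy as np


def diagonal_fisher(grads):
    """Empirical diagonal Fisher estimate (an assumed simplification):
    mean of squared per-example gradients, shape (num_examples, num_params)."""
    grads = np.asarray(grads, dtype=float)
    return np.mean(grads ** 2, axis=0)


def top_k_mask(fisher, k):
    """Boolean mask marking the k parameters with the largest Fisher values."""
    mask = np.zeros(fisher.shape, dtype=bool)
    mask[np.argsort(fisher)[-k:]] = True
    return mask


def overlap(mask_a, mask_b):
    """Jaccard overlap between two parameter masks (an assumed similarity score)."""
    union = np.logical_or(mask_a, mask_b).sum()
    return np.logical_and(mask_a, mask_b).sum() / union if union else 0.0


def pseudo_family(target, fishers, k, threshold):
    """Language pairs whose top-k Fisher mask overlaps the target's mask
    by at least `threshold`, most similar first."""
    target_mask = top_k_mask(fishers[target], k)
    scores = {
        name: overlap(target_mask, top_k_mask(fisher, k))
        for name, fisher in fishers.items()
        if name != target
    }
    members = [name for name, score in scores.items() if score >= threshold]
    return sorted(members, key=lambda name: -scores[name])


# Synthetic gradients standing in for a real model: each hypothetical
# language pair has large gradients on a few "active" parameters only.
rng = np.random.default_rng(0)


def toy_grads(active, num_params=8, num_examples=64):
    scale = np.full(num_params, 0.1)
    scale[active] = 1.0
    return rng.normal(size=(num_examples, num_params)) * scale


fishers = {
    "uk-en": diagonal_fisher(toy_grads([0, 1, 2])),
    "ru-en": diagonal_fisher(toy_grads([0, 1, 3])),
    "ja-en": diagonal_fisher(toy_grads([5, 6, 7])),
}

family = pseudo_family("uk-en", fishers, k=3, threshold=0.4)
print(family)
```

In this toy setup, "ru-en" shares two of its three most sensitive parameters with "uk-en" and is grouped with it, while "ja-en" shares none and is left out. With a real translation model, the gradients would come from backpropagating the translation loss on held-out data for each language pair, and the choices of k and threshold would need tuning.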