Understanding the distance between human languages is central to linguistics, anthropology, and the tracing of human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the attention mechanisms that emerge spontaneously in these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, which we term Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses with artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.
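To make the core idea concrete, the following is a minimal Python sketch of an optimal-transport comparison between two attention matrices. It is an illustration only, not the paper's exact ATD formulation: the helper names (`attention_to_distribution`, `sinkhorn_distance`), the row-averaging of attention into a single distribution over source positions, and the purely positional cost matrix are all assumptions introduced here for clarity.

```python
import numpy as np


def attention_to_distribution(attn: np.ndarray) -> np.ndarray:
    """Collapse an attention matrix (target_len x source_len) into a single
    probability distribution over source positions by averaging its rows."""
    p = attn.mean(axis=0)
    return p / p.sum()


def sinkhorn_distance(p: np.ndarray, q: np.ndarray, cost: np.ndarray,
                      reg: float = 1.0, n_iter: int = 200) -> float:
    """Entropically regularized optimal-transport cost between p and q,
    computed with plain Sinkhorn fixed-point iterations."""
    K = np.exp(-cost / reg)          # Gibbs kernel derived from the cost matrix
    u = np.ones_like(p)
    for _ in range(n_iter):          # alternate scaling updates until (approx.) converged
        v = q / (K.T @ u)
        u = p / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)   # approximate transport plan
    return float(np.sum(plan * cost))    # transport cost under that plan


# Toy usage with two hypothetical attention matrices of shape (target_len, source_len).
rng = np.random.default_rng(0)
attn_a = rng.dirichlet(np.ones(8), size=10)   # rows sum to 1, like attention weights
attn_b = rng.dirichlet(np.ones(8), size=10)
positions = np.arange(8)
cost = np.abs(positions[:, None] - positions[None, :]).astype(float)  # assumed positional cost
p, q = attention_to_distribution(attn_a), attention_to_distribution(attn_b)
print(sinkhorn_distance(p, q, cost))
```

In practice, the attention matrices would come from a pretrained multilingual model translating parallel sentences, and the resulting per-pair transport costs would be aggregated into a language-by-language distance matrix; the aggregation and cost-matrix choices above are placeholders for whatever the full method specifies.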