The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing an LLM's internal representations of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores, underscoring the effectiveness of our metric in assessing language-specific capabilities. Moreover, our experiments show a strong correlation between an LLM's performance in a given language and the proportion of that language in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.
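The core comparison described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that token-level hidden states have already been extracted from one LLM layer for parallel text in each language, mean-pools them into a single vector per language, and scores each language by cosine similarity against the English baseline. The arrays below are synthetic stand-ins for real activations.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two representation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_score(lang_states: np.ndarray, english_states: np.ndarray) -> float:
    """Mean-pool (num_tokens, hidden_dim) hidden states into one vector
    per language, then compare against the English baseline."""
    return cosine_similarity(lang_states.mean(axis=0), english_states.mean(axis=0))

# Synthetic stand-ins for layer activations over parallel sentences.
rng = np.random.default_rng(0)
english = rng.normal(size=(12, 64))
# A "high-resource" language: representations close to English.
german = english + rng.normal(scale=0.3, size=(12, 64))
# A "low-resource" language: representations largely unrelated to English.
low_resource = rng.normal(size=(12, 64))

# The metric ranks the high-resource language above the low-resource one.
assert language_score(german, english) > language_score(low_resource, english)
```

Ranking languages then reduces to sorting them by this score; any representation-space similarity (e.g. averaged over layers) could be substituted for the simple mean-pooled cosine used here.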