Evaluation of multilingual Large Language Models (LLMs) is challenging due to several factors: the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks in LLM pre-training data, and the absence of local cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages, conducting 90K human evaluations and 30K LLM-based evaluations, and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but agreement drops for direct assessment, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.