Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: the cost of exhaustive evaluation scales linearly with the number of languages; automatic translation introduces errors that are easily missed at scale; and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory (IRT) with per-language difficulty deviations, a split discriminability that separates content effects from language effects, and per-language ability residuals. Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: (1) predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline; (2) surfacing candidate translation errors that are distributed across all 28 non-English languages, whereas accuracy-based baselines concentrate their detections in a few languages; and (3) recovering culture-specific items that accuracy-based baselines miss.
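To make the three parameter groups named in the abstract concrete, the following is a minimal sketch of how such a model could score one (item, LLM, language) instance. The functional form is an assumption for illustration only: the abstract does not state the equation, so the additive split of the logit into a content term and a language term, and all names (`theta`, `theta_res`, `a_content`, `a_lang`, `b`, `b_dev`), are hypothetical and may differ from the authors' parameterization.

```python
import math


def response_prob(theta, theta_res, a_content, a_lang, b, b_dev):
    """Probability of a correct answer under an ASSUMED 2PL-style form.

    theta      -- overall ability of the LLM (hypothetical name)
    theta_res  -- per-language ability residual of the LLM
    a_content  -- discriminability attributed to item content
    a_lang     -- discriminability attributed to language effects
    b          -- base difficulty of the item
    b_dev      -- per-language difficulty deviation of the item
    """
    # Assumed split: a content term plus a language term in the logit.
    logit = a_content * (theta - b) + a_lang * (theta_res - b_dev)
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy of one observed outcome y (0 or 1) given p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


if __name__ == "__main__":
    # Illustrative values only, not fitted parameters from the paper.
    p_clean = response_prob(0.5, 0.0, 1.2, 0.8, 0.0, 0.0)
    p_shifted = response_prob(0.5, 0.0, 1.2, 0.8, 0.0, 1.5)
    print(p_clean > p_shifted)  # a positive deviation makes the item harder
    print(binary_cross_entropy(1, p_shifted))
```

Under this assumed form, an item with an unusually large `b_dev` in one language would be a natural candidate for a translation-error check, and the binary cross-entropy on held-out instances corresponds to the prediction metric quoted in the abstract.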