Cultural biases in multilingual datasets pose significant challenges to their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both issues on multilingual evaluations and the resulting model performance. Our large-scale evaluation of state-of-the-art open and proprietary models shows that progress on MMLU depends heavily on learning Western-centric concepts: 28% of all questions require culturally sensitive knowledge, and of the questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions. Model rankings change depending on whether models are evaluated on the full dataset or on the subset of questions annotated as culturally sensitive, demonstrating how blindly relying on translated MMLU distorts rankings. We release Global-MMLU, an improved MMLU with evaluation coverage across 42 languages. We improve overall quality by engaging compensated professional and community annotators to verify translation quality, while also rigorously evaluating the cultural biases present in the original dataset. Global-MMLU further includes designated subsets labeled as culturally sensitive and culturally agnostic, allowing for more holistic and complete evaluation.