We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some emerging as bottlenecks, and Korean-specific questions show gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weaknesses in convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and understanding of domain-specific standards. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert-level real-world tasks.