We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a broad set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve a mean accuracy of only 66%. Further analysis shows that the primary source of failure is not OCR, but rather grounding and reasoning over combined textual and visual evidence. Code and data are available at https://vmmu-bench.github.io/