KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination

We introduce KorMedMCQA-V, a multimodal multiple-choice question answering benchmark in the style of the Korean medical licensing exam, designed for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images drawn from Korean Medical Licensing Examinations (2012-2023); about 30% of the questions contain multiple images and require integrating evidence across them. The images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs, both proprietary and open-source, spanning general-purpose, medical-specialized, and Korean-specialized families, under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) achieves 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) reaches only 43.2%. We further find that (i) reasoning-oriented model variants gain up to 20 percentage points over their instruction-tuned counterparts, (ii) medical domain specialization yields inconsistent gains over strong general-purpose baselines, (iii) all models degrade on multi-image questions, and (iv) performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: https://huggingface.co/datasets/seongsubae/KorMedMCQA-V.
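
The accuracy figures and the single- versus multi-image comparison described above could be computed along the following lines. This is only a minimal sketch and not the authors' evaluation code: the record fields (`answer`, `prediction`, `num_images`), the helper names, and the toy records are all hypothetical, and the actual dataset schema may differ.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: accuracy} for a list of multiple-choice predictions.

    Each record is assumed (hypothetically) to carry the gold option in
    rec["answer"] and the model's chosen option in rec["prediction"].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = group_key(rec)
        total[group] += 1
        if rec["prediction"] == rec["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def image_count_group(rec):
    """Bucket a question by whether it has one image or several."""
    return "multi-image" if rec["num_images"] > 1 else "single-image"


# Toy records invented purely for illustration; not taken from the benchmark.
toy_records = [
    {"answer": 3, "prediction": 3, "num_images": 1},
    {"answer": 1, "prediction": 1, "num_images": 1},
    {"answer": 5, "prediction": 2, "num_images": 2},
    {"answer": 4, "prediction": 4, "num_images": 3},
]

scores = accuracy_by_group(toy_records, image_count_group)
print(scores)
```

The same helper can produce a per-modality breakdown by passing a different `group_key`, for example one that reads a (hypothetical) modality field from each record.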
