Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese visual question answering (VQA) benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, which undermines evaluation reliability and leads to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.