While the evaluation of English-centric multimodal models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks and evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive Ukrainian-centric multimodal benchmark derived from the standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and the humanities. We evaluated both open-source models and API providers and found that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset, translated the VQA benchmark into Ukrainian, and measured performance degradation relative to the original English versions. Lastly, we tested a few models from a cultural perspective, probing their knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and that our approach could be useful for other low-resource languages.