Large vision-language models (LVLMs) have made substantial advances in Olympiad-level reasoning tasks. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and do not test whether a model can use contextual information spread across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. In extensive experiments on OMIBench, we observe meaningful performance gaps in existing models: even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% accuracy on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.