Multimodal Large Language Models (MLLMs) increasingly support omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs, and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle most when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.