OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Yidan Wen, Yanghai Wang, Shihao Li, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin

From arXiv. Accepted at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).

Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains underexplored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, which ensure that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and/or audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. To address this gap, we curate OmniInstruct, an instruction tuning dataset of 84.5K training samples, for training OLMs to adapt to tri-modal contexts. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLMs. Code, data, and a live leaderboard can be found at https://m-a-p.ai/OmniBench.
