Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to process and reason across different inputs simultaneously remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even when given textual alternatives to the image and audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction-tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data are available at our repository (https://github.com/multimodal-art-projection/OmniBench).