Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to process and reason across different inputs simultaneously remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even when given textual alternatives to the image and audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction-tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data are available at our repository (https://github.com/multimodal-art-projection/OmniBench).