From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of this gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini along with 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, and the qualitative results are then summarized into 12 scores (i.e., 4 modalities times 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.
