From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, a wide gap remains between the performance of recent MLLM-based applications and the expectations of the general public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper seeks to improve understanding of that gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, with the ultimate aim of improving the transparency of MLLMs. We believe these three properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini together with 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, and the qualitative results are then summarized into 12 scores (i.e., 4 modalities times 3 properties). In total, we uncover 14 empirical findings that help in understanding the capabilities and limitations of both proprietary and open-source MLLMs, with a view to more reliable downstream multi-modal applications.


