Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, a wide gap remains between the performance of recent MLLM-based applications and the expectations of the general public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper seeks to improve understanding of that gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities (i.e., text, code, image, and video), with the ultimate aim of improving the transparency of MLLMs. We believe these three properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini along with 6 open-source LLMs and MLLMs. In total, we evaluate 230 manually designed cases and summarize the qualitative results into 12 scores (i.e., 4 modalities times 3 properties). From these, we uncover 14 empirical findings that help clarify the capabilities and limitations of both proprietary and open-source MLLMs, toward more reliable downstream multi-modal applications.