Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of this gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini and 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, and the qualitative results are then summarized into 12 scores (i.e., 4 modalities times 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, toward more reliable downstream multi-modal applications.