While large language models (LLMs) are still being adapted to new domains and used in novel applications, a new generation of foundation models is arriving: multi-modal large language models (MLLMs). These models integrate verbal and visual information, opening new possibilities for more complex reasoning at the intersection of the two modalities. However, despite the revolutionary promise of MLLMs, our understanding of their reasoning abilities is limited. In this study, we assess the nonverbal abstract reasoning abilities of open-source and closed-source MLLMs using variations of Raven's Progressive Matrices. Our experiments expose the difficulty of solving such problems and showcase the immense gap between open-source and closed-source models. We also reveal critical shortcomings in the individual visual and textual modules, which impose low performance ceilings on the models. Finally, to improve MLLMs' performance, we experiment with various methods, such as Chain-of-Thought prompting, resulting in a significant (up to 100%) boost in performance.