Multi-modal Large Language Models (MLLMs) have exhibited impressive capabilities. However, many deficiencies of MLLMs relative to human intelligence have recently been identified, $\textit{e.g.}$, hallucination. To drive the study of MLLMs, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: $\textbf{association}$, a human's basic capability to link observations with prior practice memory. To comprehensively investigate MLLMs' performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient $\textbf{annotation-free}$ construction method that transforms a general dataset into one suited to our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous association. Moreover, we conduct a comprehensive investigation into MLLMs' zero-shot association capabilities across multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability on our association tasks; even the state-of-the-art GPT-4V(ision) shows a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. $\textit{Our data and code are available at:}$ https://mvig-rhos.com/llm_inception.