Multi-modal Large Language Models (MLLMs) have exhibited impressive capabilities. However, many deficiencies of MLLMs relative to human intelligence have recently been identified, $\textit{e.g.}$, hallucination. To drive the study of MLLMs, the community has dedicated efforts to building larger benchmarks with complex tasks. In this paper, we propose benchmarking an essential but usually overlooked intelligence: $\textbf{association}$, a human's basic capability to link observations with prior practice memory. To comprehensively investigate MLLMs' performance on association, we formulate the association task and devise a standard benchmark based on adjective and verb semantic concepts. Instead of costly data annotation and curation, we propose a convenient $\textbf{annotation-free}$ construction method that transforms a general dataset into one suited to our association tasks. Simultaneously, we devise a rigorous data refinement process to eliminate confusion in the raw dataset. Building on this database, we establish three levels of association tasks: single-step, synchronous, and asynchronous association. Moreover, we conduct a comprehensive investigation into MLLMs' zero-shot association capabilities across multiple dimensions, including three distinct memory strategies, both open-source and closed-source MLLMs, cutting-edge Mixture-of-Experts (MoE) models, and the involvement of human experts. Our systematic investigation shows that current open-source MLLMs consistently exhibit poor capability on our association tasks; even the state-of-the-art GPT-4V(ision) shows a significant gap compared to humans. We believe our benchmark will pave the way for future MLLM studies. $\textit{Our data and code are available at:}$ https://mvig-rhos.com/llm_inception.