While Multimodal Large Language Models (MLLMs) are widely used for a variety of vision-language tasks, they sometimes misinterpret visual inputs or fail to follow textual instructions even in straightforward cases, leading to irrelevant responses, mistakes, and ungrounded claims. This behavior is analogous to agnosia, a phenomenon in neuropsychology characterized by an inability to correctly process sensory modalities and recognize things (e.g., objects, colors, relations). In this study, we adapt the concept to define "agnosia in MLLMs", and we aim to comprehensively evaluate and mitigate it. Inspired by the diagnosis and treatment process in neuropsychology, we propose a novel framework, EMMA (Evaluation and Mitigation of Multimodal Agnosia). EMMA comprises an evaluation module that automatically creates fine-grained and diverse visual question answering examples to assess the extent of agnosia in MLLMs, and a mitigation module that reduces agnosia through multimodal instruction tuning on fine-grained conversations. To verify the effectiveness of our framework, we evaluate and analyze agnosia in seven state-of-the-art MLLMs using 9K test samples. The results reveal that most of them exhibit agnosia across various aspects and to varying degrees. We further develop a fine-grained instruction set and tune MLLMs on it to mitigate agnosia, which leads to notable improvements in accuracy.