Following the success of GPT-4, interest in multimodal large language model (MLLM) research has surged. This line of research focuses on developing general-purpose LLMs by fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon in which the fine-tuned model fails to retain the performance of the pre-trained model, remains an inherent problem in MLLMs. In this paper, we introduce EMT (Evaluating MulTimodality), a method for evaluating catastrophic forgetting in MLLMs by treating each MLLM as an image classifier. We first apply EMT to several open-source fine-tuned MLLMs and find that almost all of them fail to retain the performance of their vision encoders on standard image classification tasks. We then continue fine-tuning LLaVA, an MLLM, and use EMT to assess its performance throughout fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on one image dataset improves performance on other image datasets by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLM begins to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks, and that the current MLLM fine-tuning procedure still has room for improvement.
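The idea of treating an MLLM as an image classifier can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the EMT procedure itself: the prompt wording, the `generate(image, prompt)` callable standing in for a model, and the scoring rule (an answer counts as a prediction only if it names exactly one known class) are all hypothetical.

```python
import re
from typing import Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> str:
    # Hypothetical prompt wording; the abstract does not specify one.
    return (
        "What is the object in the image? "
        "Answer with exactly one of: " + ", ".join(class_names) + "."
    )


def predict_label(answer: str, class_names: Sequence[str]) -> Optional[str]:
    """Map free-form model text to a class name.

    Returns the class only when exactly one class name occurs as a whole
    word or phrase in the answer; ambiguous or off-list answers return None.
    """
    hits = [
        name
        for name in class_names
        if re.search(r"\b" + re.escape(name) + r"\b", answer, re.IGNORECASE)
    ]
    return hits[0] if len(hits) == 1 else None


def classification_accuracy(
    generate: Callable[[object, str], str],  # assumed model interface
    samples: Iterable[Tuple[object, str]],   # (image, ground-truth label) pairs
    class_names: Sequence[str],
) -> float:
    """Fraction of samples whose generated answer names the true class."""
    prompt = build_prompt(class_names)
    correct = 0
    total = 0
    for image, label in samples:
        prediction = predict_label(generate(image, prompt), class_names)
        correct += int(prediction == label)
        total += 1
    return correct / total if total else 0.0
```

Under these assumptions, running `classification_accuracy` on the same labeled images for a fine-tuned MLLM and comparing the result with the accuracy of its vision encoder gives one number per model, which is the kind of comparison the abstract describes. Simple string matching is a deliberately crude scoring choice; it penalizes verbose or hedged answers, so a stricter or more lenient judge may be preferable in practice.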