Following the success of GPT-4, there has been a surge of interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs by fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon in which the fine-tuned model fails to retain performance comparable to that of the pre-trained model, remains an inherent problem in MLLMs. In this paper, we introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs by treating each MLLM as an image classifier. We first apply EMT to several open-source fine-tuned MLLMs and find that almost all of them fail to retain the performance of their vision encoders on standard image classification tasks. We then continue fine-tuning LLaVA, an MLLM, and use EMT to assess its performance throughout fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on one image dataset improves performance on other image datasets by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to match their vision models on standard image classification tasks and that the current MLLM fine-tuning procedure still has room for improvement.
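The idea of treating an MLLM as an image classifier can be pictured with a minimal Python sketch. This is not the paper's EMT implementation: `generate` is a hypothetical stand-in for any MLLM inference call, the prompt wording is an assumption, and scoring answers by whole-word label matching is an illustrative choice made here to keep the example self-contained.

```python
import re
from typing import Any, Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> str:
    """Build a classification prompt (the wording is an assumption)."""
    return (
        "What is the object in the image? "
        "Answer with a single label from: " + ", ".join(class_names) + "."
    )


def predicted_label(answer: str, class_names: Sequence[str]) -> Optional[str]:
    """Return the one class name mentioned in the answer.

    Returns None when the answer mentions no label or more than one,
    so ambiguous or off-topic answers are counted as wrong.
    """
    text = answer.lower()
    hits = [
        name
        for name in class_names
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
    ]
    return hits[0] if len(hits) == 1 else None


def classification_accuracy(
    generate: Callable[[Any, str], str],  # hypothetical MLLM call: (image, prompt) -> text
    samples: Iterable[Tuple[Any, str]],   # (image, true_label) pairs
    class_names: Sequence[str],
) -> float:
    """Fraction of samples whose generated answer names only the true label."""
    samples = list(samples)
    if not samples:
        raise ValueError("samples must not be empty")
    prompt = build_prompt(class_names)
    correct = sum(
        predicted_label(generate(image, prompt), class_names) == label
        for image, label in samples
    )
    return correct / len(samples)
```

Under these assumptions, running the same function on a model's checkpoints before and after fine-tuning, and on several datasets, would give accuracy numbers that can be compared to look for forgetting.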