Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear whether this adaptation compromises their original language reasoning capabilities. In this work, we explore the effects of multimodal instruction tuning on language reasoning performance. We focus on LLaVA, a leading multimodal framework that integrates LLMs such as Vicuna or Mistral with the CLIP vision encoder. We compare the performance of the original LLMs with their multimodal-adapted counterparts across eight language reasoning tasks. Our experiments yield several key insights. First, the impact of multimodal instruction tuning differs between Vicuna and Mistral: across most tasks, we observe a degradation in language reasoning for Mistral but improvements for Vicuna. Second, while multimodal instruction tuning consistently degrades performance on mathematical reasoning tasks (e.g., GSM8K), it enhances performance on commonsense reasoning tasks (e.g., CommonsenseQA). Finally, we demonstrate that a training-free model merging technique can effectively mitigate the language reasoning degradation observed in multimodal-adapted Mistral and even improve performance on visual tasks.
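To make the final point concrete: one common training-free merging strategy is linear weight interpolation between checkpoints. The sketch below illustrates that idea under the assumption that the merge is a simple convex combination of the original LLM's weights and its multimodal-adapted counterpart's weights; the file paths, the helper `merge_state_dicts`, and the coefficient `alpha` are illustrative placeholders, not the specific recipe evaluated in this work.

```python
# Minimal sketch of training-free model merging via linear weight
# interpolation. Assumes both checkpoints share the language model's
# parameter names and shapes; paths below are hypothetical.
import torch

def merge_state_dicts(base_sd: dict, adapted_sd: dict, alpha: float = 0.5) -> dict:
    """Return (1 - alpha) * base + alpha * adapted for every shared tensor."""
    merged = {}
    for name, base_param in base_sd.items():
        adapted_param = adapted_sd.get(name)
        if adapted_param is not None and adapted_param.shape == base_param.shape:
            merged[name] = (1 - alpha) * base_param + alpha * adapted_param
        else:
            # Keep the base weights for keys the adapted checkpoint does not
            # share (e.g., parameters specific to the multimodal projector).
            merged[name] = base_param
    return merged

# Placeholder checkpoint paths for the original and multimodal-adapted LLM.
base_sd = torch.load("mistral-7b/pytorch_model.bin", map_location="cpu")
adapted_sd = torch.load("llava-mistral-7b/pytorch_model.bin", map_location="cpu")
torch.save(merge_state_dicts(base_sd, adapted_sd, alpha=0.5), "merged_model.bin")
```

Because the merge is a pure weight-space operation, it requires no gradient updates or additional data, which is what makes it attractive as a post-hoc fix for the regression observed in the adapted model.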