Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, yet these models often default to generating English responses regardless of the input language. This phenomenon, termed Image-induced Fidelity Loss (IFL), stems from limited multimodal multilingual training data. To address it, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degrading visual performance. We also explore model merging, which improves language fidelity but at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without such trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
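To make the core idea concrete, the following is a minimal sketch of what injecting text-only multilingual data during visual instruction tuning could look like at the batch level. The mixing ratio, dataset contents, and the `Example`/`build_mixed_batches` helpers are illustrative assumptions for exposition, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    prompt: str
    response: str
    image_path: Optional[str] = None  # None marks a text-only multilingual sample


def build_mixed_batches(vision_data, text_multilingual_data,
                        text_ratio=0.2, batch_size=8, seed=0):
    """Interleave text-only multilingual examples into visual instruction batches.

    `text_ratio` controls the fraction of each batch drawn from the text-only
    multilingual pool; it is an assumed hyperparameter, not a reported value.
    """
    rng = random.Random(seed)
    n_text = max(1, int(batch_size * text_ratio))
    n_vision = batch_size - n_text
    batches = []
    for start in range(0, len(vision_data) - n_vision + 1, n_vision):
        batch = vision_data[start:start + n_vision]          # visual instruction samples
        batch += rng.sample(text_multilingual_data, k=n_text)  # injected text-only samples
        rng.shuffle(batch)
        batches.append(batch)
    return batches


if __name__ == "__main__":
    vision = [Example(f"Describe the image ({i})", "...", f"img_{i}.jpg") for i in range(32)]
    text_ml = [Example(f"Frage {i}: Was ist ein VLM?", "Ein VLM ist ...") for i in range(16)]
    for batch in build_mixed_batches(vision, text_ml)[:2]:
        print(["text" if ex.image_path is None else "vision" for ex in batch])
```

The point of the sketch is simply that every optimization step continues to see some purely textual, non-English supervision, which is what preserves the backbone's multilingual behaviour during visual tuning.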
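The model-merging baseline mentioned above is commonly realized as weight-space interpolation between the original multilingual language model and the visually tuned backbone. The sketch below shows one such formulation; the interpolation coefficient `alpha` and the assumption of matching parameter names are ours, not the paper's reported configuration.

```python
import torch


def merge_state_dicts(base_sd, tuned_sd, alpha=0.5):
    """Return a weight-space average: (1 - alpha) * base + alpha * tuned."""
    merged = {}
    for name, base_param in base_sd.items():
        tuned_param = tuned_sd[name]
        merged[name] = (1.0 - alpha) * base_param + alpha * tuned_param
    return merged


if __name__ == "__main__":
    # Toy demonstration with a tiny linear layer standing in for the LLM backbone.
    base = torch.nn.Linear(4, 4)    # original multilingual weights
    tuned = torch.nn.Linear(4, 4)   # visually instruction-tuned weights
    merged_sd = merge_state_dicts(base.state_dict(), tuned.state_dict(), alpha=0.3)
    model = torch.nn.Linear(4, 4)
    model.load_state_dict(merged_sd)
    print({k: tuple(v.shape) for k, v in merged_sd.items()})
```

As the abstract notes, this kind of merging can recover language fidelity but trades away some visual performance, which is why the data-injection strategy is the preferred method.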