Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. During multimodal instruction tuning, however, these models often exhibit pronounced modality imbalance: language gradients dominate optimization and degrade image generation quality, especially under parameter-efficient fine-tuning methods such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that, relative to unimodal counterparts, performance on the vision modality degrades substantially more than performance on the text modality, and that modality-specific gradients can differ by orders of magnitude across tasks and layers. Motivated by this observation, we reformulate multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves the balance of multimodal generation, achieving gains of up to 44.9% in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
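To make the bi-objective view concrete, the following is a minimal sketch of one standard way to combine two task gradients: the closed-form two-objective min-norm (MGDA-style) weighting. It is an illustrative assumption only. The abstract does not specify the Pareto LoRA update rule, and the function names `min_norm_weight` and `combine`, as well as the toy gradient values, are hypothetical.

```python
# Sketch (assumption, not the Pareto LoRA algorithm): two-objective
# min-norm gradient combination on plain Python lists.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_weight(g_text, g_image):
    """Return alpha in [0, 1] minimizing ||alpha*g_text + (1-alpha)*g_image||^2."""
    diff = [t - i for t, i in zip(g_text, g_image)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any convex weight gives the same direction.
        return 0.5
    # Unconstrained minimizer, then clipped to the valid range [0, 1].
    alpha = dot([i - t for t, i in zip(g_text, g_image)], g_image) / denom
    return min(1.0, max(0.0, alpha))


def combine(g_text, g_image):
    """Return (alpha, d) with d = alpha*g_text + (1-alpha)*g_image."""
    alpha = min_norm_weight(g_text, g_image)
    d = [alpha * t + (1.0 - alpha) * i for t, i in zip(g_text, g_image)]
    return alpha, d


# Toy gradients whose norms differ by two orders of magnitude (hypothetical values).
g_text = [100.0, 0.0]
g_image = [0.0, 1.0]
alpha, d = combine(g_text, g_image)
```

In this toy case a plain sum of the two gradients would be almost entirely the text gradient, whereas the min-norm weighting shrinks the text contribution so that the combined direction `d` has a positive inner product with both gradients. Descending along `d` therefore decreases both objectives to first order.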