We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often lead to overfitting and shortcut learning, potentially degrading performance. This issue arises from an overemphasis on instruction-following abilities at the expense of proactively understanding visual information. Motivated by this, LIT adopts a simple yet effective approach: applying the loss function to both the instruction and response sequences. This seamlessly expands the training data and regularizes MLLMs against over-reliance on language priors. As a result, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT also attains strong fundamental visual capabilities, yielding up to an 18% improvement in captioning performance while simultaneously alleviating hallucination in MLLMs.
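To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of how label masking could differ between standard VIT and a LIT-style objective. It assumes a HuggingFace-style causal LM where labels set to -100 are ignored by the cross-entropy loss; the `build_labels` helper, the example token ids, and the image placeholder id 32000 are all illustrative assumptions.

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def build_labels(input_ids, instruction_len, image_token_mask, lit=True):
    """Build next-token-prediction labels for one training sequence.

    Standard visual instruction tuning masks the instruction tokens so the
    loss is computed only on the response. The LIT-style variant sketched
    here also keeps instruction tokens as prediction targets, while image
    placeholder tokens stay masked in both settings.
    """
    labels = input_ids.clone()
    labels[image_token_mask] = IGNORE_INDEX      # never predict image placeholders
    if not lit:
        labels[:instruction_len] = IGNORE_INDEX  # standard VIT: loss on response only
    return labels


# Hypothetical example: 4 instruction tokens (one is an image placeholder)
# followed by 3 response tokens.
ids = torch.tensor([101, 32000, 7, 8, 9, 10, 11])
img_mask = ids.eq(32000)
print(build_labels(ids, instruction_len=4, image_token_mask=img_mask, lit=True))
print(build_labels(ids, instruction_len=4, image_token_mask=img_mask, lit=False))
```

Under these assumptions, the only change from standard VIT is skipping the instruction-masking step, so the supervision covers both instruction and response tokens at negligible extra cost.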