Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervised examples that cannot be answered correctly without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded self-supervised learning (SSL) tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code is available at: https://github.com/sirkosophia/V-GIFT
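To make the reformulation concrete, the following is a minimal sketch of how one pretext task, rotation prediction, might be cast as an image-instruction-response triplet and mixed into an instruction-tuning set. It is an illustration under stated assumptions, not the released implementation: the prompt wording, the four-way answer set, the function names (`rotate_cw`, `make_rotation_sample`, `mix_in`), and the reading of "fraction" as a share of the final mixture are all hypothetical. Images are represented as plain 2D lists so that the example has no dependencies.

```python
import random

# Assumed answer set and instruction wording; the abstract does not specify either.
ROTATIONS = (0, 90, 180, 270)
PROMPT = (
    "The image has been rotated clockwise by 0, 90, 180, or 270 degrees. "
    "Which rotation was applied? Answer with the number only."
)


def rotate_cw(pixels, degrees):
    """Rotate a row-major 2D grid clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("degrees must be a multiple of 90")
    pixels = [list(row) for row in pixels]  # copy so the input is left untouched
    for _ in range((degrees // 90) % 4):
        # Reversing the row order and transposing is one 90-degree clockwise turn.
        pixels = [list(row) for row in zip(*pixels[::-1])]
    return pixels


def make_rotation_sample(pixels, rng):
    """Build one triplet whose label comes from the transformation itself."""
    degrees = rng.choice(ROTATIONS)
    return {
        "image": rotate_cw(pixels, degrees),
        "instruction": PROMPT,
        "response": str(degrees),
    }


def mix_in(base_samples, ssl_samples, fraction, rng):
    """Add SSL samples so they form roughly `fraction` of the final mixture.

    Assumption: `fraction` is measured against the mixed set, so
    n_ssl / (n_base + n_ssl) = fraction.
    """
    n_ssl = round(fraction * len(base_samples) / (1.0 - fraction))
    n_ssl = min(n_ssl, len(ssl_samples))
    mixed = list(base_samples) + rng.sample(ssl_samples, n_ssl)
    rng.shuffle(mixed)
    return mixed
```

Because the label is generated by the transformation, no human annotation is involved, and the instruction text is identical for every sample, so the correct answer can only be read off the image. Color matching and cross-view correspondence would presumably follow the same pattern with a different transformation and prompt.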