Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric tasks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
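To make the described setup concrete, the following is a minimal sketch of a JEPA-style auxiliary objective in which a frozen vision foundation model provides both context and target patch embeddings and the early layers of an LLM act as the trainable predictor. It assumes a PyTorch-style interface; all module names, projections, and shapes (e.g., `JepaStyleObjective`, `in_proj`, `llm_dim`) are illustrative assumptions and do not come from the JARVIS codebase.

```python
# Illustrative sketch of a JEPA-style objective for MLLM alignment.
# All names, shapes, and design details here are assumptions for illustration,
# not the actual JARVIS implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JepaStyleObjective(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm_early_layers: nn.Module,
                 vis_dim: int, llm_dim: int):
        super().__init__()
        # Frozen vision foundation model used for both context and target encodings.
        self.vision_encoder = vision_encoder.eval()
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)
        # Predictor: the early transformer layers of the LLM (kept trainable).
        self.predictor = llm_early_layers
        # Simple linear projections between vision and LLM embedding spaces (assumed).
        self.in_proj = nn.Linear(vis_dim, llm_dim)
        self.out_proj = nn.Linear(llm_dim, vis_dim)

    def forward(self, images: torch.Tensor, context_mask: torch.Tensor,
                target_mask: torch.Tensor) -> torch.Tensor:
        """images: (B, 3, H, W); masks: (B, N) booleans over N patch tokens."""
        with torch.no_grad():
            patch_emb = self.vision_encoder(images)  # (B, N, vis_dim), frozen targets
        # Context: zero out the masked patches (a simplification of block masking).
        context = patch_emb * context_mask.unsqueeze(-1)
        # Predict representations of the target patches with the early LLM layers.
        pred = self.out_proj(self.predictor(self.in_proj(context)))  # (B, N, vis_dim)
        # Regress predictions onto the frozen target embeddings on masked positions only.
        mask = target_mask.unsqueeze(-1).float()
        loss = F.smooth_l1_loss(pred, patch_emb.detach(), reduction="none")
        return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```

In this reading of the abstract, such a loss would be added alongside the usual language-supervised alignment objective during the vision-language alignment stage, so the LLM's early layers receive a purely visual, self-supervised training signal in addition to text supervision.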