Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in vision-language understanding, with their vision encoders demonstrating strong high-level semantic alignment. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address this question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer trained with a novel multi-task framework for collaborative post-training, in which lightweight task heads provide multi-granularity supervision to optimize the vision backbone; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
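To make the collaborative post-training idea concrete, below is a minimal sketch of how a shared vision backbone might be optimized through lightweight task heads with supervision at multiple granularities (per-patch segmentation and depth, plus image-level alignment). All module names, head designs, and the loss weighting here are illustrative assumptions for exposition, not the paper's actual implementation.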
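```python
# Sketch: multi-task post-training with a shared vision backbone and
# lightweight task heads. All names and designs are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightSegHead(nn.Module):
    """Linear head producing per-patch class logits (pixel-level supervision)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):           # (B, N, dim)
        return self.proj(patch_tokens)          # (B, N, num_classes)

class LightweightDepthHead(nn.Module):
    """Linear head regressing per-patch depth (dense supervision)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, patch_tokens):
        return self.proj(patch_tokens).squeeze(-1)   # (B, N)

class GlobalAlignHead(nn.Module):
    """Projection head for image-level (semantic) supervision."""
    def __init__(self, dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, embed_dim)

    def forward(self, cls_token):               # (B, dim)
        return F.normalize(self.proj(cls_token), dim=-1)

def multi_task_loss(backbone, heads, batch, weights=(1.0, 1.0, 1.0)):
    """Combine losses at multiple granularities to update the shared backbone."""
    # Assume the backbone returns (B, 1 + N, dim): a [CLS] token plus N patch tokens.
    tokens = backbone(batch["image"])
    cls_tok, patch_tok = tokens[:, 0], tokens[:, 1:]

    seg_loss = F.cross_entropy(
        heads["seg"](patch_tok).flatten(0, 1), batch["seg_labels"].flatten()
    )
    depth_loss = F.l1_loss(heads["depth"](patch_tok), batch["depth"])
    # Cosine-style alignment against precomputed text embeddings (hypothetical).
    align_loss = 1.0 - (heads["align"](cls_tok) * batch["text_embed"]).sum(-1).mean()

    w_seg, w_depth, w_align = weights
    return w_seg * seg_loss + w_depth * depth_loss + w_align * align_loss
```
Because each head is a small projection, the gradients from all tasks flow back into the same backbone, which is the sense in which the post-training is "collaborative" across granularities.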