High-resolution Large Multimodal Models (LMMs) face two challenges: an excessive number of visual tokens and visual computation that scales quadratically. Current high-resolution LMMs address the quadratic complexity but still generate excessive visual tokens. The redundancy in visual tokens, however, is the key problem, as it leads to substantially more computation. To mitigate this issue, we propose ConvLLaVA, which replaces the Vision Transformer (ViT) with ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM. ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. First, since ConvNeXt pretrained at low resolution underperforms when applied directly to high-resolution inputs, we update its weights to bridge the gap. Second, since ConvNeXt's original compression ratio is inadequate for much higher-resolution inputs, we train an additional successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support 1536x1536 inputs while generating only 576 visual tokens, and to handle images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves performance competitive with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series is publicly available at https://github.com/alibaba/conv-llava.
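The token budget quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the ConvLLaVA code base. It assumes that the number of visual tokens equals the number of cells in the final feature map, that standard ConvNeXt downsamples each side by 32x (a 4x stem followed by three 2x downsampling layers), and that the additional stage contributes one more 2x step, for 64x in total. The 64x figure is inferred from the numbers in the abstract (1536 / 64 = 24 and 24 x 24 = 576). The helper `visual_token_count` is a hypothetical name, and rounding up partial cells is an assumption about padding.

```python
import math


def visual_token_count(height: int, width: int, stride: int) -> int:
    """Count the cells of a feature map downsampled by `stride` on each side.

    Hypothetical helper: partial cells are rounded up, which is an
    assumption about how the encoder pads inputs that are not an exact
    multiple of the stride.
    """
    if height <= 0 or width <= 0 or stride <= 0:
        raise ValueError("height, width and stride must be positive")
    rows = math.ceil(height / stride)
    cols = math.ceil(width / stride)
    return rows * cols


# Standard four-stage ConvNeXt: 4x stem, then three 2x downsampling layers.
CONVNEXT_STRIDE = 4 * 2 * 2 * 2
# Assumed: the extra compression stage adds one more 2x downsampling step.
COMPRESSED_STRIDE = CONVNEXT_STRIDE * 2

if __name__ == "__main__":
    for stride in (CONVNEXT_STRIDE, COMPRESSED_STRIDE):
        tokens = visual_token_count(1536, 1536, stride)
        print(f"stride {stride}: {tokens} visual tokens")
```

Under these assumptions, a 1536x1536 input yields 2304 tokens at 32x and 576 tokens at 64x, and a non-square 768x1536 input yields 288 tokens at 64x, since the count depends only on the feature-map grid.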