We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible-length context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the contextual information of an image. We demonstrate that, with additional pre-training data, VoRA performs comparably to conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
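The zero-overhead merge claim rests on the standard LoRA identity: a trained low-rank update can be folded into the base weight so the adapted model has exactly the original architecture at inference. A minimal NumPy sketch (all names hypothetical, not the VoRA implementation):

```python
import numpy as np

def merge_lora(W, A, B, alpha=1.0):
    """Fold a trained LoRA update into the base weight: W' = W + alpha * (B @ A).

    W: (d_out, d_in) frozen base weight; A: (r, d_in), B: (d_out, r) low-rank
    factors with rank r << min(d_out, d_in). After merging, inference uses a
    single dense matmul with no extra adapter branch.
    """
    return W + alpha * (B @ A)

# Toy base weight and rank-4 LoRA factors.
d_out, d_in, r = 8, 8, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

W_merged = merge_lora(W, A, B)

# The merged forward pass equals the base path plus the adapter path.
x = rng.standard_normal(d_in)
assert np.allclose(W_merged @ x, W @ x + B @ (A @ x))
```

Because `W_merged` has the same shape as `W`, the vision-specific parameters add no layers, branches, or latency to the deployed LLM.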