We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder, and uses image tiling and attention pooling to process arbitrary-resolution images with few visual tokens. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
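As an illustration of the leave-one-out protocol described above, the following minimal sketch builds one ablated training mixture per data category, each omitting exactly one category. The category labels and dataset names are hypothetical placeholders, not the actual jina-vlm training mixture, and the helper `leave_one_out` is an assumption introduced here for illustration only.

```python
from typing import Dict, List


def leave_one_out(mixture: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return one ablated mixture per category, each omitting that category's datasets."""
    ablations: Dict[str, List[str]] = {}
    for held_out in mixture:
        # Keep every dataset whose category is not the one being held out.
        ablations[held_out] = [
            dataset
            for category, datasets in mixture.items()
            if category != held_out
            for dataset in datasets
        ]
    return ablations


# Hypothetical categories and dataset names, for illustration only.
mixture = {
    "task:vqa": ["vqa_a", "vqa_b"],
    "task:ocr": ["ocr_a"],
    "language:multilingual": ["multi_a"],
}

ablations = leave_one_out(mixture)
for held_out, kept in ablations.items():
    print(f"without {held_out}: {kept}")
```

Training one model per ablated mixture and comparing each against the full-mixture baseline would then indicate whether the held-out category is necessary (a clear drop in performance) or redundant (little or no change).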