Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation, with minimal manual intervention in the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as judged by human annotators on linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.