Current Multimodal Large Language Models (MLLMs) exhibit strong performance across a range of demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque instruction-tuned backbone LLM is not required to obtain a strong MLLM in Basque. By openly releasing our resources, we pave the way for the development of MLLMs for other low-resource languages.