Aligning language models with user intent requires large instruction datasets, which are available for only a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic setting for low-resource languages in which only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetic instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and on human preferences from 1,680 participants. Our results show that target-language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as backbone outperforms using a non-instructed base model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model approaches frontier models of much larger size for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation: https://github.com/hitz-zentroa/latxa-instruct