This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM) developed under low-resource constraints, with a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora are mixed and curated with English text to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged on their Irish text generation by a native speaker, a learner, and other LLMs. Google's Gemini-2.5-Pro ranks highest and is subsequently used to synthesise instruction-tuning and human-preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction-tuning dataset and a 1K human-preference dataset whose accepted and rejected responses show near-perfect alignment with a native Irish speaker's judgements. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification, and world knowledge, with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.