Qomhra：一个爱尔兰语与英语双语大语言模型 (Qomhra: A Bilingual Irish and English Large Language Model)

Large language model (LLM) research and development has overwhelmingly focused on the world's major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces \textbf{Qomhrá}, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate ``accepted'' and ``rejected'' responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2 Irish-speakers, diverging from LLM-as-a-judge ratings, indicating a misalignment between current LLMs and the Irish-language community. Subsequently, we leverage Gemini-2.5-Pro to translate a large scale English-language instruction tuning dataset to Irish and to synthesise a first-of-its-kind Irish-language human preference dataset. We comprehensively evaluate Qomhrá across several benchmarks, testing translation, gender understanding, topic identification, and world knowledge; these evaluations show gains of up to 29\% in Irish and 44\% in English compared to the existing open-source Irish LLM baseline, UCCIX. The results of our framework provide insight and guidance to developing LLMs for both Irish and other low-resource languages.

翻译：大语言模型（LLM）的研究与开发主要集中在世界主要语言上，导致如爱尔兰语等低资源语言代表性不足。本文介绍了 **Qomhrá**，一个在极低资源约束下开发的爱尔兰语与英语双语LLM。我们概述了一个完整的流程，涵盖双语持续预训练、指令微调以及为未来对齐训练合成人类偏好数据。针对缺乏可扩展方法创建人类偏好数据的问题，我们提出了一种新颖的合成方法：通过提示一个LLM生成“接受”和“拒绝”的响应，我们验证了该方法与爱尔兰语母语者的偏好一致。为选择用于合成的LLM，我们评估了顶尖闭源权重LLM在爱尔兰语生成任务上的性能。根据爱尔兰语母语者与第二语言使用者的评价，Gemini-2.5-Pro 排名最高，这与LLM-as-a-judge的评分存在差异，表明当前LLM与爱尔兰语社区之间存在偏差。随后，我们利用 Gemini-2.5-Pro 将大规模英语指令微调数据集翻译为爱尔兰语，并合成了首个爱尔兰语人类偏好数据集。我们在多个基准测试中对 Qomhrá 进行了全面评估，包括翻译、性别理解、主题识别和世界知识；这些评估显示，与现有的开源爱尔兰语LLM基线 UCCIX 相比，Qomhrá 在爱尔兰语任务上性能提升最高达29%，在英语任务上提升最高达44%。我们的框架结果为开发爱尔兰语及其他低资源语言的LLM提供了见解与指导。