The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel monolingual instruction-tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts using DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. Following generation, we apply a quality assurance process employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs. To investigate the practical utility of the dataset, we fine-tune 14 smaller-scale LLMs ($\leq$15B parameters) on LuxIT and evaluate them on standardized Luxembourgish proficiency exams and five downstream NLP tasks. Training on LuxIT yields a mean accuracy change of +5.37 percentage points on the language exams across all 14 models, with 12 of the 14 showing improvement. On the downstream NLP tasks, 9 of 14 models improve in macro-averaged F1, though gains on the two benchmarks do not systematically correlate. These results underscore the feasibility of leveraging monolingual synthetic data to improve LLM capabilities in low-resource languages, while highlighting the multi-faceted nature of language proficiency.