Large Language Models (LLMs) have become an increasingly important tool in research and in society at large. While LLMs are regularly used worldwide by experts and laypeople alike, they are predominantly developed with English-speaking users in mind, performing well in English and other widespread languages, whereas less-resourced languages such as Luxembourgish are treated as a lower priority. This lack of attention is also reflected in the scarcity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as evaluation tools for Luxembourgish. We find that large models such as ChatGPT, Claude, and DeepSeek-R1 typically achieve high scores, while smaller models perform poorly. We also find that performance on such language exams can be used to predict performance on other NLP tasks.