As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency on linguistics-related tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of generated pt-PT text. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools for pt-PT.