The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted. We introduce a new benchmark for evaluating LLMs in Estonian, built on seven diverse datasets. These datasets assess general and domain-specific knowledge, understanding of Estonian grammar and vocabulary, summarization ability, and contextual comprehension. All datasets were created from native Estonian sources without machine translation. We compare base models, instruction-tuned open-source models, and commercial models: 6 base models and 26 instruction-tuned models in total. To assess the results, we employ both human evaluation and LLM-as-a-judge methods. Human evaluation scores showed moderate to high correlation with benchmark evaluations, depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated strong alignment with human ratings, indicating that top-performing LLMs can effectively support the evaluation of Estonian-language models.