Recent advances in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across a wide range of applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thereby broadening their usability and effectiveness. We address this challenge by introducing a structured benchmark based on the INVALSI tests, a set of well-established assessments that measure educational competencies across Italy. Our study makes three primary contributions. First, we adapt the INVALSI benchmark for automated LLM evaluation, rigorously reworking the test format to suit automated processing while preserving the essence of the original tests. Second, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community. Finally, we visually compare the performance of these models against human results. Additionally, researchers are invited to submit their models for ongoing evaluation, ensuring that the benchmark remains a current and valuable resource.