Recent advances in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across a wide range of applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thereby broadening their usability and effectiveness. We address this challenge by introducing a structured benchmark based on the INVALSI tests, a set of well-established assessments that measure educational competencies across Italy. Our study makes three primary contributions. First, we adapt the INVALSI benchmark for automated LLM evaluation, rigorously reworking the test format to suit automated processing while preserving the essence of the original tests. Second, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community. Finally, we visually compare the performance of these models against human results. Additionally, researchers are invited to submit their models for ongoing evaluation, ensuring that the benchmark remains a current and valuable resource.