Over the past few years, one of the most notable advances in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As model size increases, LMs show improvements on measurable metrics along with the emergence of new qualitative capabilities. However, despite growing research attention and the rapid spread of LM applications, their capabilities, limitations, and associated risks are still not well understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models across 11 skill domains and is designed as a black-box test to prevent data leakage. The paper introduces a methodology for evaluating FMs and LMs in zero- and few-shot fixed-instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find that they still fall far short of human-level performance. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential societal drawbacks.