Over the past few years, one of the most notable advances in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As model size increases, LMs show improvements on measurable metrics along with the emergence of new qualitative capabilities. However, despite growing research attention and the rapid spread of LM applications, their capabilities, limitations, and associated risks are still not well understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models across 11 skill domains and is designed as a black-box test to prevent data leakage. The paper introduces a methodology for evaluating FMs and LMs in zero- and few-shot fixed-instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find that they still fall far short of human-level performance. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential societal drawbacks.