Generative AI models perform impressively on many Natural Language Processing tasks, such as language understanding, reasoning, and language generation. One of the most important questions facing the AI community today concerns the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies of generative Large Language Models (LLMs) are restricted to English, and it is unclear how capable these models are at understanding and generating other languages. We present MEGA, the first comprehensive benchmarking of generative LLMs, which evaluates models on standard NLP benchmarks covering 8 diverse tasks and 33 typologically diverse languages. We also compare the performance of generative LLMs to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform relative to the previous generation of LLMs. We present a thorough analysis of model performance across languages and discuss some of the reasons why generative LLMs are currently not optimal for all languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.