Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between MCQA evaluation and the generation of open-ended responses required in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro, and LLaMA-1/-2 in a two-player competitive format, with GPT-4 serving as the judge. Each LLM thereafter receives an Elo rating. This system is designed to mirror real-world usage, and for this purpose we have compiled a new benchmark called ``Real-world questions'' (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards such as AlpacaEval and MT-Bench. Our analysis demonstrates the stability of the RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
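For context, pairwise Elo rating systems of this kind typically follow the standard Elo update rule. The sketch below uses the conventional 400-point logistic scaling and a generic step size $K$; the specific constants and update schedule used by RWQ-Elo are not stated in this abstract and should be treated as assumptions here:
\[
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),
\]
where $R_A$ and $R_B$ are the two models' current ratings, $E_A$ is model $A$'s expected score, $S_A \in \{0, 0.5, 1\}$ is the outcome of the judged battle for model $A$ (loss, tie, win), and $K$ controls how strongly each result moves the rating.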