Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between MCQA evaluation and the generation of open-ended responses required in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro, and LLaMA-1/-2 in a two-player competitive format, with GPT-4 serving as the judge. Each LLM thereafter receives an Elo rating. This system is designed to mirror real-world usage, and for this purpose we have compiled a new benchmark called ``Real-world questions'' (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards such as AlpacaEval and MT-Bench. Our analysis demonstrates the stability of the RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
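For context, pairwise Elo rating systems of this kind typically follow the standard Elo update rule. The sketch below uses the conventional 400-point logistic scaling and a generic step size $K$; the specific constants and update schedule used by RWQ-Elo are not stated in this abstract and should be treated as assumptions here:
\[
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),
\]
where $R_A$ and $R_B$ are the two models' current ratings, $E_A$ is model $A$'s expected score, $S_A \in \{0, 0.5, 1\}$ is the outcome of the judged battle for model $A$ (loss, tie, win), and $K$ controls how strongly each result moves the rating.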