The evaluation of Large Language Models (LLMs) has recently emerged as a popular area of research. The three crucial questions for LLM evaluation are ``what, where, and how to evaluate''. Existing research, however, focuses mainly on the first two questions: which tasks to give an LLM during testing and what kind of knowledge it should handle. The third question, which covers the criteria to use, the types of evaluators, how to score, and how to rank, has received little discussion. In this paper, we analyze evaluation methods by comparing various criteria under both manual and automatic evaluation, using onsite annotators, crowd-sourced annotators, public annotators, and GPT-4, together with different scoring methods and ranking systems. We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. A total of 2,186 individuals participated, producing 243,337 manual annotations and 57,511 automatic evaluation results. We compare and analyze the different settings and draw 10 conclusions that can provide insights for evaluating LLMs in the future. The dataset and the results are publicly available at https://github.com/llmeval.