The unprecedented performance of large language models (LLMs) calls for comprehensive and accurate evaluation. We argue that benchmarks for LLM evaluation need to be both comprehensive and systematic. To this end, we propose the ZhuJiu benchmark, which has the following strengths: (1) Multi-dimensional ability coverage: We evaluate LLMs across 7 ability dimensions covering 51 tasks. In particular, we also propose a new benchmark that focuses on the knowledge ability of LLMs. (2) Multi-faceted collaboration of evaluation methods: We use 3 different yet complementary evaluation methods, which helps ensure the authority and accuracy of the evaluation results. (3) Comprehensive Chinese benchmark: ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation in English. (4) Avoiding potential data leakage: To avoid data leakage, we construct evaluation data specifically for 37 tasks. We evaluate 10 current mainstream LLMs and conduct an in-depth discussion and analysis of their results. The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/, and we also provide a demo video at https://youtu.be/qypkJ89L1Ic.