As Chinese Large Language Models (LLMs) continue to emerge, evaluating a model's capabilities has become an increasingly important problem. Three issues pose major challenges to the current evaluation of Chinese LLMs: the absence of a comprehensive Chinese benchmark that thoroughly assesses model performance, prompting procedures that are neither standardized nor comparable across evaluations, and the prevalent risk of contamination. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions and regularly updates a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. With an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 Chinese LLMs have validated CLEVA's efficacy.
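The abstract does not spell out how the sampling strategy works, so the following is only a minimal sketch of one way to give each leaderboard round a unique subset. It assumes "unique" means non-overlapping and that the data pool is fixed in advance; the function name `round_subsets`, its parameters, and the toy pool are hypothetical and are not taken from CLEVA.

```python
import random


def round_subsets(pool, num_rounds, subset_size, seed=0):
    """Split `pool` into `num_rounds` disjoint subsets of `subset_size` items.

    Hypothetical helper: an illustration of non-overlapping sampling,
    not CLEVA's actual implementation.
    """
    if num_rounds * subset_size > len(pool):
        raise ValueError("pool is too small for the requested rounds")
    shuffled = list(pool)
    # A seeded shuffle keeps the partition reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Consecutive slices of one shuffled list can never share an item.
    return [
        shuffled[i * subset_size:(i + 1) * subset_size]
        for i in range(num_rounds)
    ]


# Toy pool standing in for a curated test set (assumed data).
pool = [f"example-{i}" for i in range(100)]
subsets = round_subsets(pool, num_rounds=4, subset_size=20)
```

Slicing a single shuffled list is the simplest way to rule out overlap between rounds; the trade-off is that the pool must be large enough to cover every planned round, which is why the sketch raises an error otherwise.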