Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination stemming from the lack of access to proprietary training data. To address this issue, we present C$^2$LEVA, a comprehensive bilingual benchmark with systematic contamination prevention. C$^2$LEVA offers, first, a holistic evaluation spanning 22 tasks, each targeting a specific application or ability of LLMs, and, second, a trustworthy assessment built on contamination-free tasks, guaranteed by a systematic contamination-prevention strategy that fully automates test-data renewal and enforces data protection during benchmark release. Our large-scale evaluation of 15 open-source and proprietary models demonstrates the effectiveness of C$^2$LEVA.