Hundreds of benchmarks dedicated to evaluating large models have been presented over the past few years. However, most of them remain closed-ended and are prone to overfitting due to potential data contamination. Moreover, the increasing scale and scope of current benchmarks, their transient metrics, and their heavily human-dependent curation procedures pose significant challenges for timely maintenance and adaptation. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define new metrics to quantify performance longitudinally. MACEval employs an interactive and autonomous evaluation mode, utilizing role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 23 large models demonstrate the effectiveness of MACEval, which also streamlines the evaluation process and substantially reduces overhead. We hope that MACEval can broaden future directions of large model evaluation. Project page: https://github.com/zijianchen98/MACEval.