Recently, various evaluation datasets for Large Language Models (LLMs) have emerged, but most of them suffer from distorted rankings and make it difficult to analyze model capabilities. To address these concerns, this paper introduces ANGO, a Chinese multi-choice question evaluation benchmark. ANGO is the first to propose a Keypoint categorization standard: each question in ANGO can correspond to multiple keypoints, which effectively enhances the interpretability of evaluation results. Based on the performance of real humans, we build a quantifiable question difficulty standard and divide ANGO questions into 9 difficulty levels, which provide more precise guidance for model training. To minimize the impact of data leakage and fully leverage ANGO's innovative features, we have engineered exclusive sampling strategies and a new evaluation framework that support swift test set iteration. Our experiments demonstrate that ANGO poses a stronger challenge to models and reveals more details in evaluation results than existing benchmarks.