Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may ``game'' these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address this problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct ``closed-book'' evaluations of representative models released prior to Gaokao. Contrary to the prevailing consensus, even after data leakage and comprehensiveness are addressed, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across questions of varying difficulty, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers, as well as recurring mistake patterns. We find that these phenomena are well grounded in the motivations behind OpenAI o1, and that o1's reasoning-as-difficulty approach can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks, and highlight the need for more LLM-aligned difficulty analysis.
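To make the first discrepancy concrete, the following is a minimal sketch of the Rasch model referenced above, with illustrative ability and difficulty values chosen for demonstration (they are not the paper's fitted parameters). Under the Rasch model, a solver with a single latent ability should show accuracy that falls smoothly as item difficulty rises past that ability; flat accuracy across all difficulties is inconsistent with any one ability value on the logit scale.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch model: probability that a test-taker with latent ability
    theta answers an item of difficulty b correctly (both on the same
    logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Illustrative item difficulties spanning easy (-2) to hard (+2).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A human-like solver with ability 0.5: accuracy declines
# monotonically as difficulty increases.
human_like = [rasch_probability(0.5, b) for b in difficulties]
print([round(p, 2) for p in human_like])
```

A scoring pattern that stays near one accuracy level from the easiest to the hardest items, as the abstract attributes to the evaluated LLMs, cannot be produced by this curve for any single value of theta, which is what makes the Rasch fit a useful diagnostic.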