The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about whether benchmarks accurately reflect genuine model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT yields accurate and reliable estimates of item characteristics and model abilities. Based on PSN-IRT, we conduct an extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks that retain stronger alignment with human preference.
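For context, the classical IRT machinery that PSN-IRT builds on can be summarized by the four-parameter logistic (4PL) item response function; the abstract does not state PSN-IRT's exact parameterization, so the following is a standard IRT formulation rather than the paper's specific model:

$$
P(u_{ij} = 1 \mid \theta_i) \;=\; c_j + \frac{d_j - c_j}{1 + \exp\!\big(-a_j(\theta_i - b_j)\big)},
$$

where $u_{ij}$ indicates whether model $i$ answers item $j$ correctly, $\theta_i$ is the latent ability of model $i$, and the item parameters are $a_j$ (discrimination), $b_j$ (difficulty), $c_j$ (guessing, the lower asymptote), and $d_j$ (feasibility, the upper asymptote). A "rich set of item parameters" in this sense allows each benchmark item to be scored not just on how hard it is, but on how well it separates models of differing ability.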