Benchmarks establish a standardized evaluation framework for systematically assessing the performance of large language models (LLMs), facilitating objective comparisons and driving advances in the field. However, existing benchmarks fail to differentiate question difficulty, which limits their ability to distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework that quantifies both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. Its core mechanism is bidirectional score propagation between models and questions: a model earns a competency score when it answers a question correctly, while a question's difficulty score increases when it stumps a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
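The bidirectional score propagation described above can be illustrated with a minimal sketch. The update rule below is an assumption for illustration only (a HITS-style mutual-reinforcement iteration over a binary response matrix); the names `propagate_scores`, `competency`, and `difficulty` are hypothetical, and the paper's actual update equations may differ.

```python
import numpy as np

def propagate_scores(responses: np.ndarray, n_iters: int = 50):
    """Illustrative bidirectional score propagation (not the paper's exact rule).

    responses: binary matrix of shape (n_models, n_questions);
    responses[i, j] = 1 if model i answered question j correctly, else 0.
    Returns per-model competency scores and per-question difficulty scores.
    """
    n_models, n_questions = responses.shape
    competency = np.ones(n_models)      # initial model competency scores
    difficulty = np.ones(n_questions)   # initial question difficulty scores

    for _ in range(n_iters):
        # A model earns competency in proportion to the difficulty of the
        # questions it answers correctly.
        competency = responses @ difficulty
        # A question gains difficulty in proportion to the competency of the
        # models that fail to answer it.
        difficulty = (1 - responses).T @ competency
        # Normalize so the iteration yields stable relative scores.
        competency /= competency.sum() + 1e-12
        difficulty /= difficulty.sum() + 1e-12

    return competency, difficulty
```

Under this sketch, hard questions boost the competency of the models that solve them, and strong models boost the difficulty of the questions that defeat them, mirroring the mutual-reinforcement intuition stated in the abstract.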