Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and a fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow that yields a certified benchmark. In Stage I, each item undergoes binary validation of the problem statement and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints that preserve the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30--40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified