Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and a fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow that yields a certified benchmark. In Stage I, each item undergoes binary validation of the problem statement and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints that preserve the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30--40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified