XpertBench: Expert-Level Tasks with Rubric-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu, Shanshan Wu, Qi Zhao, Wenhao Huang

As Large Language Models (LLMs) plateau on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency on the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts (including researchers from elite institutions and practitioners with extensive clinical or industrial experience), ensuring high ecological validity. Each task is scored against a detailed rubric, most with 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
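
To make the idea of a rubric with weighted checkpoints concrete, the following is a minimal sketch of one plausible way to turn per-checkpoint judgments into a task score. The abstract does not give the actual aggregation formula, so the normalized weighted sum used here is an assumption, and the names `Checkpoint` and `rubric_score` are hypothetical and not taken from XpertBench.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Checkpoint:
    """One rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float


def rubric_score(checkpoints: Sequence[Checkpoint], passed: Sequence[bool]) -> float:
    """Return the fraction of total rubric weight earned, in [0, 1].

    Assumption: the task score is a normalized weighted sum of binary
    pass/fail judgments. The benchmark's real scoring rule may differ.
    """
    if len(checkpoints) != len(passed):
        raise ValueError("need exactly one pass/fail judgment per checkpoint")
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("total rubric weight must be positive")
    earned = sum(c.weight for c, ok in zip(checkpoints, passed) if ok)
    return earned / total


# Illustrative rubric with made-up checkpoints and weights.
example_rubric = [
    Checkpoint("Identifies the governing regulation", 3.0),
    Checkpoint("Quantifies the downside risk", 2.0),
    Checkpoint("States assumptions explicitly", 1.0),
]
example_score = rubric_score(example_rubric, [True, False, True])
```

In this sketch the judgments would come from a judge (human or LLM) marking each checkpoint as met or unmet; weighting lets a missed critical checkpoint cost more than a missed minor one.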
