Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose \textbf{JADE}, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to HealthBench and DR.BENCH, covering medical and 10-domain professional evaluation settings. Code and data are available at https://github.com/smiling-world/JADE.
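To make the evidence-dependency gating in Layer 2 concrete, the following is a minimal sketch and not the authors' implementation. It assumes each claim carries a verdict and a list of the claims it relies on. The names `Claim`, `gate_claims`, and the verdict labels are hypothetical, chosen only for illustration. A claim is invalidated if it is refuted itself or if it depends, directly or transitively, on an invalidated claim.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Claim:
    """Hypothetical claim record; field names are illustrative assumptions."""
    claim_id: str
    verdict: str                       # "supported" or "refuted"
    depends_on: Tuple[str, ...] = ()   # IDs of claims this claim relies on


def gate_claims(claims: List[Claim]) -> Dict[str, bool]:
    """Return, for each claim ID, whether the claim survives gating.

    A claim is invalid if it is not supported, or if any claim it depends on
    is invalid. Dependencies on unknown IDs are treated as invalid
    (an assumption of this sketch).
    """
    known = {c.claim_id for c in claims}
    # Start from the claims that were not verified as supported.
    invalid = {c.claim_id for c in claims if c.verdict != "supported"}

    # Propagate invalidation until no further claim changes status.
    changed = True
    while changed:
        changed = False
        for c in claims:
            if c.claim_id in invalid:
                continue
            if any(d in invalid or d not in known for d in c.depends_on):
                invalid.add(c.claim_id)
                changed = True

    return {c.claim_id: c.claim_id not in invalid for c in claims}


if __name__ == "__main__":
    report = [
        Claim("e1", "supported"),
        Claim("e2", "refuted"),
        Claim("c1", "supported", ("e1",)),   # built on verified evidence
        Claim("c2", "supported", ("e2",)),   # built on a refuted claim
        Claim("c3", "supported", ("c2",)),   # transitively gated
    ]
    print(gate_claims(report))
```

In this toy report, `c2` and `c3` are invalidated even though a judge marked them as supported in isolation, because their evidence chain passes through the refuted claim `e2`. The fixed-point loop is used here so that cyclic dependencies cannot cause infinite recursion.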