We decompose physician disagreement in the HealthBench medical AI evaluation dataset to determine where the variance resides and which observable features explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6–6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant case-level residual (81.8%) is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0 of 300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming that physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles the odds of disagreement (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no detectable effect (OR = 1.01, p = 0.90); even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural. The reducible/irreducible dissociation nonetheless suggests that closing information gaps in evaluation scenarios could lower disagreement, whereas inherent clinical ambiguity offers no comparable lever, pointing toward actionable improvements in evaluation design.
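To make the reported odds ratios concrete, the following minimal sketch converts an odds ratio into a disagreement probability at a given baseline rate. Only the two odds ratios (2.55 and 1.01) come from the abstract. The baseline disagreement rates and the helper name `apply_odds_ratio` are hypothetical and chosen purely for illustration; they are not figures or methods from the study.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)  # p / (1 - p)
    new_odds = baseline_odds * odds_ratio                  # scale the odds
    return new_odds / (1.0 + new_odds)                     # back to a probability


# Odds ratios reported in the abstract.
OR_REDUCIBLE = 2.55    # missing context, ambiguous phrasing
OR_IRREDUCIBLE = 1.01  # genuine medical ambiguity

# Hypothetical baseline disagreement rates (assumptions, not study data).
HYPOTHETICAL_BASELINES = (0.10, 0.20, 0.30)

for p in HYPOTHETICAL_BASELINES:
    reducible = apply_odds_ratio(p, OR_REDUCIBLE)
    irreducible = apply_odds_ratio(p, OR_IRREDUCIBLE)
    print(f"baseline={p:.2f}  reducible={reducible:.3f}  irreducible={irreducible:.3f}")
```

Because an odds ratio scales odds rather than probabilities, an OR of 2.55 does not multiply the disagreement rate by 2.55; the implied change in probability depends on the baseline, which is why the sketch loops over several assumed values.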