We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.
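As a rough illustration of two quantities in the abstract, the sketch below shows why disagreement would be expected to peak on borderline cases and what an odds ratio of 2.55 implies for a disagreement rate. It rests on assumptions that are not taken from the study: two raters label a criterion independently with the same probability `p_met`, and the baseline disagreement rate of 0.20 is a hypothetical value chosen only for the arithmetic. The function names are illustrative, not part of HealthBench.

```python
def pairwise_disagreement(p_met: float) -> float:
    """Probability that two raters give different met/not-met labels.

    Assumption (not from the study): the raters label independently and
    each marks the criterion as met with the same probability p_met.
    """
    return 2.0 * p_met * (1.0 - p_met)


def apply_odds_ratio(baseline_rate: float, odds_ratio: float) -> float:
    """Probability implied by multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_rate / (1.0 - baseline_rate)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    # Inverted-U: disagreement is low near p_met = 0 or 1 and highest at 0.5.
    for p in (0.05, 0.25, 0.50, 0.75, 0.95):
        print(f"p_met={p:.2f}  disagreement={pairwise_disagreement(p):.3f}")

    # Hypothetical baseline rate of 0.20 combined with the reported OR = 2.55.
    shifted = apply_odds_ratio(0.20, 2.55)
    print(f"disagreement rate under reducible uncertainty: {shifted:.3f}")
```

Under these assumptions the disagreement probability is symmetric around `p_met = 0.5`, which is the shape the inverted-U finding describes. The odds-ratio conversion depends strongly on the baseline rate, so the shifted rate should be read only as an example of the calculation.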