Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement.

Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases.

Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement.

Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies.

Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders of magnitude lower cost. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.
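The validation check and the agreement statistic described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the definition of the score gap (difference of medians over repeated scoring runs), the stability range (max minus min of repeated scores), and the reading of "tau" as Kendall's tau are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import combinations
from statistics import median


def validate_rubric(preferred_scores, rejected_scores):
    """Summarize repeated scoring runs for one rubric (hypothetical helper).

    Each argument is a list of percentage scores produced by repeated runs
    of a scoring agent on the clinician-preferred and the rejected output.
    """
    # Assumed definition: gap between the median preferred and rejected scores.
    gap = median(preferred_scores) - median(rejected_scores)
    # Assumed definition: stability as the spread of repeated preferred scores.
    stability_range = max(preferred_scores) - min(preferred_scores)
    # "Consistently higher": every preferred run beats every rejected run.
    passes = min(preferred_scores) > max(rejected_scores)
    return {"gap": gap, "range": stability_range, "passes": passes}


def kendall_tau_a(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same items (no tie correction)."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have equal length of at least 2")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Positive product: both raters order items i and j the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

In this sketch, a rubric counts as validated only when the lowest preferred score still exceeds the highest rejected score, which is one strict reading of "consistently scored higher". For rankings with many ties, which ceiling compression would make more likely, a tie-corrected variant such as tau-b may be more appropriate.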
