Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

Medical diagnostics is a high-stakes, complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) capture few of the key challenges of clinical diagnostic scenarios. Most rely on benchmarks derived from public exams, which introduces contamination bias that can inflate performance, and they overlook the confounded nature of real consultations, which go beyond textbook cases. Recent dynamic evaluations offer a promising alternative, but they often remain insufficient for diagnosis-oriented benchmarking: they offer limited coverage of clinically grounded confounders and of trustworthiness beyond accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that provides a controlled and scalable stress test of diagnostic robustness. Unlike static exam-style questions, DyReMe generates fresh, consultation-style cases that incorporate clinically grounded confounders, such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to capture heterogeneous patient-style descriptions. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments show that this dynamic approach yields more challenging assessments and exposes substantial weaknesses of state-of-the-art LLMs under clinically confounded diagnostic settings. These findings highlight the urgent need for evaluation frameworks that better assess trustworthy medical diagnostics¹ under clinically grounded confounders.

¹ In this paper, "diagnostic trustworthiness" refers to the credibility and stability of a model's diagnostic results in the presence of clinical confounders.