MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications

Praveenkumar Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Prateek Munjal, Nada Saadi, Hamza A Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan

From arXiv, technical report

While Large Language Models (LLMs) achieve superhuman performance on standardized medical licensing exams, these static benchmarks have become saturated and increasingly disconnected from the functional requirements of clinical workflows. To bridge the gap between theoretical capability and verified utility, we introduce MEDIC, a comprehensive evaluation framework establishing leading indicators across various clinical dimensions. Beyond standard question-answering, we assess operational capabilities using deterministic execution protocols and a novel Cross-Examination Framework (CEF), which quantifies information fidelity and hallucination rates without reliance on reference texts. Our evaluation across a heterogeneous task suite exposes critical performance trade-offs: we identify a significant knowledge-execution gap, where proficiency in static retrieval does not predict success in operational tasks such as clinical calculation or SQL generation. Furthermore, we observe a divergence between passive safety (refusal) and active safety (error detection), revealing that models fine-tuned for high refusal rates often fail to reliably audit clinical documentation for factual accuracy. These findings demonstrate that no single architecture dominates across all dimensions, highlighting the necessity of a portfolio approach to clinical model deployment. As part of this investigation, we released a public leaderboard on Hugging Face: https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

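The abstract does not spell out what a "deterministic execution protocol" looks like for SQL generation. One common way to score such a task is execution matching: run the model's query and a reference query against the same database and compare the returned rows, so that differently worded but equivalent queries both count as correct. The sketch below illustrates that idea only; it is not the MEDIC implementation, and the function name `execution_match`, the toy `admissions` schema, and the example queries are all hypothetical.

```python
import sqlite3


def execution_match(db_setup_sql: str, generated_sql: str, reference_sql: str) -> bool:
    """Return True when both queries yield the same multiset of rows.

    Hypothetical helper: an illustration of execution-based scoring,
    not the protocol used by the MEDIC authors.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Build a fresh in-memory database so every comparison is reproducible.
        conn.executescript(db_setup_sql)
        # Run the trusted reference query first, before any model output touches the DB.
        expected = conn.execute(reference_sql).fetchall()
        try:
            got = conn.execute(generated_sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            return False  # a query that fails to run counts as a miss
        # Compare order-insensitively; repr gives a total order over mixed types.
        return sorted(got, key=repr) == sorted(expected, key=repr)
    finally:
        conn.close()


# Toy schema and data, invented for this example.
SCHEMA = """
CREATE TABLE admissions (patient_id INTEGER, ward TEXT, los_days INTEGER);
INSERT INTO admissions VALUES (1, 'ICU', 5), (2, 'ICU', 9), (3, 'General', 2);
"""

reference = "SELECT AVG(los_days) FROM admissions WHERE ward = 'ICU'"
candidate_ok = "SELECT SUM(los_days) * 1.0 / COUNT(*) FROM admissions WHERE ward = 'ICU'"
candidate_bad = "SELECT AVG(los_days) FROM admissions"

print(execution_match(SCHEMA, candidate_ok, reference))
print(execution_match(SCHEMA, candidate_bad, reference))
```

Scoring on executed results rather than on query text is what makes the check deterministic: a candidate either reproduces the reference rows or it does not, with no judgment about surface similarity.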