A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

Siyuan Yan, Xieji Li, Dan Mo, Philipp Tschandl, Yiwen Jiang, Zhonghua Wang, Ming Hu, Lie Ju, Cristina Vico-Alonso, Yizhen Zheng, Jiahe Liu, Juexiao Zhou, Camilla Chello, Jen G. Cheung, Julien Anriot, Luc Thomas, Clare Primiero, Gin Tan, Aik Beng Ng, Simon See, Xiaoying Tang, Albert Ip, Xiaoyang Liao, Adrian Bowling, Martin Haskett, Shuang Zhao, Monika Janda, H. Peter Soyer, Victoria Mar, Harald Kittler, Zongyuan Ge

from arxiv, reports

Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders disentangle, without supervision, clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
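The abstract does not spell out how zero-shot diagnosis is performed. A minimal sketch, assuming the common approach for contrastively trained vision-language models: embed the image and one text prompt per candidate diagnosis in a shared space, then rank diagnoses by cosine similarity. The function names, the temperature value, and the three-dimensional embeddings below are hypothetical illustrations, not DermFM-Zero's actual interface or outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_rank(image_emb, text_embs, temperature=0.07):
    """Rank candidate labels for one image without any fine-tuning.

    image_emb:   embedding of the image (assumed to come from an image encoder)
    text_embs:   dict mapping label -> embedding of a prompt for that label
    temperature: softmax temperature (0.07 is an assumed, illustrative value)
    """
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    # Numerically stable softmax over the similarity logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    # Highest-probability label first, i.e. a ranked differential.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy embeddings standing in for real encoder outputs.
text_embs = {
    "melanoma": [0.9, 0.1, 0.0],
    "nevus": [0.1, 0.9, 0.0],
    "psoriasis": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

ranking = zero_shot_rank(image_emb, text_embs)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

Because the candidate labels enter only as text prompts, the label set can be changed at inference time without retraining, which is what makes the setting zero-shot.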
