Trustworthy and Fair SkinGPT-R1 for Democratizing Dermatological Reasoning across Diverse Ethnicities

Yuhao Shen, Zhangtianyi Chen, Yuanhao He, Yan Xu, Shuping Zhang, Liyuan Sun, Zijian Wang, Yinghao Zhu, Yuyuan Yang, Jiahe Qian, Ziwen Wang, Xinyuan Zhang, Wenbin Liu, Zongyuan Ge, Tao Lu, Siyuan Yan, Juexiao Zhou

The clinical translation of dermatological AI is hindered by opaque reasoning and systematic performance disparities across skin tones. Here we present SkinGPT-R1, a multimodal large language model that integrates chain-of-thought diagnostic reasoning with a fairness-aware mixture-of-experts architecture for interpretable and equitable skin disease diagnosis. Through parameter-efficient adaptation of a frozen reasoning backbone, SkinGPT-R1 generates structured diagnostic reports comprising visual findings, differential reasoning, and final diagnosis. Across seven external datasets spanning diverse pathologies and imaging conditions, SkinGPT-R1 achieves state-of-the-art accuracy on six benchmarks, including 82.50% on a challenging 40-class long-tail classification task (+19.30% over leading baselines). Blinded evaluation by five board-certified dermatologists on 1,000 phenotypically balanced cases yields a mean score of 3.6 out of 5, with the highest ratings in safety (3.8) and reasoning coherence (3.6), indicating that the generated rationales are clinically safe, logically grounded, and suitable for supporting diagnostic decision-making. Critically, SkinGPT-R1 mitigates algorithmic bias across the full Fitzpatrick spectrum, achieving a robust worst-group performance of 41.40% on the Fitz17k benchmark and a five-fold relative improvement in lower-bound accuracy on the DDI dataset compared to standard multimodal baselines. These results establish a framework for trustworthy, fair, and explainable AI-assisted dermatological diagnosis.
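
The worst-group figure quoted in the abstract is a fairness metric: accuracy is computed separately for each skin-tone group and the minimum is reported. The following is a minimal sketch of that idea, not the authors' evaluation code. The function names, the three-bin Fitzpatrick grouping, and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_group_accuracy(y_true, y_pred, groups):
    """Return a dict mapping each group label to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (group, accuracy) for the group with the lowest accuracy."""
    acc = per_group_accuracy(y_true, y_pred, groups)
    worst = min(acc, key=acc.get)
    return worst, acc[worst]


# Hypothetical toy data: labels and skin-tone bins are illustrative only.
y_true = ["eczema", "psoriasis", "eczema", "melanoma", "psoriasis", "eczema", "melanoma"]
y_pred = ["eczema", "psoriasis", "psoriasis", "melanoma", "eczema", "eczema", "nevus"]
groups = ["I-II", "I-II", "III-IV", "III-IV", "V-VI", "V-VI", "V-VI"]

worst_group, worst_acc = worst_group_accuracy(y_true, y_pred, groups)
print(worst_group, round(worst_acc, 3))
```

A model can score well on overall accuracy while one group lags far behind, and reporting the minimum makes that gap visible.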
