Skin cancer is one of the most common cancers worldwide, and early detection is critical for effective treatment. However, current AI diagnostic tools are often trained on datasets dominated by lighter skin tones, leading to reduced accuracy and fairness for people with darker skin. The International Skin Imaging Collaboration (ISIC) dataset, one of the most widely used benchmarks, contains over 70% light-skin images, while images of darker skin account for fewer than 8%. This imbalance poses a significant barrier to equitable healthcare delivery and highlights the urgent need for methods that address demographic diversity in medical imaging. This paper addresses the challenge of skin tone imbalance in automated skin cancer detection from dermoscopic images. To overcome it, we present a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model using Low-Rank Adaptation (LoRA) on the dark-skin image subset of the ISIC dataset and generates synthetic dermoscopic images conditioned on lesion type and skin tone. We investigate the utility of these images on two downstream tasks: lesion segmentation and binary classification. For segmentation, models trained on the augmented dataset and evaluated on held-out real images show consistent improvements in IoU, Dice coefficient, and boundary accuracy; these evaluations verify the quality of the generated dataset. For classification, an EfficientNet-B0 model trained on the augmented dataset achieves 92.14% accuracy. This paper demonstrates that synthetic data augmentation with generative AI can substantially reduce bias and increase fairness in dermatological diagnostics, and we outline open challenges for future directions.
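The segmentation evaluation above reports IoU and the Dice coefficient on held-out real images. As a minimal sketch (not the paper's actual evaluation code), these two overlap metrics for binary lesion masks can be computed as follows; the function name and toy masks are illustrative assumptions:

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, target: np.ndarray):
    """Compute IoU and Dice coefficient for binary lesion masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Empty-mask convention: perfect score when both masks are empty
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)

# Toy 4x4 masks: prediction overlaps ground truth in 2 of 3 lesion pixels
pred = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 0]])
target = np.array([[0, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 0]])
iou, dice = iou_and_dice(pred, target)
# intersection = 2, union = 4 -> IoU = 0.5; Dice = 2*2/(3+3) ~ 0.667
```

Dice weights the intersection more heavily than IoU, so it is typically higher on the same pair of masks; reporting both, as the paper does, gives a more complete picture of segmentation overlap.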