Domain shifts in dermoscopic skin cancer datasets: Evaluation of essential limitations for clinical translation

The limited ability of convolutional neural networks (CNNs) to generalize to images from previously unseen domains is a major obstacle, in particular for safety-critical clinical tasks such as dermoscopic skin cancer classification. To translate CNN-based applications into the clinic, it is essential that they can adapt to domain shifts. Such shifts can arise from the use of different image acquisition systems or from varying lighting conditions. In dermoscopy, shifts can also arise from changes in patient age or from the occurrence of rare lesion localizations (e.g., palms). These conditions are not prominently represented in most training datasets and can therefore lead to a decrease in performance. To verify the generalizability of classification models in real-world clinical settings, it is crucial to have access to data that mimics such domain shifts. To our knowledge, no dermoscopic image dataset exists in which such domain shifts are properly described and quantified. We therefore grouped publicly available images from the ISIC archive based on their metadata (e.g., acquisition location, lesion localization, patient age) to generate meaningful domains. To verify that these domains are in fact distinct, we used multiple quantification measures to estimate the presence and intensity of domain shifts. Additionally, we analyzed classification performance on these domains with and without an unsupervised domain adaptation technique. We observed that domain shifts do in fact exist in most of our grouped domains. Based on our results, we believe these datasets to be helpful for testing the generalization capabilities of dermoscopic skin cancer classifiers.
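
The grouping and quantification steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record fields (`image_id`, `site`, `localization`, `age`), the age cutoff, and the choice of Jensen-Shannon divergence over a scalar per-image feature are all assumptions made for illustration, since the abstract does not name the metadata columns or the quantification measures that were used.

```python
from collections import defaultdict
from math import log2


def group_into_domains(records, age_cutoff=30):
    """Group metadata records into domains.

    Each record is a dict with the hypothetical keys 'image_id', 'site',
    'localization' and 'age'. The age cutoff is an illustrative assumption.
    """
    domains = defaultdict(list)
    for rec in records:
        age_group = "young" if rec["age"] < age_cutoff else "older"
        key = (rec["site"], rec["localization"], age_group)
        domains[key].append(rec["image_id"])
    return dict(domains)


def histogram(values, bins=8, lo=0.0, hi=1.0):
    """Normalized histogram of scalar features (e.g. mean image brightness)."""
    if not values:
        raise ValueError("histogram needs at least one value")
    counts = [0] * bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for disjoint ones.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In this sketch, two domains would be compared by computing a histogram of some per-image feature for each domain and passing both histograms to `js_divergence`; larger values suggest a stronger shift in that feature. A single scalar feature is a deliberate simplification and would not capture every kind of shift.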
