Domain shifts in dermoscopic skin cancer datasets: Evaluation of essential limitations for clinical translation

The limited ability of Convolutional Neural Networks (CNNs) to generalize to images from previously unseen domains is a major limitation, particularly for safety-critical clinical tasks such as dermoscopic skin cancer classification. To translate CNN-based applications into the clinic, it is essential that they can adapt to domain shifts. Such shifts can arise from the use of different image acquisition systems or from varying lighting conditions. In dermoscopy, shifts can also arise from changes in patient age or from the occurrence of rare lesion localizations (e.g. palms). Such cases are not prominently represented in most training datasets and can therefore lead to a decrease in performance. To verify the generalizability of classification models in real-world clinical settings, it is crucial to have access to data that mimic such domain shifts. To our knowledge, no dermoscopic image dataset exists in which such domain shifts are properly described and quantified. We therefore grouped publicly available images from the ISIC archive based on their metadata (e.g. acquisition location, lesion localization, patient age) to generate meaningful domains. To verify that these domains are in fact distinct, we used multiple quantification measures to estimate the presence and intensity of domain shifts. Additionally, we analyzed classification performance on these domains with and without an unsupervised domain adaptation technique. We observed that domain shifts do in fact exist in most of our grouped domains. Based on our results, we believe these datasets to be helpful for testing the generalization capabilities of dermoscopic skin cancer classifiers.
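
As a rough illustration of the grouping and quantification idea, the following minimal Python sketch assigns images to domains from their metadata and compares two domains with a divergence score. Everything in it is an assumption made for illustration: the record fields (`site`, `localization`, `age`, `feature_bin`), the age threshold, and the choice of the Jensen-Shannon divergence are hypothetical stand-ins, not the ISIC metadata schema and not necessarily the quantification measures used in this work.

```python
import math
from collections import Counter, defaultdict


def assign_domain(record, age_threshold=30):
    """Build a domain key from metadata (field names and threshold are hypothetical)."""
    age_group = "age<=30" if record["age"] <= age_threshold else "age>30"
    return (record["site"], record["localization"], age_group)


def group_by_domain(records):
    """Group metadata records into domains keyed by assign_domain()."""
    domains = defaultdict(list)
    for record in records:
        domains[assign_domain(record)].append(record)
    return dict(domains)


def histogram(records, key, bins):
    """Normalized histogram of a discrete field over a fixed list of bins.

    Assumes at least one record falls into one of the bins.
    """
    counts = Counter(record[key] for record in records)
    total = sum(counts[b] for b in bins)
    return [counts[b] / total for b in bins]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions with
    disjoint support.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy records; "feature_bin" stands in for any binned image statistic.
RECORDS = [
    {"site": "A", "localization": "torso", "age": 55, "feature_bin": 0},
    {"site": "A", "localization": "torso", "age": 60, "feature_bin": 0},
    {"site": "B", "localization": "palms", "age": 25, "feature_bin": 1},
    {"site": "B", "localization": "palms", "age": 28, "feature_bin": 1},
]

domains = group_by_domain(RECORDS)
source = histogram(domains[("A", "torso", "age>30")], "feature_bin", bins=[0, 1])
target = histogram(domains[("B", "palms", "age<=30")], "feature_bin", bins=[0, 1])
shift_score = js_divergence(source, target)
```

In practice such a score would more plausibly be computed on features extracted from the images themselves rather than on a single toy field, and a study would typically combine several measures, since no single divergence captures every kind of shift.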
