The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of the data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of two popular dermatological image datasets, DermaMNIST and Fitzpatrick17k, uncover these data quality issues, measure their effects on the benchmark results, and propose corrections to the datasets. By making our analysis pipeline and the accompanying code publicly available, we ensure the reproducibility of our analysis and aim to encourage similar explorations and to facilitate the identification and correction of potential data quality issues in other large datasets.