Background: Health datasets from clinical sources do not reflect the breadth and diversity of disease in the real world, which affects research, medical education, and artificial intelligence (AI) tool development. Dermatology is a suitable area in which to develop and test a new, scalable method for creating representative health datasets. Methods: We used Google Search advertisements to invite contributions to an open-access dataset of images of dermatology conditions, together with demographic and symptom information. With informed contributor consent, we describe and release this dataset, which contains 10,408 images from 5,033 contributions made by internet users in the United States over 8 months starting in March 2023. The dataset includes dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and estimated Monk Skin Tone (eMST) labels for the images. Results: We received a median of 22 submissions/day (IQR 14-30). Female (66.72%) and younger (52% under age 40) contributors had a higher representation in the dataset compared to the US population, and 32.6% of contributors reported a non-White racial or ethnic identity. Over 97.5% of contributions were genuine images of skin conditions. Dermatologist confidence in assigning a differential diagnosis increased with the number of available variables, and showed a weaker correlation with image sharpness (Spearman's P values <0.001 and 0.01, respectively). Most contributions involved conditions of short duration (54% with onset < 7 days ago), and 89% were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset. The dataset is available at github.com/google-research-datasets/scin. Conclusion: Search ads are effective at crowdsourcing images of health conditions. The SCIN dataset bridges important gaps in the availability of representative images of common skin conditions.