Much current deep learning (DL) research focuses on improving quantitative metrics regardless of other factors. In human-centered applications such as skin lesion classification in dermatology, DL-driven clinical decision support systems are still in their infancy because their decision-making process offers limited transparency. Moreover, the lack of procedures that can explain the behavior of trained DL algorithms leaves clinicians with almost no trust in them. To diagnose skin lesions, dermatologists rely on visual assessment of the disease and on data gathered from the patient's anamnesis. Data-driven algorithms dealing with multi-modal data are limited by the separation of feature-level and decision-level fusion procedures required by convolutional architectures. To address this issue, we enable single-stage multi-modal data fusion via the attention mechanism of transformer-based architectures to aid in diagnosing skin diseases. Our method outperforms other state-of-the-art single- and multi-modal DL architectures in image-rich and patient-data-rich environments. Additionally, the choice of architecture provides native interpretability for the classification task in both the image and metadata domains, with no additional modifications necessary.
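The single-stage fusion idea can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that image patches and patient metadata fields have already been embedded as tokens of the same width, and that a classification token is prepended. A single self-attention layer then mixes both modalities in one step, and the attention row of the classification token gives a rough per-token relevance score in both domains. The token counts, the embedding width, the random weights, and the use of a classification token are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def random_matrix(rows, cols, rng):
    """Stand-in for learned embeddings or weights (random, for illustration only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
dim = 8          # hypothetical embedding width
num_patches = 4  # hypothetical number of image patch tokens
num_meta = 3     # hypothetical number of metadata tokens (e.g. age, sex, site)

cls_token = random_matrix(1, dim, rng)
image_tokens = random_matrix(num_patches, dim, rng)
meta_tokens = random_matrix(num_meta, dim, rng)

# Single-stage fusion: both modalities share one token sequence.
tokens = cls_token + image_tokens + meta_tokens

w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
fused, weights = self_attention(tokens, w_q, w_k, w_v)

# Attention of the classification token over each modality's tokens.
cls_attention = weights[0]
image_attention = cls_attention[1:1 + num_patches]
meta_attention = cls_attention[1 + num_patches:]
```

Because every token attends to every other token, no separate feature-level or decision-level fusion stage is needed in this sketch. In a trained model, `image_attention` and `meta_attention` would be the kind of quantity one could inspect for interpretability in the image and metadata domains.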