Lung Disease Detection with Vision Transformers: A Comparative Study of Machine Learning Methods

Recent advances in medical image analysis have relied predominantly on Convolutional Neural Networks (CNNs), which achieve strong performance in chest X-ray classification, such as the 92% AUC reported by AutoThorax-Net and the 88% AUC achieved by ChexNet. In the medical field, however, even small improvements in accuracy can have significant clinical implications. This study explores the application of Vision Transformers (ViT), a state-of-the-art machine learning architecture, to chest X-ray analysis, with the aim of improving diagnostic accuracy. I present a comparative analysis of two ViT-based approaches: one that uses full chest X-ray images and another that focuses on segmented lung regions. Experiments demonstrate that both methods surpass traditional CNN-based models, with the full-image ViT achieving up to 97.83% accuracy and the lung-segmented ViT reaching 96.58% accuracy in three-label disease classification, and an AUC of 94.54% when the number of labels is increased to eight. Notably, the full-image approach shows superior performance across all metrics, including precision, recall, F1 score, and AUC-ROC. These findings suggest that Vision Transformers can capture relevant features from chest X-rays without explicit lung segmentation, potentially simplifying the preprocessing pipeline while maintaining high accuracy. This research adds to the growing body of evidence supporting transformer-based architectures in medical image analysis and highlights their potential to enhance diagnostic precision in clinical settings.
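To make the reported label-based metrics concrete, the following minimal sketch shows how accuracy and macro-averaged precision, recall, and F1 could be computed for a three-label classifier. It is an illustration under stated assumptions, not the evaluation code used in the study: the integer class IDs, the toy predictions, the choice of macro averaging, and the function names `per_class_metrics`, `macro_average`, and `accuracy` are all hypothetical. AUC-ROC is left out because it requires predicted scores rather than hard labels.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true label was t
            fn[t] += 1  # an instance of t was missed
    metrics = {}
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of (precision, recall, f1) over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: class IDs 0, 1, 2 are placeholders, not the study's disease labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

scores = per_class_metrics(y_true, y_pred, labels=[0, 1, 2])
print("accuracy:", accuracy(y_true, y_pred))
print("macro precision, recall, F1:", macro_average(scores))
```

Running the same functions on the predictions of the full-image and lung-segmented models, using one shared test set, would give the kind of side-by-side comparison the abstract describes.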
