Significantly Improving Zero-Shot X-ray Pathology Classification via Fine-tuning Pre-trained Image-Text Encoders

Deep neural networks have been successfully applied to diverse domains, including pathology classification based on medical images. However, the large-scale, high-quality data needed to train powerful neural networks are rare in the medical domain, because labeling must be done by qualified experts. Researchers have recently tackled this problem with some success by taking advantage of models pre-trained on large-scale general-domain data. Specifically, they took contrastive image-text encoders (e.g., CLIP) and fine-tuned them on chest X-ray images and paired reports to perform zero-shot pathology classification, thus completely removing the need for pathology-annotated images to train a classification model. Existing studies, however, fine-tuned the pre-trained model with the same contrastive learning objective used in pre-training, and failed to exploit the multi-labeled nature of medical image-report pairs. In this paper, we propose a new fine-tuning strategy based on sentence sampling and positive-pair loss relaxation for improving downstream zero-shot pathology classification performance, which can be applied to any pre-trained contrastive image-text encoder. Our method consistently improved zero-shot pathology classification performance across four different chest X-ray datasets and three different pre-trained models (5.77% average AUROC increase). In particular, CLIP fine-tuned with our method performed comparably to, or marginally better than, board-certified radiologists (0.619 vs 0.625 in F1 score and 0.530 vs 0.544 in MCC) in zero-shot classification of five prominent diseases from the CheXpert dataset.
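
To make the zero-shot setting concrete, the following is a minimal sketch of how a contrastive image-text encoder can score pathologies without any pathology-annotated training images. It is an illustration under stated assumptions, not the method proposed in this paper: it assumes that each pathology is described by one positive and one negative text prompt, that the image and text embeddings have already been produced by some encoder, and that the score is a softmax over the two prompt similarities. The function names, the temperature value, and the random embeddings used as stand-ins are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_scores(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Score each image against each pathology (hypothetical helper).

    image_emb:    (N, D) image embeddings.
    pos_text_emb: (C, D) embeddings of prompts stating the pathology is present.
    neg_text_emb: (C, D) embeddings of prompts stating the pathology is absent.
    Returns an (N, C) array of values in [0, 1].
    """
    img = l2_normalize(image_emb)
    pos = l2_normalize(pos_text_emb)
    neg = l2_normalize(neg_text_emb)
    pos_logits = img @ pos.T / temperature
    neg_logits = img @ neg.T / temperature
    # A softmax over the (positive, negative) pair equals a sigmoid of the
    # logit difference; each pathology is scored independently (multi-label).
    return 1.0 / (1.0 + np.exp(-(pos_logits - neg_logits)))


if __name__ == "__main__":
    # Random stand-ins for encoder outputs: 4 images, 5 pathologies, 16 dims.
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 16))
    pos_prompts = rng.normal(size=(5, 16))
    neg_prompts = rng.normal(size=(5, 16))
    print(zero_shot_scores(images, pos_prompts, neg_prompts).shape)
```

Because every pathology receives its own independent score, a single image can be flagged for several findings at once, which matches the multi-labeled nature of chest X-ray reports noted above.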
