Deep neural networks have shown impressive performance for image-based disease detection. Performance is commonly evaluated through clinical validation on independent test sets to demonstrate clinically acceptable accuracy. Reporting good performance metrics on test sets, however, is not always a sufficient indication of the generalizability and robustness of an algorithm. In particular, when the test data are drawn from the same distribution as the training data, performance on this independent and identically distributed (iid) test set can be an unreliable estimate of the accuracy on new data. In this paper, we employ stress testing to assess model robustness and subgroup performance disparities in disease detection models. We design progressive stress testing using five bidirectional and unidirectional image perturbations, each at six severity levels. As a use case, we apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images, and demonstrate the importance of studying class and domain-specific model behaviour. Our experiments indicate that some models may yield more robust and equitable performance than others. We also find that pretraining characteristics play an important role in downstream robustness. We conclude that progressive stress testing is a viable and important tool and should become standard practice in the clinical validation of image-based disease detection models.
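To make the idea of progressive stress testing concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the perturbations or their step sizes, so `brightness_shift` (a bidirectional example), `gaussian_noise` (a unidirectional example), the step sizes, the use of level 0 as an unperturbed baseline, and the toy threshold "model" are all assumptions made for illustration.

```python
import numpy as np


def brightness_shift(images, level, direction=+1):
    """Bidirectional (assumed example): brighten (+1) or darken (-1)."""
    return np.clip(images + direction * 0.1 * level, 0.0, 1.0)


def gaussian_noise(images, level, rng):
    """Unidirectional (assumed example): noise std grows with the level."""
    noise = rng.normal(0.0, 0.05 * level, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)


def stress_test(predict, images, labels, perturb, levels=range(7)):
    """Accuracy per severity level; level 0 is the unperturbed baseline."""
    return [
        float(np.mean(predict(perturb(images, lv)) == labels))
        for lv in levels
    ]


def predict(x):
    """Hypothetical stand-in model: threshold on mean image intensity."""
    return (x.mean(axis=(1, 2)) > 0.5).astype(int)


# Toy data (hypothetical): flat 8x8 images, dark for class 0, bright for class 1.
labels = np.array([0, 1] * 10)
images = np.where(labels[:, None, None] == 1, 0.7, 0.3) * np.ones((20, 8, 8))

rng = np.random.default_rng(0)
results = {
    "brightness_up": stress_test(
        predict, images, labels, lambda x, lv: brightness_shift(x, lv, +1)
    ),
    "brightness_down": stress_test(
        predict, images, labels, lambda x, lv: brightness_shift(x, lv, -1)
    ),
    "noise": stress_test(
        predict, images, labels, lambda x, lv: gaussian_noise(x, lv, rng)
    ),
}

for name, accuracies in results.items():
    print(name, [round(a, 2) for a in accuracies])
```

Reading each accuracy list from left to right shows how quickly performance degrades as severity increases. Computing the same curves separately per class or per subgroup would be one way to surface the kind of disparities the abstract refers to.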