Medical imaging diagnosis increasingly relies on Machine Learning (ML) models. The task is often hampered by severely imbalanced datasets, in which positive cases can be quite rare. The use of such models is further compromised by their limited interpretability, a property that is becoming increasingly important. While post-hoc interpretability techniques such as SHAP and LIME have been used with some success on so-called black-box models, inherently understandable models make such endeavors more fruitful. This paper addresses both issues by demonstrating how a relatively new synthetic data generation technique, STEM, can be used to produce training data for inherently understandable models evolved by Grammatical Evolution (GE). STEM is a recently introduced combination of the Synthetic Minority Oversampling Technique (SMOTE), Edited Nearest Neighbour (ENN), and Mixup; it has previously been used successfully to tackle both between-class and within-class imbalance. We test our technique on the Digital Database for Screening Mammography (DDSM) and the Wisconsin Breast Cancer (WBC) datasets and compare Area Under the Curve (AUC) results with an ensemble of the top three performing classifiers from a set of eight standard ML classifiers with varying degrees of interpretability. We demonstrate that the GE-derived models achieve the best AUC while remaining interpretable.