The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While biases can arise and compound during problem selection, data collection, and outcome definition, this research addresses generalizability impediments that occur during the development and post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distributions of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both Extreme Gradient Boosting and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduce the double discriminant scoring of type I and show that it is the most generalizable, as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male, and female Framingham coronary heart disease data.
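As an illustration of the cutoff-selection step mentioned above, the following is a minimal sketch, not the procedure used in this work. It assumes the cutoff is chosen by maximizing Youden's J statistic (sensitivity + specificity - 1) over a grid of candidate values; the abstract does not state which criterion is used, so this choice is an assumption. The function name `select_cutoff`, the grid, and the synthetic data are hypothetical and stand in for predicted probabilities from a fitted regression model.

```python
import numpy as np


def select_cutoff(y_true, p_hat, grid=None):
    """Return (cutoff, J) where cutoff maximizes Youden's J on the grid.

    Assumption: Youden's J is used as the selection criterion; other
    criteria (e.g. cost-weighted error) would change the chosen cutoff.
    Both classes must be present in y_true.
    """
    y_true = np.asarray(y_true).astype(bool)
    p_hat = np.asarray(p_hat, dtype=float)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # hypothetical candidate cutoffs

    best_cutoff, best_j = None, -np.inf
    for c in grid:
        pred = p_hat >= c  # classify as positive at or above the cutoff
        tp = np.sum(pred & y_true)
        fn = np.sum(~pred & y_true)
        tn = np.sum(~pred & ~y_true)
        fp = np.sum(pred & ~y_true)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if j > best_j:  # strict inequality keeps the smallest maximizing cutoff
            best_cutoff, best_j = float(c), float(j)
    return best_cutoff, best_j


# Synthetic, unbalanced example (about 15% positives); not Framingham data.
rng = np.random.default_rng(0)
y = rng.random(2000) < 0.15
p_hat = np.clip(0.15 + 0.15 * y + rng.normal(0.0, 0.1, 2000), 0.0, 1.0)

cutoff, j_stat = select_cutoff(y, p_hat)
print(f"selected cutoff: {cutoff:.2f}, Youden's J: {j_stat:.3f}")
```

With an unbalanced outcome, a cutoff chosen this way will typically sit well below the default 0.5, which is one reason the default can make a reasonable regression model look like a poor classifier.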