Comparison of Machine Learning Classification Algorithms and Application to the Framingham Heart Study

The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While biases can arise and compound during problem selection, data collection, and outcome definition, this research addresses generalizability impediments that occur during the development and post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distribution of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both Extreme Gradient Boosting and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduce double discriminant scoring of type I and show that it is the most generalizable, as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male, and female Framingham coronary heart disease data.
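The abstract does not state which criterion the authors use to pick the probability cutoff. As a minimal sketch of the general idea, the example below assumes one common choice, maximizing Youden's J (sensitivity + specificity - 1), which may differ from the paper's method. The function name `youden_cutoff` and the synthetic scores are hypothetical and are not drawn from the Framingham data.

```python
import random


def youden_cutoff(probs, labels):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1.

    probs  : predicted probabilities of the positive class
    labels : true classes coded as 0 or 1 (both classes must be present)
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_cutoff, best_j = 0.5, -1.0
    # Every distinct predicted probability is a candidate cutoff.
    for c in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < c and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


# Synthetic, unbalanced scores (about 15% positives); NOT Framingham data.
rng = random.Random(0)
labels = [1 if rng.random() < 0.15 else 0 for _ in range(500)]
probs = [min(1.0, max(0.0, rng.gauss(0.35 if y else 0.12, 0.1))) for y in labels]

cutoff, j = youden_cutoff(probs, labels)
print(f"cutoff={cutoff:.3f}, Youden J={j:.3f}")
```

On an unbalanced outcome such as this one, the selected cutoff typically falls well below the default 0.5, which is why the cutoff is chosen from the data rather than fixed in advance.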

