Many researchers and analysts are working to improve medical diagnosis for a range of diseases. Heart disease is among the most common and is a leading cause of mortality worldwide, and early detection substantially reduces the risk of heart failure. The Centers for Disease Control and Prevention (CDC) conducts a yearly health-related telephone survey of more than 400,000 participants. However, it remains unclear how reliable these data are for predicting heart disease and whether all of the survey questions are strongly related to it. This study applies several machine learning techniques, including support vector machines and logistic regression, to assess how accurately the CDC's survey data predict heart disease in the United States. We also use several feature selection methods to identify the most relevant subset of questions for forecasting heart conditions. To reach a robust conclusion, we perform a stability analysis by randomly sampling the data 300 times. The experimental results show that the survey data reach up to 80% predictive performance for heart disease, which can substantially improve the diagnostic process before bloodwork and other tests. In addition, the time spent conducting the survey can be reduced by 77% while maintaining the same level of performance.
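The stability analysis described in the abstract (repeating feature selection over 300 random samples of the data) can be illustrated with a minimal sketch. This is not the study's pipeline: the abstract does not name the feature selection methods used, so the sketch swaps in a simple correlation-based ranking, and the data are synthetic rather than CDC survey responses. All function names, the 50% sample fraction, and the choice of `k` are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between one feature column and a 0/1 label."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(cov / (sx * sy))


def top_k_features(rows, labels, k):
    """Rank features by absolute correlation with the label; return the top k indices.

    Assumption: this ranking stands in for the (unspecified) feature
    selection methods used in the study.
    """
    n_features = len(rows[0])
    scores = [abs_correlation([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def selection_frequency(rows, labels, k, n_runs=300, sample_frac=0.5, seed=0):
    """Repeat selection on random subsamples; return how often each feature is chosen."""
    rng = random.Random(seed)
    n = len(rows)
    m = max(2, int(n * sample_frac))
    counts = Counter()
    for _ in range(n_runs):
        idx = rng.sample(range(n), m)  # subsample without replacement
        counts.update(top_k_features([rows[i] for i in idx], [labels[i] for i in idx], k))
    return {j: counts[j] / n_runs for j in range(len(rows[0]))}


def make_synthetic(n=400, n_features=6, seed=1):
    """Synthetic stand-in for survey answers: features 0 and 1 track the label, the rest are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        informative = [y + rng.gauss(0, 0.5), y + rng.gauss(0, 0.5)]
        noise = [rng.gauss(0, 1) for _ in range(n_features - 2)]
        rows.append(informative + noise)
        labels.append(y)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_synthetic()
    freq = selection_frequency(rows, labels, k=2, n_runs=300)
    for j, f in sorted(freq.items()):
        print(f"feature {j}: selected in {f:.0%} of runs")
```

A feature that is selected in nearly every run is a stable choice, while one that appears only occasionally is likely an artifact of a particular sample. In practice, the ranking step would be replaced by whichever selection method and classifier are under study.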