Despite ongoing efforts to combat the coronavirus disease 2019 (COVID-19) pandemic, much remains uncertain about its spread, future impact, and resurgence. In this paper, we present a three-stage data-driven approach to distill hidden information about COVID-19. The first stage employs a Bayesian network structure learning method to identify the causal relationships among COVID-19 symptoms and their intrinsic demographic variables. In the second stage, the output of the Bayesian network structure learning guides the training of an unsupervised machine learning (ML) algorithm that uncovers similarities in patients' symptoms through clustering. The final stage uses the labels obtained from clustering to train a demographic symptom identification (DSID) model, which predicts a patient's symptom class and the corresponding demographic probability distribution. We applied our method to the COVID-19 dataset obtained from the Centers for Disease Control and Prevention (CDC) in the United States. Experimental results show a testing accuracy of 99.99%, compared with 41.15% for a heuristic ML method. These results indicate the viability of our Bayesian network and ML approach for understanding the relationships among the virus's symptoms and for providing insights into patient stratification aimed at reducing the severity of the virus.
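To make the second and third stages of the pipeline more concrete, the following is a minimal sketch under stated assumptions, not the method used in the paper. It assumes that the first stage (Bayesian network structure learning, not shown here) has already selected the symptom variables to cluster on. The clustering step is a small k-modes-style routine over binary symptom vectors, and the "DSID" step is stood in for by a nearest-mode classifier paired with an empirical demographic distribution per cluster. All function names, the toy symptom columns, and the age groups are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_mode(row, modes):
    """Index of the mode closest to `row` in Hamming distance."""
    return min(range(len(modes)), key=lambda j: hamming(row, modes[j]))


def cluster_symptoms(rows, k, n_iter=20):
    """Stage 2 stand-in: k-modes-style clustering of binary symptom tuples.

    Initial modes are the first k distinct rows (a deterministic,
    illustrative choice, not a recommended initialisation).
    """
    modes = []
    for r in rows:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [nearest_mode(r, modes) for r in rows]
        new_modes = []
        for j in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if not members:
                new_modes.append(modes[j])  # keep an empty cluster's mode
                continue
            # Column-wise most frequent value gives the new mode.
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


def demographic_distribution(labels, demographics):
    """Empirical P(demographic group | cluster) from co-occurrence counts."""
    counts = {}
    for lab, demo in zip(labels, demographics):
        counts.setdefault(lab, Counter())[demo] += 1
    return {
        lab: {d: c / sum(cnt.values()) for d, c in cnt.items()}
        for lab, cnt in counts.items()
    }


def predict(symptoms, modes, dist):
    """Stage 3 stand-in: symptom class plus its demographic distribution."""
    label = nearest_mode(symptoms, modes)
    return label, dist.get(label, {})


# Hypothetical toy data: columns are (fever, cough, loss_of_smell).
toy_symptoms = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 1), (0, 1, 1)]
toy_ages = ["60+", "60+", "60+", "18-39", "18-39", "40-59"]

toy_labels, toy_modes = cluster_symptoms(toy_symptoms, k=2)
toy_dist = demographic_distribution(toy_labels, toy_ages)
print(predict((1, 1, 1), toy_modes, toy_dist))
```

In this sketch the cluster labels play the role the abstract describes: they are produced without supervision and then reused as targets for a predictive step. A real implementation would replace the nearest-mode rule with a trained classifier and would evaluate it on held-out data.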