Researchers use Twitter and sentiment analysis to predict cardiovascular disease (CVD) risk. We developed a new dictionary of CVD-related keywords by analyzing the emotions expressed in tweets. Tweets were collected from eighteen US states, including the Appalachian region. Using the VADER model for sentiment analysis, we classified users as potentially at risk of CVD. Machine learning (ML) models were used to classify individuals' CVD risk and, for comparison, were also applied to a CDC dataset containing demographic information. Performance was evaluated with metrics including test accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), and Cohen's kappa (CK). The results show that analyzing the emotions in tweets predicted CVD risk better than demographic data alone, enabling the identification of individuals at potential risk of developing CVD. This research highlights the potential of natural language processing (NLP) and ML techniques for identifying individuals at risk of CVD from their tweets, providing an alternative to traditional demographic information for public health monitoring.
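As a rough illustration of the evaluation metrics named in the abstract, the following sketch computes accuracy, precision, recall, F1, MCC, and Cohen's kappa for a binary at-risk label directly from confusion-matrix counts. The function name `binary_metrics`, the 0/1 label encoding, and the toy labels are assumptions made for this example only; they are not taken from the study's code or data.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute common binary-classification metrics (1 = at risk, 0 = not at risk).

    Hypothetical helper for illustration; not the study's implementation.
    """
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one labelled example")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Toy labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Unlike accuracy, MCC and Cohen's kappa both account for agreement expected by chance, which makes them more informative when the at-risk class is rare.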