Investigating Causal Links from Observed Features in the First COVID-19 Waves in California

Determining who is at risk from a disease is essential for protecting vulnerable subpopulations during an outbreak. The current pandemic of COVID-19, the disease caused by SARS-CoV-2, has had a massive impact across the world, with some communities and individuals facing a higher risk of severe outcomes and death than others. These risks are compounded for people of lower socioeconomic status, people with limited access to health care, and people with higher rates of chronic diseases such as hypertension, type 2 diabetes, and obesity, likely due to the chronic stress of these living conditions. Essential workers are also at higher risk of COVID-19 because the nature of their work exposes them more often. In this study we identify the important features of the pandemic in California, measured as cumulative cases and deaths per 100,000 population up to 5 July 2021 (the date of analysis), using Pearson correlation coefficients between population demographic features and cumulative cases and deaths. The features most highly correlated with cases or deaths per 100,000, ranked by the absolute value of their Pearson correlation coefficients, were used to build regression models in two ways: using the top 5 features, and using the top 20 features after filtering to limit interactions between features. These models were used to determine (a) the most significant features within each subset and (b) features that approximate different potential forces on COVID-19 cases and deaths (especially for the top-20 set). Additionally, co-correlations were calculated for all features; a co-correlated feature is a demographic feature that is not in a given regression model's input feature set but is strongly correlated with the features that are.
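The feature-ranking and co-correlation steps described above can be illustrated with a minimal Python sketch. It is not the study's code: the feature names, the toy values, the helper names (`pearson`, `rank_features`, `co_correlated`), and the 0.8 co-correlation threshold are all assumptions chosen for illustration, and the regression step is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_features(features, outcome):
    """Return (name, r) pairs sorted by |r| against the outcome, strongest first."""
    scores = {name: pearson(values, outcome) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

def co_correlated(features, selected, threshold=0.8):
    """For each selected feature, list the non-selected features with |r| >= threshold.

    The threshold is a hypothetical cutoff, not a value taken from the study.
    """
    result = {}
    for name in selected:
        result[name] = [
            other for other in features
            if other not in selected
            and abs(pearson(features[name], features[other])) >= threshold
        ]
    return result

# Made-up values for five hypothetical counties (NOT data from the study).
features = {
    "pct_essential_workers": [10, 20, 30, 40, 50],
    "pct_uninsured": [11, 19, 32, 41, 48],
    "pct_remote_capable": [50, 42, 30, 18, 10],
    "median_age": [40, 38, 41, 39, 40],
}
cases_per_100k = [1000, 2100, 2900, 4200, 5000]

ranked = rank_features(features, cases_per_100k)
top_features = [name for name, _ in ranked[:1]]  # stand-in for "top 5" on real data
partners = co_correlated(features, top_features)

for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
print(partners)
```

Ranking by absolute value keeps strongly negative correlates (here the hypothetical `pct_remote_capable`) alongside strongly positive ones, which is why the sign is preserved in the output but ignored in the sort.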
