Coronavirus Disease 2019 (COVID-19) has had a profound impact on global health and the economy, making it crucial to build accurate and interpretable data-driven predictive models for COVID-19 cases to improve policy making. The extremely large scale of the pandemic and its intrinsically changing transmission characteristics pose great challenges for effective COVID-19 case prediction. To address these challenges, we propose a novel hybrid model that combines the interpretability of the autoregressive (AR) model with the predictive power of long short-term memory (LSTM) neural networks. The proposed hybrid model is formalized as a neural network whose architecture connects the two component model blocks, with their relative contributions determined data-adaptively during training. Through comprehensive numerical studies on two data sources under multiple evaluation metrics, we demonstrate that the hybrid model outperforms both of its component models as well as other popular predictive models. Specifically, on county-level data from 8 California counties, our hybrid model achieves a mean absolute percentage error (MAPE) of 4.173% on average, outperforming its component AR (5.629%) and LSTM (4.934%) models. On country-level datasets, our hybrid model outperforms widely used predictive models (AR, LSTM, SVM, Gradient Boosting, and Random Forest) in predicting COVID-19 cases in 8 countries around the world. In addition, we illustrate the interpretability of our proposed hybrid model, a key feature not shared by most black-box predictive models for COVID-19 cases. Our study provides a new and promising direction for building effective and interpretable data-driven models, which could have significant implications for public health policy making and for the control of current and potential future pandemics.
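To make the idea of a data-adaptive relative contribution concrete, the following is a minimal sketch under strong simplifying assumptions, not the architecture used in this work. Two fixed forecast series stand in for the outputs of the AR and LSTM blocks, and a single scalar weight `w` is learned by gradient descent on the mean squared relative error. The function names (`mape`, `fit_blend_weight`), the toy case counts, and the learning rate are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all names and numbers below are hypothetical,
# and the fixed forecast lists stand in for trained AR and LSTM blocks.

def mape(actual, predicted):
    """Mean absolute percentage error in percent (assumes no actual value is 0)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def fit_blend_weight(actual, pred_a, pred_b, lr=5.0, steps=500):
    """Learn w in [0, 1] so that w * pred_a + (1 - w) * pred_b tracks `actual`.

    Uses plain gradient descent on the mean squared relative error. The
    learning rate is tuned for the toy data below, not a general default.
    """
    w = 0.5
    n = len(actual)
    for _ in range(steps):
        grad = 0.0
        for y, a, b in zip(actual, pred_a, pred_b):
            blend = w * a + (1.0 - w) * b
            # Derivative of ((blend - y) / y) ** 2 with respect to w.
            grad += 2.0 * ((blend - y) / y) * ((a - b) / y) / n
        # Gradient step, then clip so w stays a valid mixing weight.
        w = min(1.0, max(0.0, w - lr * grad))
    return w


# Toy daily case counts and two imperfect forecasts (made-up numbers).
actual = [100.0, 110.0, 121.0, 133.0, 146.0]
ar_like = [90.0, 100.0, 110.0, 120.0, 130.0]      # under-predicts
lstm_like = [105.0, 115.0, 127.0, 140.0, 153.0]   # over-predicts

w = fit_blend_weight(actual, ar_like, lstm_like)
blended = [w * a + (1.0 - w) * b for a, b in zip(ar_like, lstm_like)]

print(f"learned weight on the AR-like forecast: {w:.3f}")
print(f"MAPE AR-like:   {mape(actual, ar_like):.3f}%")
print(f"MAPE LSTM-like: {mape(actual, lstm_like):.3f}%")
print(f"MAPE blended:   {mape(actual, blended):.3f}%")
```

In this toy setting the learned weight can be read directly as the share of the forecast attributed to the AR-like component, which hints at why a learned mixing weight can aid interpretability. In a full neural-network version the two blocks and the weight would presumably be trained jointly rather than blended after the fact.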