Improving Covariance-Regularized Discriminant Analysis for EHR-based Predictive Analytics of Diseases

Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, degrades significantly on High Dimension Low Sample Size (HDLSS) data, where inverting the sample covariance matrix becomes an ill-posed problem. Existing approaches to HDLSS classification typically assume that the data follow a Gaussian distribution and address the problem through regularization. Such assumptions are too strict to hold in many emerging real-life applications, such as personalized predictive analysis using Electronic Health Records (EHR) data collected from an extremely limited number of patients who have or have not been diagnosed with the target disease. In this paper, we revisit the problem of predicting disease from personal EHR data with an LDA classifier. To fill the gap, we first study an analytical model that characterizes the accuracy of LDA when classifying data with an arbitrary distribution. The model gives a theoretical upper bound on the LDA error rate that is controlled by two factors: (1) the statistical convergence rate of the (inverse) covariance matrix estimators and (2) the divergence of the training/testing datasets from the fitted distributions. The error rate can therefore be lowered by balancing these two factors. Building on this, we further propose a novel LDA classifier, De-Sparse, that leverages the De-sparsified Graphical Lasso to improve the estimation of LDA and outperforms state-of-the-art LDA approaches developed for HDLSS data. These advantages are demonstrated by both theoretical analysis and extensive experiments on EHR datasets.

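As a rough illustration of the plug-in idea summarized in the abstract, the sketch below fits a two-class LDA rule whose precision (inverse covariance) matrix is de-sparsified before use. It is not the authors' implementation. The names `desparsify`, `fit_lda`, `predict_lda`, and the `ridge` parameter are hypothetical, and a ridge-regularized inverse stands in for the Graphical Lasso estimate so that the example needs only NumPy. The de-sparsification step uses the form 2Θ̂ − Θ̂Σ̂Θ̂, which is assumed here to match the estimator the paper relies on.

```python
import numpy as np


def desparsify(theta_hat, sigma_hat):
    """De-sparsified precision estimate: 2*Theta - Theta @ Sigma @ Theta."""
    return 2.0 * theta_hat - theta_hat @ sigma_hat @ theta_hat


def fit_lda(X, y, ridge=0.5):
    """Fit a two-class LDA rule with a de-sparsified precision matrix.

    Assumption: a ridge-regularized inverse replaces the Graphical Lasso
    estimate; this is a stand-in, not the method from the paper.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class sample covariance (singular when p > n).
    centered = np.vstack([X0 - mu0, X1 - mu1])
    sigma_hat = centered.T @ centered / (len(X) - 2)
    # Hypothetical initial precision estimate (ridge-regularized inverse).
    theta_hat = np.linalg.inv(sigma_hat + ridge * np.eye(X.shape[1]))
    theta = desparsify(theta_hat, sigma_hat)
    # Linear discriminant: predict class 1 when x @ w + b > 0.
    w = theta @ (mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b


def predict_lda(X, w, b):
    """Return 0/1 labels from the fitted linear rule."""
    return (X @ w + b > 0).astype(int)


# Synthetic HDLSS-style data: more features (p) than training samples (n).
rng = np.random.default_rng(0)
n_per_class, p = 20, 50
X_train = np.vstack([
    rng.normal(0.0, 1.0, (n_per_class, p)),
    rng.normal(3.0, 1.0, (n_per_class, p)),
])
y_train = np.repeat([0, 1], n_per_class)
w, b = fit_lda(X_train, y_train)

X_test = np.vstack([
    rng.normal(0.0, 1.0, (100, p)),
    rng.normal(3.0, 1.0, (100, p)),
])
y_test = np.repeat([0, 1], 100)
pred = predict_lda(X_test, w, b)
accuracy = (pred == y_test).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```

To get closer to the setting the abstract describes, one could replace the ridge-regularized inverse with a sparse estimate such as the `precision_` attribute of scikit-learn's `sklearn.covariance.GraphicalLasso`, then apply `desparsify` to it in the same way.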