Directly Handling Missing Data in Linear Discriminant Analysis for Enhancing Classification Accuracy and Interpretability

As Artificial Intelligence (AI) models move into critical real-world applications, explainability becomes essential, particularly in sensitive fields such as medicine and finance. Linear Discriminant Analysis (LDA) remains a popular classifier because it is interpretable: it models the class distributions explicitly and separates classes through linear combinations of the features. Real-world datasets, however, are often incomplete, which challenges both classification accuracy and model interpretability. In this paper, we introduce Weighted missing Linear Discriminant Analysis (WLDA), a robust classification method that extends LDA to datasets with missing values without requiring imputation. WLDA incorporates a weight matrix that penalizes missing entries, so that parameters are estimated directly from the incomplete data. The method preserves the interpretability of LDA while significantly improving classification performance when data are missing. We provide a theoretical analysis that establishes the properties of WLDA and evaluate its explainability. Experiments across various datasets show that WLDA consistently outperforms traditional methods, especially when missing values are prevalent in both the training and test sets. WLDA thus offers a tool for improving classification accuracy while maintaining model transparency on incomplete data.
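
To make the idea of estimating LDA parameters directly on incomplete data concrete, the following is a minimal sketch and not the WLDA algorithm itself, whose weight matrix and estimators are not specified in this abstract. It assumes a simple binary weighting (weight 1 for an observed entry, 0 for a missing one, with missing values encoded as NaN), estimates class means and a pooled covariance from observed entries only, and classifies each sample using only its observed features. The names fit_masked_lda and predict_masked_lda and the ridge parameter are hypothetical, chosen for illustration.

```python
import numpy as np


def fit_masked_lda(X, y):
    """Estimate priors, class means and a pooled covariance from observed entries.

    Assumption: missing values are NaN and the weights are a 0/1 mask.
    This illustrates imputation-free estimation; it is not the paper's WLDA.
    """
    classes = np.unique(y)
    n, p = X.shape
    observed = ~np.isnan(X)                 # True where an entry is present
    means = np.zeros((len(classes), p))
    priors = np.zeros(len(classes))
    cross = np.zeros((p, p))                # sums of centered cross-products
    counts = np.zeros((p, p))               # how many samples observe each pair
    for k, c in enumerate(classes):
        Xc, Mc = X[y == c], observed[y == c]
        priors[k] = len(Xc) / n
        # Per-feature mean over the samples that actually observe the feature.
        means[k] = np.nansum(Xc, axis=0) / np.maximum(Mc.sum(axis=0), 1)
        # Centered data with missing entries zeroed so they add nothing.
        D = np.where(Mc, Xc - means[k], 0.0)
        cross += D.T @ D
        Mf = Mc.astype(float)
        counts += Mf.T @ Mf
    cov = cross / np.maximum(counts, 1)     # pairwise-available pooled covariance
    return classes, means, priors, cov


def predict_masked_lda(model, X, ridge=1e-6):
    """Score each sample with the LDA discriminant restricted to observed features."""
    classes, means, priors, cov = model
    preds = []
    for x in X:
        o = ~np.isnan(x)
        if not o.any():
            # Nothing observed: fall back to the most probable class a priori.
            preds.append(classes[np.argmax(priors)])
            continue
        # Small ridge term: a pairwise covariance need not be well conditioned.
        cov_oo = cov[np.ix_(o, o)] + ridge * np.eye(int(o.sum()))
        diff = x[o] - means[:, o]                   # shape (n_classes, n_observed)
        sol = np.linalg.solve(cov_oo, diff.T)       # shape (n_observed, n_classes)
        maha = np.sum(diff.T * sol, axis=0)         # squared Mahalanobis distances
        scores = np.log(priors) - 0.5 * maha
        preds.append(classes[np.argmax(scores)])
    return np.array(preds)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (200, 3)), rng.normal(4.0, 1.0, (200, 3))])
    y = np.array([0] * 200 + [1] * 200)
    X[rng.random(X.shape) < 0.2] = np.nan           # remove about 20% of entries
    model = fit_masked_lda(X, y)
    print("training accuracy:", (predict_masked_lda(model, X) == y).mean())
```

Because the fitted quantities are still class means and one shared covariance matrix, the decision rule stays linear in the observed features, which is the property that makes LDA interpretable. A graded weight matrix that penalizes missing entries, as the abstract describes, would replace the 0/1 mask used here.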
