Visualizing the correlation matrix is essential for understanding the relationships between variables in a dataset, but missing data can make correlation coefficients difficult to estimate reliably. In this paper, we compare the effects of various missing-data methods on the correlation plot, focusing on two common missing patterns: random and monotone. We aim to provide practical strategies and recommendations for researchers and practitioners who create and analyze correlation plots. Our experimental results suggest that, although imputation is commonly used to handle missing data, plotting the correlation matrix from imputed data may lead to significantly misleading inferences about the relationships between features. Based on its performance in these experiments, we recommend DPER, a direct parameter estimation approach, for plotting the correlation matrix.
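To make the concern about imputed data concrete, the following is a minimal sketch, not the experimental setup of the paper and not an implementation of DPER. It assumes two standard normal features with a true correlation of 0.8, deletes 40% of the second feature completely at random, and compares the correlation estimated from complete cases with the one obtained after mean imputation. The function names, the sample size, the missing rate, and the choice of mean imputation are all illustrative assumptions.

```python
import numpy as np


def simulate(rho=0.8, n=5000, missing_rate=0.4, seed=0):
    """Draw correlated Gaussian pairs and blank out part of the second column.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    # Missing completely at random: each entry of column 1 is dropped
    # independently with probability `missing_rate`.
    mask = rng.random(n) < missing_rate
    observed = data.copy()
    observed[mask, 1] = np.nan
    return observed


def corr_complete_cases(x):
    """Pearson correlation using only rows with no missing entries."""
    keep = ~np.isnan(x).any(axis=1)
    return np.corrcoef(x[keep], rowvar=False)[0, 1]


def corr_mean_imputed(x):
    """Pearson correlation after replacing each NaN with its column mean."""
    filled = np.where(np.isnan(x), np.nanmean(x, axis=0), x)
    return np.corrcoef(filled, rowvar=False)[0, 1]


x = simulate()
r_complete = corr_complete_cases(x)
r_imputed = corr_mean_imputed(x)
print(f"complete cases: {r_complete:.3f}")
print(f"mean imputed:   {r_imputed:.3f}")
```

In this toy setting the mean-imputed estimate is pulled toward zero, because the imputed entries contribute nothing to the covariance while still counting toward the sample size. A heatmap drawn from the imputed data would therefore understate the strength of the relationship. Other imputation methods and missing patterns can behave differently, so this illustrates only the general risk.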