Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can make correlation coefficients difficult to estimate. In this paper, we compare the effects of several missing-data methods on the correlation plot, focusing on two common missingness patterns: random and monotone. We aim to provide practical strategies and recommendations for researchers and practitioners who create and analyze correlation plots. Our experimental results suggest that, although imputation is commonly used to handle missing data, plotting the correlation matrix from imputed data may lead to significantly misleading inferences about the relationships between features. Based on its performance in our experiments, we recommend DPER, a direct parameter estimation approach, for plotting the correlation matrix.
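To make the concern about imputed data concrete, the following is a minimal sketch, not the paper's experimental setup. It assumes a synthetic bivariate Gaussian with true correlation 0.8, entries deleted independently at random, and simple mean imputation as a stand-in for imputation in general. The pairwise-complete estimate is used only as an easy baseline; it is not DPER, and all function names and parameters here are hypothetical.

```python
import numpy as np


def simulate(n=5000, rho=0.8, miss_rate=0.3, seed=0):
    """Draw correlated Gaussian pairs and delete entries at random (assumed setup)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # Each entry is removed independently with probability miss_rate.
    mask = rng.random(x.shape) < miss_rate
    x_missing = x.copy()
    x_missing[mask] = np.nan
    return x_missing


def corr_mean_imputed(x):
    """Correlation after replacing each missing entry with its column mean."""
    col_means = np.nanmean(x, axis=0)
    filled = np.where(np.isnan(x), col_means, x)
    return np.corrcoef(filled, rowvar=False)[0, 1]


def corr_pairwise_complete(x):
    """Correlation computed only from rows where both variables are observed."""
    both_observed = ~np.isnan(x).any(axis=1)
    return np.corrcoef(x[both_observed], rowvar=False)[0, 1]


x = simulate()
r_imputed = corr_mean_imputed(x)
r_pairwise = corr_pairwise_complete(x)
print(f"mean-imputed: {r_imputed:.3f}, pairwise-complete: {r_pairwise:.3f}")
```

In this toy setting, mean imputation shrinks the estimated correlation toward zero, because every imputed entry sits at the column mean and contributes nothing to the covariance. More sophisticated imputation methods may behave differently, so this illustrates the mechanism only and does not reproduce the paper's results.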