Blockwise Principal Component Analysis for monotone missing data imputation and dimensionality reduction

Monotone missing data is a common problem in data analysis. However, imputation combined with dimensionality reduction can be computationally expensive, especially with the increasing size of datasets. To address this issue, we propose a Blockwise principal component analysis Imputation (BPI) framework for dimensionality reduction and imputation of monotone missing data. The framework conducts Principal Component Analysis (PCA) on the observed part of each monotone block of the data and then imputes on merging the obtained principal components using a chosen imputation technique. BPI can work with various imputation techniques and can significantly reduce imputation time compared to conducting dimensionality reduction after imputation. This makes it a practical and efficient approach for large datasets with monotone missing data. Our experiments validate the improvement in speed. In addition, our experiments also show that while applying MICE imputation directly on missing data may not yield convergence, applying BPI with MICE for the data may lead to convergence.

翻译：单调缺失数据是数据分析中的常见问题。然而，插补与降维相结合的方法计算成本较高，尤其在数据集规模不断增大的情况下。为解决此问题，我们提出一种分块主成分分析插补（BPI）框架，用于对单调缺失数据进行降维与插补。该框架对数据各单调分块的可观测部分进行主成分分析（PCA），然后通过所选插补技术合并所获得的主成分完成插补。BPI可与多种插补技术协同工作，相比先插补后降维的方法，能显著缩短插补时间。因此，对于存在单调缺失数据的大规模数据集，这是一种实用且高效的方案。实验验证了其速度提升效果。此外，实验还表明，直接对缺失数据应用MICE插补可能无法收敛，而采用BPI与MICE组合处理数据则可实现收敛。

相关内容

PCA

关注 3

在统计中，主成分分析（PCA）是一种通过最大化每个维度的方差来将较高维度空间中的数据投影到较低维度空间中的方法。给定二维，三维或更高维空间中的点集合，可以将“最佳拟合”线定义为最小化从点到线的平均平方距离的线。可以从垂直于第一条直线的方向类似地选择下一条最佳拟合线。重复此过程会产生一个正交的基础，其中数据的不同单个维度是不相关的。这些基向量称为主成分。

《生成式模型: 变分自编码器与扩散模型》，75页ppt，Google DeepMind科学家Ruiqi Gao

专知会员服务

66+阅读 · 2023年6月10日

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【ACL2020】多模态信息抽取，365页ppt

专知会员服务

151+阅读 · 2020年7月6日