PCA-Guided Quantile Sampling: Preserving Data Structure in Large-Scale Subsampling

We introduce Principal Component Analysis-guided Quantile Sampling (PCA-QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large-scale datasets. Unlike conventional PCA, which reduces dimensionality at the cost of interpretability, PCA-QS retains the original feature space while using leading principal components solely to guide a quantile-based stratification scheme. This principled design ensures that sampling remains representative without distorting the underlying data semantics. We establish rigorous theoretical guarantees, deriving convergence rates for empirical quantiles, Kullback-Leibler divergence, and Wasserstein distance, thus quantifying the distributional fidelity of PCA-QS samples. Practical guidelines for selecting the number of principal components, quantile bins, and sampling rates are provided based on these results. Extensive empirical studies on both synthetic and real-world datasets show that PCA-QS consistently outperforms simple random sampling, yielding better structure preservation and improved downstream model performance. Together, these contributions position PCA-QS as a scalable, interpretable, and theoretically grounded solution for efficient data summarization in modern machine learning workflows.
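The core idea in the abstract (project onto leading principal components, stratify by quantiles of those projections, then sample rows in the original feature space) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how bins from several components are combined or how samples are allocated across strata, so the sketch assumes a simple cross-product of per-component quantile bins and proportional allocation. The function name `pca_quantile_sample` and its parameters are hypothetical.

```python
import numpy as np

def pca_quantile_sample(X, n_components=2, n_bins=4, rate=0.1, seed=0):
    """Return sorted row indices of a stratified subsample of X.

    Hypothetical helper: the stratum construction and the proportional
    allocation are assumptions, not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Principal directions are the right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T  # shape (n_rows, n_components)

    # Give each row a quantile bin per component, then fold the bins
    # into a single integer stratum id (cross-product of bins).
    strata = np.zeros(len(X), dtype=np.int64)
    interior = np.linspace(0, 1, n_bins + 1)[1:-1]
    for j in range(n_components):
        edges = np.quantile(scores[:, j], interior)
        bins = np.searchsorted(edges, scores[:, j], side="right")  # 0..n_bins-1
        strata = strata * n_bins + bins

    # Draw roughly the same fraction from every non-empty stratum.
    chosen = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        k = max(1, int(round(rate * len(members))))
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

# Usage on synthetic data: the components only guide the stratification,
# and the sampled rows keep all original features.
X = np.random.default_rng(1).normal(size=(2000, 5))
idx = pca_quantile_sample(X, n_components=2, n_bins=4, rate=0.1)
sample = X[idx]
```

Because the principal components are used only to define strata, `sample` has the same columns as `X`, which matches the abstract's point that the original feature space is retained.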


