Principal Component Analysis (PCA) and its nonlinear extension, Kernel PCA (KPCA), are widely used across science and industry for data analysis and dimensionality reduction. Modern deep learning tools have achieved great empirical success, but a framework for deep principal component analysis is still lacking. Here we develop a deep kernel PCA methodology (DKPCA) to extract multiple levels of the most informative components of the data. Our scheme effectively identifies new hierarchical variables, called deep principal components, that capture the main characteristics of high-dimensional data through a simple and interpretable numerical optimization. We couple the principal components of multiple KPCA levels and show theoretically that DKPCA creates both forward and backward dependency across levels, which has not been explored in kernel methods and yet is crucial for extracting more informative features. Experimental evaluations on multiple data types show that DKPCA finds more efficient and disentangled representations than shallow KPCA, with higher explained variance in fewer principal components. We demonstrate that our method allows for effective hierarchical data exploration and can separate the key generative factors of the input data, both for large datasets and when few training samples are available. Overall, DKPCA facilitates the extraction of useful patterns from high-dimensional data by learning more informative features organized in different levels, offering diverse perspectives from which to explore the factors of variation in the data, while maintaining a simple mathematical formulation.
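For readers who want a concrete reference point, the following is a minimal sketch of the shallow KPCA baseline that the abstract compares against, together with the explained-variance ratio used as the comparison metric. It is not the DKPCA method itself, which couples several such levels and is not specified in this abstract. The RBF kernel, the value of `gamma`, the random data, and the helper names `rbf_kernel` and `kernel_pca` are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Pairwise RBF kernel matrix; the kernel choice is an assumption."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_pca(X, n_components, gamma=1.0):
    """Single-level (shallow) kernel PCA; hypothetical helper, not DKPCA."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)

    # Center the kernel matrix, i.e. center the data in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    # Fraction of total feature-space variance carried by each component.
    explained_ratio = eigvals / eigvals.sum()

    # Projections of the training points onto the leading components.
    scores = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    return scores, explained_ratio[:n_components]


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, ratio = kernel_pca(X, n_components=3, gamma=0.1)
print("cumulative explained variance:", np.cumsum(ratio))
```

Under these assumptions, a claim of "higher explained variance in fewer principal components" corresponds to the cumulative sum of `ratio` rising faster for one representation than for another.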