Feature extraction and selection in the presence of nonlinear dependencies among the data is a fundamental challenge in unsupervised learning. We propose using a Gram-Schmidt (GS) type orthogonalization process over function spaces to detect and map out such dependencies. Specifically, by applying the GS process over some family of functions, we construct a series of covariance matrices that can be used either to identify new large-variance directions, or to remove those dependencies from known directions. In the former case, we provide information-theoretic guarantees in terms of entropy reduction. In the latter, we provide precise conditions under which the chosen function family eliminates existing redundancy in the data. Each approach yields both a feature extraction and a feature selection algorithm. Our feature extraction methods are linear and can be seen as a natural generalization of principal component analysis (PCA). We provide experimental results on synthetic and real-world benchmark datasets which show superior performance over state-of-the-art (linear) feature extraction and selection algorithms. Surprisingly, our linear feature extraction algorithms are comparable to, and often outperform, several important nonlinear feature extraction methods such as autoencoders, kernel PCA, and UMAP. Furthermore, one of our feature selection algorithms strictly generalizes a recent Fourier-based feature selection mechanism (Heidari et al., IEEE Transactions on Information Theory, 2022), yet at significantly reduced complexity.
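The core mechanism described above, orthogonalizing a family of functions of the data under the empirical inner product and then examining the resulting covariance structure, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' algorithm: the toy data (y roughly equal to x squared plus noise), the hypothetical function family (the coordinates plus degree-2 monomials), and the helper names `family` and `gram_schmidt` are all choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with a nonlinear dependency: the second coordinate is
# (approximately) a quadratic function of the first.
n = 2000
x = rng.normal(size=n)
y = x**2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x, y])

# Hypothetical function family: the raw coordinates plus degree-2 monomials.
# Column order matters: y comes after x and x^2, so the GS residual of y
# has its dependency on {x, x^2} removed.
def family(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x1**2, x2, x1 * x2, x2**2])

F = family(X)
F = F - F.mean(axis=0)  # center each function's evaluations

def gram_schmidt(F, tol=1e-10):
    """Orthogonalize the columns of F under the empirical inner
    product <f, g> = (1/n) sum f(x_i) g(x_i), dropping degenerate residuals."""
    Q = []
    for j in range(F.shape[1]):
        v = F[:, j].astype(float)
        for q in Q:
            v = v - (v @ q) / (q @ q) * q
        if np.sqrt(v @ v / len(v)) > tol:
            Q.append(v)
    return np.column_stack(Q)

Q = gram_schmidt(F)

# Covariance of the orthogonalized features: its top eigenvector points
# along a new large-variance direction in the transformed space, while
# small residual variances flag dependencies that the family explains away.
C = np.cov(Q, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
top_direction = eigvecs[:, -1]

# Q[:, 2] is y with its (nonlinear) dependency on {x, x^2} removed; its
# variance collapses to roughly the noise level.
print("variance ratio of y after removing {x, x^2}:",
      np.var(Q[:, 2]) / np.var(y))
```

Note the design choice: a plain linear GS step on the *evaluations* of nonlinear functions is itself linear in the transformed features, which is why the abstract can describe the resulting extraction methods as linear generalizations of PCA.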