Identifying low-dimensional latent structures within high-dimensional data has long been a central topic in the machine learning community, driven by the needs of data compression, storage, and transmission, as well as deeper data understanding. Traditional methods, such as principal component analysis (PCA) and autoencoders (AE), operate in an unsupervised manner, ignoring label information even when it is available. In this work, we introduce a unified method capable of learning latent spaces in both unsupervised and supervised settings. We formulate the problem as a nonlinear multiple-response regression within an index model framework. By applying the generalized Stein's lemma, we estimate the latent space without requiring knowledge of the nonlinear link functions. Our method can be viewed as a nonlinear generalization of PCA. Moreover, unlike AE and other neural network methods that operate as "black boxes", our approach not only offers better interpretability but also reduces computational complexity while providing strong theoretical guarantees. Comprehensive numerical experiments and real data analyses demonstrate the superior performance of our method.
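To illustrate the core idea of Stein-type estimation of a latent space, the following is a minimal sketch, not the paper's actual estimator: it assumes Gaussian covariates and uses the second-order Stein identity for a toy index model. The link function `f`, the latent dimension `k`, and the matrix `B` are all hypothetical names introduced for illustration.

```python
# Minimal sketch: recovering a latent subspace via the second-order
# Stein identity, WITHOUT knowing the nonlinear link function.
# Assumptions (not from the paper): x ~ N(0, I_p), single response
# y = f(B x) + noise, with B a k x p matrix of orthonormal rows.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20000, 10, 2

# Ground-truth latent directions: rows of B span the latent space.
B = np.linalg.qr(rng.standard_normal((p, k)))[0].T        # k x p

def f(z):
    # An arbitrary "unknown" nonlinear link acting on the k latent indices.
    return np.cos(z[:, 0]) + z[:, 1] ** 2

X = rng.standard_normal((n, p))
y = f(X @ B.T) + 0.1 * rng.standard_normal(n)

# Second-order Stein identity for Gaussian x:
#   E[y (x x^T - I)] = B^T E[Hessian of f] B,
# so the top-k eigenvectors (by magnitude) of the sample version
# span an estimate of the latent space, with f never evaluated.
M = (X.T * y) @ X / n - y.mean() * np.eye(p)
eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(np.abs(eigvals))[::-1]
B_hat = eigvecs[:, order[:k]]                             # p x k estimate

# Subspace recovery error via projection matrices (0 = exact recovery).
P_true = B.T @ B
P_hat = B_hat @ B_hat.T
print("subspace error:", np.linalg.norm(P_true - P_hat, ord=2))
```

The design choice mirrored here is the abstract's key point: the moment matrix `M` depends on the data only through products of `y` with score-like functions of `x`, so the latent space is identified without fitting or even specifying the link function.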