Dimension reduction techniques usually lose information, in the sense that the reconstructed data are not identical to the original data. However, we argue that it is possible for the reconstructed data to have the same distribution as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distribution of the data given its low-dimensional latent variables. Motivated by this, we propose the Distributional Principal Autoencoder (DPA), which consists of an encoder that maps high-dimensional data to low-dimensional latent variables and a decoder that maps the latent variables back to the data space. For reducing the dimension, the DPA encoder aims to minimise the unexplained variability of the data with an adaptive choice of the latent dimension. For reconstructing data, the DPA decoder aims to match the conditional distribution of all data that are mapped to a certain latent value, thus ensuring that the reconstructed data retain the original data distribution. Our numerical results on climate data, single-cell data, and image benchmarks demonstrate the practical feasibility and success of the approach in reconstructing the original distribution of the data. DPA embeddings are shown to preserve meaningful structures of the data, such as the seasonal cycle for precipitation and cell types for gene expression.
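The claim that reconstructions can share the distribution of the original data, even after reducing the dimension, can be illustrated with a toy example. The sketch below is not the DPA training procedure: it assumes a two-dimensional Gaussian with known correlation `rho`, a hand-written encoder that keeps only the first coordinate, and a conditional distribution that is known in closed form rather than learned. All names (`simulate`, `recon_mean`, `recon_dist`) are hypothetical. It compares a decoder that returns the conditional mean with one that samples from the conditional distribution given the latent value.

```python
import numpy as np


def simulate(n=100_000, rho=0.6, seed=0):
    """Toy comparison of mean reconstruction vs. distributional reconstruction.

    Assumption: data are bivariate Gaussian with unit variances and
    correlation rho, so X2 | X1 = z is N(rho * z, 1 - rho**2).
    """
    rng = np.random.default_rng(seed)

    # Original data: two correlated standard normal coordinates.
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    x = np.column_stack([x1, x2])

    # Toy encoder (an assumption for illustration): keep the first coordinate.
    z = x[:, 0]

    # Decoder A: conditional mean E[X | z]. The second coordinate loses variance.
    recon_mean = np.column_stack([z, rho * z])

    # Decoder B: a fresh sample from the conditional distribution of X given z.
    eps = rng.standard_normal(n)
    recon_dist = np.column_stack([z, rho * z + np.sqrt(1 - rho**2) * eps])

    return x, recon_mean, recon_dist


x, recon_mean, recon_dist = simulate()

print("original covariance:\n", np.cov(x, rowvar=False))
print("mean-decoder covariance:\n", np.cov(recon_mean, rowvar=False))
print("distributional-decoder covariance:\n", np.cov(recon_dist, rowvar=False))
```

In this toy setting neither decoder reproduces the individual data points, but the sampled reconstruction has approximately the same covariance as the original data, whereas the conditional-mean reconstruction has second-coordinate variance close to `rho**2` instead of 1.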