We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
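To make the encode/decode pipeline concrete, the toy sketch below builds a forest-based proximity matrix, spectrally embeds it, and decodes with k-nearest-neighbor regression. The specific choices here (leaf co-occurrence proximity, a normalized-affinity eigendecomposition, sklearn's `RandomTreesEmbedding` and `KNeighborsRegressor`) are illustrative simplifications, not the exact algorithm proposed in the paper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.neighbors import KNeighborsRegressor

X = load_iris().data

# Encode: an unsupervised forest assigns each sample to one leaf per tree.
forest = RandomTreesEmbedding(n_estimators=100, random_state=0).fit(X)
leaves = forest.apply(X)  # shape (n_samples, n_trees)

# Proximity: fraction of trees in which two samples land in the same leaf.
P = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Spectral embedding: top non-trivial eigenvectors of the normalized affinity.
D = P.sum(axis=1)
A = P / np.sqrt(np.outer(D, D))
vals, vecs = np.linalg.eigh(A)
Z = vecs[:, -3:-1]  # 2-D embedding, skipping the trivial leading eigenvector

# Decode: k-nearest-neighbor regression from the embedding back to inputs,
# a simple stand-in for the paper's exact and approximate decoders.
decoder = KNeighborsRegressor(n_neighbors=5).fit(Z, X)
X_hat = decoder.predict(Z)
mse = np.mean((X - X_hat) ** 2)
```

In this sketch the forest plays the encoder's role (inputs to leaves to proximities to embedding) and the kNN regressor inverts it, mirroring the abstract's nearest-neighbors route back from the embedding space to the input space.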