The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a fixed set of points into a latent space, but cannot generalize to an unseen test set. In this paper, we provide an out-of-sample extension method for RF-PHATE, a random forest-based supervised dimensionality reduction method, that combines information learned from the random forest model with the function-learning capabilities of autoencoders. Through quantitative assessment of various autoencoder architectures, we find that networks that reconstruct random forest proximities are more robust for the embedding extension problem. Furthermore, by leveraging proximity-based prototypes, we achieve a 40% reduction in training time without compromising extension quality. Our method does not require label information for out-of-sample points, so it can serve as a semi-supervised method, and it can achieve consistent quality using only 10% of the training data.
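To make the notion of a random forest proximity concrete, the following is a minimal sketch of the classical leaf co-occurrence proximity: the fraction of trees in which two points fall in the same leaf. This is an illustrative assumption and not necessarily the proximity definition used by RF-PHATE. The function name `leaf_proximity` and the toy leaf indices are hypothetical. In practice, leaf indices of this shape can be obtained from a fitted scikit-learn forest with `forest.apply(X)`, which needs only features, not labels, for new points.

```python
import numpy as np


def leaf_proximity(leaves_a, leaves_b):
    """Classical random forest proximity between two sets of points.

    Each input has shape (n_points, n_trees) and holds the index of the
    leaf that each point reaches in each tree. Entry (i, j) of the result
    is the fraction of trees in which point i of `leaves_a` and point j
    of `leaves_b` land in the same leaf.
    """
    leaves_a = np.asarray(leaves_a)
    leaves_b = np.asarray(leaves_b)
    # Broadcast to (n_a, n_b, n_trees) and compare leaf indices tree by tree.
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return same_leaf.mean(axis=2)


# Hypothetical leaf indices for 3 training points and 1 unseen point
# across 4 trees (made-up values, for illustration only).
train_leaves = np.array([
    [1, 2, 3, 1],
    [1, 2, 4, 1],
    [5, 6, 4, 2],
])
new_leaves = np.array([[1, 6, 4, 1]])

# Proximities among training points, and from the unseen point to each
# training point. No label is needed for the unseen point.
prox_train = leaf_proximity(train_leaves, train_leaves)
prox_new = leaf_proximity(new_leaves, train_leaves)
print(prox_new)
```

A proximity row such as `prox_new` is the kind of label-free representation of an unseen point that an encoder network could map to embedding coordinates. The autoencoder architectures themselves are not specified in the abstract and are not sketched here.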