Variational Autoencoders (VAEs) and their many variants have shown an impressive ability to perform dimensionality reduction, often achieving state-of-the-art performance. Many current methods, however, struggle to learn good representations in high-dimensional, low-sample-size (HDLSS) tasks, an inherently challenging setting. We address this challenge with a novel divide-and-conquer approach: an ensemble of lightweight VAEs learns posteriors over subsets of the feature space, and these posteriors are aggregated into a joint posterior. Specifically, we present an alternative factorisation of the joint posterior that induces a form of implicit data augmentation and thereby yields greater sample efficiency. Through a series of experiments on eight real-world datasets, we show that our method learns better latent representations in HDLSS settings, which leads to higher accuracy in a downstream classification task. Furthermore, we verify that our approach has a positive effect on disentanglement and achieves a lower estimated Total Correlation on the learnt representations. Finally, we show that our approach is robust to partial features at inference time, exhibiting little performance degradation even when most features are missing.
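The abstract does not state how the per-subset posteriors are combined, so the following is only a minimal sketch of the general divide-and-conquer idea, not the factorisation proposed in this work. It assumes that each lightweight VAE encoder outputs a diagonal Gaussian posterior over a shared latent space, and that these are merged by a precision-weighted product of Gaussians together with a standard normal prior. The function names `split_features` and `aggregate_gaussian_posteriors`, and the `mask` argument, are hypothetical and chosen for illustration.

```python
import numpy as np


def split_features(n_features, n_subsets):
    """Partition feature indices into roughly equal contiguous subsets.

    Hypothetical helper: each subset would be fed to its own lightweight encoder.
    """
    return np.array_split(np.arange(n_features), n_subsets)


def aggregate_gaussian_posteriors(mus, logvars, mask=None):
    """Merge per-subset diagonal Gaussian posteriors into one joint Gaussian.

    Assumption (not taken from the abstract): a product of Gaussian experts
    with a standard normal prior expert (mean 0, precision 1).

    mus, logvars: arrays of shape (n_subsets, latent_dim)
    mask: optional boolean array of shape (n_subsets,); False marks a subset
          whose features are unavailable and is therefore skipped.
    Returns the joint mean and joint variance, each of shape (latent_dim,).
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)
    if mask is None:
        mask = np.ones(mus.shape[0], dtype=bool)
    mask = np.asarray(mask, dtype=bool)

    # Precision is the inverse variance of each available expert.
    precisions = np.exp(-logvars[mask])

    # The prior expert contributes precision 1 and mean 0, which also keeps
    # the result well defined when every subset is masked out.
    total_precision = 1.0 + precisions.sum(axis=0)
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * (precisions * mus[mask]).sum(axis=0)
    return joint_mu, joint_var
```

Under these assumptions, passing a `mask` with some entries set to `False` simply drops the corresponding experts from the product, which gives one plausible reading of how inference could proceed when only part of the feature space is observed.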