Disentanglement is the endeavour to use machine learning to divide information about a dataset into meaningful fragments. In practice these fragments are representation (sub)spaces, often the set of channels in the latent space of a variational autoencoder (VAE). Assessments of disentanglement predominantly employ metrics that are coarse-grained at the model level, but this approach can obscure much about the process of information fragmentation. Here we propose to study the learned channels in aggregate, as the fragments of information learned by an ensemble of repeat training runs. Additionally, we depart from prior work where measures of similarity between individual subspaces neglected the nature of data embeddings as probability distributions. Instead, we view representation subspaces as communication channels that perform a soft clustering of the data; consequently, we generalize two classic information-theoretic measures of similarity between clustering assignments to compare representation spaces. We develop a lightweight method of estimation based on fingerprinting representation subspaces by their ability to distinguish dataset samples, allowing us to identify, analyze, and leverage meaningful structure in ensembles of VAEs trained on synthetic and natural datasets. Using this fully unsupervised pipeline we identify "hotspots" in the space of information fragments: groups of nearly identical representation subspaces that appear repeatedly in an ensemble of VAEs, particularly as regularization is increased. Finally, we leverage the proposed methodology to achieve ensemble learning with VAEs, boosting the information content of a set of weak learners -- a capability not possible with previous methods of assessing channel similarity.
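The fingerprinting idea described above can be illustrated with a minimal sketch. The abstract does not specify the fingerprint or the estimator, so everything here is an assumption made for illustration: each channel is treated as a diagonal-Gaussian posterior per sample, the fingerprint is taken to be the matrix of pairwise Bhattacharyya coefficients between those posteriors, and the information a channel carries about sample identity is estimated from how well the fingerprint separates samples. The function names (`bhattacharyya_fingerprint`, `info_estimate`, `nmi_and_vi`) are hypothetical.

```python
import numpy as np

def bhattacharyya_fingerprint(mu, logvar):
    """Pairwise Bhattacharyya coefficients between diagonal-Gaussian posteriors.

    mu, logvar: arrays of shape (N, D), one posterior per dataset sample.
    Returns an (N, N) matrix with entries in [0, 1]; 1 means two samples
    are indistinguishable in this channel, 0 means fully distinguishable.
    """
    var = np.exp(logvar)
    avg_var = 0.5 * (var[:, None, :] + var[None, :, :])
    diff2 = (mu[:, None, :] - mu[None, :, :]) ** 2
    sum_logvar = logvar[:, None, :] + logvar[None, :, :]
    # Bhattacharyya distance per dimension, summed over dimensions.
    dist = 0.125 * diff2 / avg_var + 0.5 * (np.log(avg_var) - 0.5 * sum_logvar)
    return np.exp(-dist.sum(axis=-1))

def info_estimate(bc):
    """Assumed estimator of information about sample identity, in nats.

    Ranges from 0 (all samples overlap) to log N (all samples separated).
    """
    return float(-np.mean(np.log(bc.mean(axis=1))))

def nmi_and_vi(bc_u, bc_v):
    """Soft-clustering analogues of NMI and variation of information (VI).

    The joint channel's fingerprint is assumed to be the elementwise
    product of the two fingerprints (the channels are conditionally
    independent given the sample).
    """
    i_u, i_v = info_estimate(bc_u), info_estimate(bc_v)
    i_joint = info_estimate(bc_u * bc_v)
    shared = i_u + i_v - i_joint
    denom = np.sqrt(i_u * i_v)
    nmi = shared / denom if denom > 0 else 0.0
    vi = i_u + i_v - 2.0 * shared
    return nmi, vi

# Toy example: four samples, two channels that each split the data in half
# along unrelated groupings, so they share no information.
logvar = np.zeros((4, 1))
bc_u = bhattacharyya_fingerprint(np.array([[0.0], [0.0], [100.0], [100.0]]), logvar)
bc_v = bhattacharyya_fingerprint(np.array([[0.0], [100.0], [0.0], [100.0]]), logvar)
nmi_uv, vi_uv = nmi_and_vi(bc_u, bc_v)
nmi_uu, vi_uu = nmi_and_vi(bc_u, bc_u)
```

In this toy case the two channels partition the samples along unrelated groupings, so the normalized similarity between them is zero, while comparing a channel with itself gives a similarity of one. With overlapping posteriors the estimates fall between these extremes, which is the regime where a soft-clustering view differs from a hard-clustering one.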