Although the variational autoencoder (VAE) and its conditional extension (CVAE) are capable of state-of-the-art results across multiple domains, their precise behavior is still not fully understood, particularly when the data (such as images) lie on or near a low-dimensional manifold. For example, while prior work has suggested that the globally optimal VAE solution can learn the correct manifold dimension, a necessary (but not sufficient) condition for producing samples from the true data distribution, this has never been rigorously proven. Moreover, it remains unclear how such considerations change when various types of conditioning variables are introduced, or when the data support is extended to a union of manifolds (as is likely the case for MNIST digits and related datasets). In this work, we address these points by first proving that VAE global minima are indeed capable of recovering the correct manifold dimension. We then extend this result to more general CVAEs, demonstrating practical scenarios in which the conditioning variables allow the model to adaptively learn manifolds of varying dimension across samples. Our analyses, which have practical implications for various CVAE design choices, are also supported by numerical results on both synthetic and real-world datasets.