Although the variational autoencoder (VAE) and its conditional extension (CVAE) are capable of state-of-the-art results across multiple domains, their precise behavior is still not fully understood, particularly for data (such as images) that lie on or near a low-dimensional manifold. For example, prior work has suggested that the globally optimal VAE solution can learn the correct manifold dimension, a necessary (but not sufficient) condition for producing samples from the true data distribution, but this has never been rigorously proven. Moreover, it remains unclear how such considerations change when various types of conditioning variables are introduced, or when the data support is extended to a union of manifolds (as is likely the case for MNIST digits and related datasets). In this work, we address these points by first proving that VAE global minima are indeed capable of recovering the correct manifold dimension. We then extend this result to more general CVAEs, demonstrating practical scenarios in which the conditioning variables allow the model to adaptively learn manifolds of varying dimension across samples. Our analyses, which have practical implications for various CVAE design choices, are also supported by numerical results on both synthetic and real-world datasets.