Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not admit an exact differentiable parameterization; discrete VAEs therefore typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimators, or on high-variance gradient-free methods such as REINFORCE, which have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step-size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces, achieving a 20% improvement in FID score on ImageNet 256.
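For context, the Gumbel-Softmax relaxation mentioned above replaces non-differentiable categorical sampling with a temperature-controlled softmax over noisy logits. The sketch below is a minimal NumPy illustration of that standard trick, not the paper's method; the function name and temperature value are our own choices.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed (differentiable) one-hot sample via the Gumbel-Softmax trick.

    Illustrative sketch: logits are unnormalized category scores,
    tau is the temperature (lower tau -> sample closer to one-hot).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max()          # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()       # softmax: non-negative entries summing to 1

sample = gumbel_softmax(np.array([1.0, 2.0, 0.5]), tau=0.5)
```

Because the output is a smooth function of the logits, gradients can flow through the sample during training; the bias this relaxation introduces relative to true categorical sampling is exactly what motivates the reparameterization-free approach described in the abstract.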