We introduce generative infinite-vocabulary transformers (GIVT), which generate vector sequences with real-valued entries instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE. When applied to class-conditional image generation with iterative masked modeling, GIVT achieves results competitive with MaskGIT, and it outperforms both VQ-GAN and MaskGIT when used for causal modeling. Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.
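The two modifications described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the transformer body is omitted, the weights are random, the mixture uses diagonal covariances (an assumption made here for simplicity), and all names (`linear`, `gmm_params`, `gmm_nll`, `gmm_sample`, the sizes `d`, `h`, `k`) are hypothetical.

```python
import math
import random


def linear(W, b, x):
    """Affine map W @ x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def softplus(v):
    # Numerically stable log(1 + exp(v)); keeps the scales positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def gmm_params(raw, k, d):
    """Split a raw output of size k + 2*k*d into weights, means, and scales."""
    weights = softmax(raw[:k])
    means = [raw[k + i * d: k + (i + 1) * d] for i in range(k)]
    off = k + k * d
    scales = [[softplus(v) + 1e-6 for v in raw[off + i * d: off + (i + 1) * d]]
              for i in range(k)]
    return weights, means, scales


def gmm_nll(x, weights, means, scales):
    """Negative log-likelihood of x under a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for w, mu, sig in zip(weights, means, scales):
        log_n = sum(-0.5 * ((xi - m) / s) ** 2 - math.log(s)
                    - 0.5 * math.log(2 * math.pi)
                    for xi, m, s in zip(x, mu, sig))
        log_terms.append(math.log(w) + log_n)
    top = max(log_terms)  # log-sum-exp over mixture components
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


def gmm_sample(weights, means, scales, rng):
    """Pick a component by its weight, then sample each dimension independently."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, s) for m, s in zip(means[i], scales[i])]


# Hypothetical sizes: latent dimension d, model width h, mixture components k.
d, h, k = 4, 8, 3
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


W_in, b_in = rand_matrix(h, d), [0.0] * h                        # replaces the lookup table
W_out, b_out = rand_matrix(k + 2 * k * d, h), [0.0] * (k + 2 * k * d)  # replaces the logits layer

latent = [rng.gauss(0.0, 1.0) for _ in range(d)]  # one real-valued VAE latent vector
hidden = linear(W_in, b_in, latent)               # 1) linear projection at the input
# (the transformer layers would transform `hidden` here; omitted in this sketch)
raw = linear(W_out, b_out, hidden)                # 2) GMM parameters at the output
weights, means, scales = gmm_params(raw, k, d)

nll = gmm_nll(latent, weights, means, scales)     # training-style loss for a target vector
sample = gmm_sample(weights, means, scales, rng)  # next real-valued vector at inference
```

In this sketch the negative log-likelihood plays the role that cross-entropy plays for a categorical output, and sampling from the mixture replaces sampling a token index.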