We introduce generative infinite-vocabulary transformers (GIVT), which generate vector sequences with real-valued entries instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a $\beta$-VAE. In class-conditional image generation, GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.
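The two modifications described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sizes `d_latent`, `d_model`, and `k`, the diagonal covariance of each mixture component, the softplus parameterization of the scales, and all function names are assumptions made for illustration. The transformer stack itself is omitted, so the projected inputs stand in for the decoder's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_model, k = 4, 16, 3  # hypothetical sizes, not from the paper

def logsumexp(x, axis=-1, keepdims=False):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return s if keepdims else np.squeeze(s, axis=axis)

# 1) Input side: a linear projection replaces the embedding lookup table.
W_in = rng.normal(scale=0.02, size=(d_latent, d_model))
b_in = np.zeros(d_model)

def embed(z):
    """Map real-valued latents (seq, d_latent) to (seq, d_model)."""
    return z @ W_in + b_in

# 2) Output side: a linear head emits mixture parameters instead of logits.
#    Per position: k mixture logits, k*d_latent means, k*d_latent raw scales.
n_out = k + 2 * k * d_latent
W_out = rng.normal(scale=0.02, size=(d_model, n_out))
b_out = np.zeros(n_out)

def gmm_params(h):
    """Turn hidden states (seq, d_model) into GMM parameters."""
    out = h @ W_out + b_out
    seq = h.shape[0]
    logits = out[:, :k]
    means = out[:, k:k + k * d_latent].reshape(seq, k, d_latent)
    raw = out[:, k + k * d_latent:].reshape(seq, k, d_latent)
    log_pi = logits - logsumexp(logits, axis=-1, keepdims=True)
    scales = np.logaddexp(0.0, raw) + 1e-4  # softplus keeps scales positive
    return log_pi, means, scales

def gmm_nll(z, log_pi, means, scales):
    """Mean negative log-likelihood of targets z (seq, d_latent)."""
    diff = (z[:, None, :] - means) / scales
    log_comp = -0.5 * np.sum(
        diff**2 + 2.0 * np.log(scales) + np.log(2.0 * np.pi), axis=-1
    )  # (seq, k): log-density of z under each diagonal Gaussian
    return -np.mean(logsumexp(log_pi + log_comp, axis=-1))

def gmm_sample(log_pi, means, scales, rng):
    """Draw one vector per position: pick a component, then sample from it."""
    seq = log_pi.shape[0]
    comp = np.array([rng.choice(k, p=np.exp(lp)) for lp in log_pi])
    idx = np.arange(seq)
    noise = rng.normal(size=(seq, d_latent))
    return means[idx, comp] + scales[idx, comp] * noise

# Toy next-vector prediction: position t predicts the latent at t + 1.
z = rng.normal(size=(5, d_latent))   # stand-in for VAE latents
h = embed(z)                         # stand-in for decoder hidden states
log_pi, means, scales = gmm_params(h[:-1])
loss = gmm_nll(z[1:], log_pi, means, scales)
sample = gmm_sample(log_pi, means, scales, rng)
```

In this sketch the training loss is the negative log-likelihood of the next latent vector under the predicted mixture, which plays the role that cross-entropy over a categorical distribution plays in a discrete-token decoder.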