Graph-structured scene descriptions can be used effectively in generative models to control the composition of the generated image. Previous approaches combine graph convolutional networks for layout prediction with adversarial methods for image generation. In this work, we show that encoding the graph information with multi-head attention and generating images with a transformer-based model in the latent space improves the quality of the sampled data without requiring adversarial models, which in turn benefits training stability. Specifically, the proposed approach is based entirely on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach improves image quality with respect to state-of-the-art methods and yields a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies to assess the impact of each contribution. Code is available at https://github.com/perceivelab/trf-sg2im
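To make the role of the vector-quantized latent space concrete, the following is a minimal sketch of the generic quantization step used by vector-quantized autoencoders: each continuous latent vector is replaced by the index of its nearest codebook entry, and a transformer can then model the resulting sequence of discrete indices. This is not taken from the released code; the function name `quantize`, the toy codebook, and the latent values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each latent vector, the index of its nearest codebook entry.

    Nearness is measured by squared Euclidean distance, as in the
    standard vector-quantization formulation.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


# Hypothetical toy values: a codebook with three 2-D entries and three
# encoder outputs, each lying close to a different entry.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices = quantize(latents, codebook)
print(indices)
```

In a real model the codebook is learned jointly with the encoder and decoder and the lookup is batched over a spatial grid of latents; the sketch only shows the nearest-neighbour assignment that turns an image into a sequence of discrete tokens.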