In recent years, with their realistic generation results and a wide range of personalized applications, diffusion-based generative models have gained considerable attention in both the visual and audio generation domains. Compared to the rapid progress of text-to-image and text-to-audio generation, research on audio-to-visual and visual-to-audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing yet another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image-to-audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN (VQGAN) token space and is trained in a mask-denoising manner. After training, classifier-free guidance can be deployed off the shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality-symmetric, it can also be directly deployed for audio-to-image generation and co-generation. In our experiments, we show that this simple method surpasses recent image-to-audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/
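To make the mask-denoising training concrete, below is a minimal sketch of one training step over concatenated image and audio VQGAN tokens, in the style of MaskGIT-like masked token modeling. All names (`TokenTransformer`, the codebook size, the sequence lengths, the cosine mask schedule) are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024                  # shared VQGAN codebook size (assumed)
MASK_ID = VOCAB               # extra token id used as the [MASK] symbol
SEQ_IMG, SEQ_AUD = 256, 256   # image / audio token-grid lengths (assumed)

class TokenTransformer(nn.Module):
    """Bidirectional transformer over concatenated image+audio tokens."""
    def __init__(self, vocab=VOCAB + 1, dim=512, depth=8, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(SEQ_IMG + SEQ_AUD, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, VOCAB)  # predict real codebook ids only

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos
        return self.head(self.encoder(h))

def mask_denoise_step(model, img_tokens, aud_tokens):
    """One image-to-audio training step: mask a random fraction of the
    audio tokens and predict the originals, conditioned on the image."""
    tokens = torch.cat([img_tokens, aud_tokens], dim=1)
    ratio = torch.cos(torch.rand(1) * torch.pi / 2)   # cosine mask schedule
    mask = torch.rand_like(aud_tokens, dtype=torch.float) < ratio
    corrupted = tokens.clone()
    corrupted[:, SEQ_IMG:][mask] = MASK_ID            # corrupt audio side
    logits = model(corrupted)[:, SEQ_IMG:]            # audio positions only
    return F.cross_entropy(logits[mask], aud_tokens[mask])
```

Because the transformer is bidirectional over the joint token sequence, the same step with the roles of the two modalities swapped (or with both sides partially masked) yields the audio-to-image and co-generation objectives mentioned in the abstract.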
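The claim that classifier-free guidance works off the shelf can likewise be illustrated with the standard logit-combination rule used for discrete-token generators: run the model once with the condition and once with the condition masked out, then extrapolate between the two. A minimal sketch, reusing the assumed names from the training snippet above (the guidance scale `w` is also an illustrative choice):

```python
@torch.no_grad()
def guided_logits(model, img_tokens, partial_aud_tokens, w=3.0):
    """Classifier-free guidance for one sampling step: combine conditional
    and unconditional logits over the (partially masked) audio tokens."""
    cond = torch.cat([img_tokens, partial_aud_tokens], dim=1)
    uncond = cond.clone()
    uncond[:, :SEQ_IMG] = MASK_ID             # drop the image condition
    logit_c = model(cond)[:, SEQ_IMG:]
    logit_u = model(uncond)[:, SEQ_IMG:]
    return (1 + w) * logit_c - w * logit_u    # push toward the condition
```

Nothing here requires retraining: the unconditional branch is obtained simply by replacing the conditioning tokens with the mask symbol the model already saw during mask-denoising training.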